In this lab assignment, make sure you work in order of the code chunks, and knit after you complete each code chunk.
Consider the dataset Wages1
from the Ecdat
package.
## exper sex school wage
## 1 9 female 13 6.315296
## 2 12 female 12 5.479770
## 3 11 female 11 3.642170
## 4 9 female 14 4.593337
## 5 8 female 14 2.418157
## 6 9 female 14 2.094058
This observational dataset records the years experienced, the years schooled, the sex, and the hourly wage for 3,294 workers in 1987. A Guide to Modern Econometrics by Marno Verbeek utilizes this data in a linear regression context. According to Marno Verbeek, this data is a subsample from the US National Longitudinal Study. The purpose of this tutorial is to practice the creative process in exploratory data analysis of asking questions and then investigating those questions using visuals and statistical summaries.
As a member of the birth class of 1988, I do not have any clue of
what the workforce looked like in 1987. It is your job to apply your
detective skills to the information hidden in this data. For future use,
utilize the modified datasetwage
according to the R code
below:
wage=as.tibble(Wages1) %>%
rename(experience=exper) %>%
arrange(school)
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` instead.
## ℹ The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
head(wage)
## # A tibble: 6 × 4
## experience sex school wage
## <int> <fct> <int> <dbl>
## 1 18 male 3 5.52
## 2 15 male 4 3.56
## 3 18 male 4 9.10
## 4 10 female 5 0.603
## 5 11 male 5 3.80
## 6 14 male 5 7.50
First, use geom_bar()
to investigate the distribution of
level of experience found in wage
.
#
Use group_by(experience)
along with the pipe
%>%
to output the most common amount of years of
experience along with the number of occurrences found in the data. The
most common value for years of experience is _____ and occurs _____
times. Fill in the blanks with the correct answers, and change
eval=True
to eval=False
and print out the
output that led you to your answer.
wage %>%
group_by(experience) %>%
COMPLETE
First, use geom_bar()
to visualize the overall
distribution of level of schooling found in the data.
#
Next, modify the code in Question 1.2 to display the maximum level of schooling and the number of workers in the data that had that number of schooling. The maximum number of years in school was ____ years which occurred _____ times in our sample. Fill in the blanks with the correct answers.
#
Use geom_point()
to display a scatter plot representing
the relationship between these two discrete numeric variables. Consider
using alpha=0.1
to indicate where the relationship is
represented the best.
The years of experience seem to _____ (increase/decrease) as the years of schooling increases. Is this what you expected to see? ____ (yes/no).
#
Question: Practically, what reasons do you hypothesize for this observed relationship? Write your answer in complete sentences below:
Use geom_freqpoly()
to compare the distribution of wage
of females to the distribution of wage of males.
#
Question: Where do these distributions look the same and/or where do they differ? Write your answer in complete sentences below:
Use group_by()
along with summarize to report the mean
wage
, standard error of wage
, and 95%
confidence interval for the unknown population mean hourly wage for the
various levels of sex
. The standard error is equal to the
standard deviation divided by the square root of the sample size. The
95% confidence interval is approximated by obtaining the lower and upper
bound of an interval within 2 standard errors of the sample mean. Based
on the confidence limits, do we have statistical evidence to say that
the average hourly wage for men was different than the average hourly
wage for women in 1987? ______ (yes/no).
wage %>%
group_by(sex) %>%
summarize(n=n(),mean= COMPLETE,se=COMPLETE,
lb=COMPLETE,ub=COMPLETE)
Question: How would you explain your answer in terms of the confidence intervals that are constructed below? Write your answer in complete sentences below:
wage %>%
group_by(sex) %>%
summarize(n=n(),mean=mean(wage),se=sd(wage)/sqrt(n),
lb=mean-2*se,ub=mean+2*se)
## # A tibble: 2 × 6
## sex n mean se lb ub
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 female 1569 5.15 0.0726 5.00 5.29
## 2 male 1725 6.31 0.0842 6.14 6.48
Use geom_point()
along with the option
color=sex
to overlay scatter plots. Does there seem to be a
clear distinction between female and male regarding this relationship?
______ (yes/no).
#
Repeat the graphic created in Question 4 replacing
x=experience
with x=school
. Does there seem to
be a clear distinction between female and male regarding this
relationship? ______ (yes/no).
#
The graphic below summarizes the average hourly wage for the
different combinations of schooling and experience level. The additional
facet_grid(~sex)
makes comparing the relationship of the
three key numeric variables between the sexes quite easy.
wage %>%
group_by(experience,school,sex) %>%
summarize(n=n(),mean=mean(wage)) %>%
ungroup() %>%
ggplot() +
geom_tile(aes(x=experience,y=school,fill=mean)) +
scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
facet_grid(~sex) + theme_dark()
## `summarise()` has grouped output by 'experience', 'school'. You can override
## using the `.groups` argument.
Question: What are some differences between the sexes regarding this relationship that are apparent in this chart? Write your answer in complete sentences below:
The next figure is similar to the previous one except that the tile color reflects the standard deviation of wage rather than the mean. Interactions of experience and school levels containing less than or equal to 10 instances are ignored in this image.
wage %>%
group_by(experience,school,sex) %>%
summarize(n=n(),sd=sd(wage)) %>%
ungroup() %>%
filter(n>10) %>%
ggplot() +
geom_tile(aes(x=experience,y=school,fill=sd)) +
scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
facet_grid(~sex) + theme_dark()
## `summarise()` has grouped output by 'experience', 'school'. You can override
## using the `.groups` argument.
Question: Which plot is generally darker and what does that imply? Write your answer in complete sentences below:
Question: Specifically for the scenario where a worker has 5 years of experience and 11 years of schooling, what does the extreme contrast between female and male cells imply for this figure? Write your answer in complete sentences below:
Baby Mario once had a dream to drop math knowledge on high school students. High schools are evaluated based on student achievement on standardized test scores. The self-proclaimed king, Lebron James, intends to donate $5 million to the district and funds will be distributed according to the improvement of these high schools in the area of mathematics. Because of the king’s proclamation, the bureaucratic district heads have pushed high schools to redirect all loose change to mathematics. Teachers of english, science, history, and other subjects have been pressured to dedicate portions of class time to math education. Teachers of music and the arts have been released because the king does not have time for that nonsense. The dreams of these children along with the king’s talents have been taken to Hollywood.
The Cleveland school district is made up of 20 high schools uniquely identified by numbers 1 to 20. Within each high school, a random sample of 20 students are selected to participate in this study and are uniquely identified by numbers 1 to 20. All 400 students selected are assessed on their mathematics skills based on district designed standardized tests in the years 2017 and 2018. The scores and corresponding percentiles of these selected students for both years are simulated in the following R code.
school.id=rep(1:20,each=20*2)
student.id=rep(rep(1:20,each=2),20)
type=rep(c("Score","Percentile"),20*20)
score2017=round(rnorm(20*20,mean=50,sd=10),0)
percentile2017=round(100*pnorm(score2017,mean=mean(score2017),sd=sd(score2017)),0)
score2018=round(rnorm(20*20,mean=75,sd=4),0)
percentile2018=round(100*pnorm(score2018,mean=mean(score2018),sd=sd(score2018)),0)
value2017=c(rbind(score2017,percentile2017))
value2018=c(rbind(score2018,percentile2018))
untidy.school = tibble(
school=school.id,
student=student.id,
type=type,
value2017=value2017,
value2018=value2018) %>%
filter(!(school==1 & student==4)) %>% filter(!(school==12 & student==18)) %>%
mutate(value2018=ifelse((school==1 & student==3)|(school==15 & student==18)|
(school==5 & student==12),NA,value2018))
Below is an HTML table generated using the xtable
package in R. For more information regarding this package, see the xtable
gallery. The R code in the code chunk converts an R data frame
object to an HTML table. HTML table attributes can be specified within
the function print()
. The code chunk option
echo=F
prevents the code from showing and the option
results="asis"
ensures that the resulting HTML table is
displayed when knitted to HTML. The table provides a preview of the
first 10 rows of the simulated data. Knit the document to see the HTML
table below.
school | student | type | value2017 | value2018 |
---|---|---|---|---|
1 | 1 | Score | 37 | 77 |
1 | 1 | Percentile | 12 | 69 |
1 | 2 | Score | 55 | 81 |
1 | 2 | Percentile | 72 | 93 |
1 | 3 | Score | 35 | |
1 | 3 | Percentile | 8 | |
1 | 5 | Score | 54 | 74 |
1 | 5 | Percentile | 69 | 41 |
1 | 6 | Score | 46 | 79 |
1 | 6 | Percentile | 38 | 84 |
Along with the schools’ inability to appropriately drop math
knowledge on these kids is their inability to record data in a format
that is immediately usable. Using our understanding of the
tidyr
package, we can easily convert this table into a form
that is useful for data analysis.
The variable school
uniquely identifies the school, but
the variable student
only uniquely identifies the student
within the school. The problem is best illustrated by the
filter()
function in dplyr
.
untidy.school %>% filter(student==1) %>% head(4)
## # A tibble: 4 × 5
## school student type value2017 value2018
## <int> <int> <chr> <dbl> <dbl>
## 1 1 1 Score 37 77
## 2 1 1 Percentile 12 69
## 3 2 1 Score 60 75
## 4 2 1 Percentile 86 51
The subsetted table contains scores and percentiles for two
completely different children identified by student==1
. We
need to create a unique identifier for each student in the Cleveland
school district. The unite()
function can be utilized to
create a new variable called CID
by concatenating the
identifiers for school
and student
. We want
CID
to follow the general form SCHOOL.STUDENT. For
example, the CID
for the first student in the table above
would be “1.1”.
Create a new tibble called untidy2.school
that fixes
this problem without dropping the original variables school
or student
. Read the documentation for unite()
either by searching on google or using ?unite
to prevent
the loss of original variables in the creation of a new variable.
untidy2.school = untidy.school %>%
unite(COMPLETE)
glimpse(untidy2.school) #Do Not Change Lines with the glimpse Function
The variables value2017
and value2018
contain the scores and percentiles for two different years. In a new
tibble called untidy3.school
, based on
untidy2.school
, we want to create a new variable called
Year
and a new variable called Value
that
display the year and the result from that year, respectively. The
variable Year
should be a numeric vector containing either
2017 or 2018. The most efficient way to modify the
data in this manner is to start by renaming value2017
and
value2018
to nonsynctactic names 2017
and
2018
. Remember that you need to surround nonsyncactic names
with backticks to achieve this result.
untidy3.school = untidy2.school %>%
rename(NEWVAR=OLDVAR,NEWVAR=OLDVAR) %>%
gather(`2017`:`2018`,COMPLETE)
glimpse(untidy3.school)
The variable type
in untidy3.school
indicates that two completely different variables are contained in the
recently created variable called Value
. Both the scores and
percentiles of students are contained in Value
. Using the
function spread()
we can create two new variables,
Score
and Percentile
, that display the
information contained in Value
in separate columns. Using
untidy3.school
, create a new tibble called
tidy.school
that accomplishes these tasks.
tidy.school = untidy3.school %>%
spread(COMPLETE)
glimpse(tidy.school)
The original data contains explicitly missing and implicitly missing
values. Instances of both can be visibly seen in the first ten
observations. Below is a table showing the first 10 observations in the
cleaned dataset we called tidy.school
. To appropriately,
view this we have to sort our observations by school
and
student
as seen in the original dataset
untidy.school
. Knit the document to see the HTML table
below.
Based on the table above, you can see that student 3 from school 1 has a missing score and percentile for the year 2018. This is an example of explicitly missing information. In logitudinal studies where measures are taken on individuals over a fixed time period, this is a common occurrence. Hypothesize a reason for this scenario to happen.
Based on the table above, you can see that student 4 from school 1 is clearly missing scores and percentiles from both years 2017 and 2018. This is an example of implicitly missing information. This is a less common occurrence, but try to hypothesize a reason for this scenario to happen.
Use the complete()
function to convert all implicitly
missing to explicitly missing. Create a new table called
tidy2.school
that reports missing values as NA
for all combinations of school, student, and year.
tidy2.school=tidy.school %>%
complete(VARIABLES)
The first 10 rows of tidy2.school
are displayed
below.
tab.tidy2.school = tidy2.school %>%
head(10) %>%
xtable(digits=0,align="ccccccc")
print(tab.tidy2.school,type="html",include.rownames=F,
html.table.attributes="align='center',
rules='rows',
width=50%,
frame='hsides',
border-spacing=5px"
)
If you inspect the first 10 rows of tidy2.school
, you
should see that the variable CID
is missing for student
4 from school 1 even though we know that this students
unique district ID should be “1.4”. Using the pipe
%>%
, combine all previous statements in an order where
this will not occur. Create a tibble named
final.tidy.school
using a chain of commands that begins
with calling the original tibble untidy.school
final.tidy.school = untidy.school %>%
MORE %>%
MORE %>%
...
glimpse(final.tidy.school)
The figure below uses boxplots to show the distribution of scores in the 20 schools for the years 2017 and 2018.
ggplot(final.tidy.school) +
geom_boxplot(aes(x=as.factor(Year),y=Score,fill=as.factor(school))) +
guides(fill=F)+
theme_minimal()
Question: If you were the district superintendent, how would you distribute the King’s ransom to these schools? In other words, how should Lebron James distribute the money to the schools. Write your answer in complete sentences below:
Using different colors for each student, the next two pictures show the change in test scores and percentiles for all students (without missing values) sampled from the district. Both of these pictures are necessary in understanding the improvement in mathematical knowledge on the student level. As you can see, they are very different from each other. Hypothesize an educational strategy the district might have taken that would have caused this phenomenon to occur.
ggplot(final.tidy.school) +
geom_line(aes(x=Year,y=Score,color=as.factor(CID))) +
guides(color=F) +
scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
theme_minimal()
ggplot(final.tidy.school) +
geom_line(aes(x=Year,y=Percentile,color=as.factor(CID))) +
guides(color=F) +
scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
theme_minimal()
Question: Do you believe the students actually improved in their math knowledge? Why or why not? The schools knew that they were going to get money based off the improvement of the math scores. What do we learn about the grades of the students when compared to each other. Think about this and write your answer in a well-written paragraph below: