Introduction

Baby Mario once had a dream to drop math knowledge on high school students. High schools are evaluated based on student achievement on standardized test scores. The self-proclaimed king, Lebron James, intends to donate $5 million to the district and funds will be distributed according to the improvement of these high schools in the area of mathematics. Because of the king’s proclamation, the bureaucratic district heads have pushed high schools to redirect all loose change to mathematics. Teachers of english, science, history, and other subjects have been pressured to dedicate portions of class time to math education. Teachers of music and the arts have been released because the king does not have time for that nonsense. The dreams of these children along with the king’s talents have been taken to Hollywood.

The Cleveland school district is made up of 20 high schools uniquely identified by numbers 1 to 20. Within each high school, a random sample of 20 students are selected to participate in this study and are uniquely identified by numbers 1 to 20. All 400 students selected are assessed on their mathematics skills based on district designed standardized tests in the years 2017 and 2018. The scores and corresponding percentiles of these selected students for both years are simulated in the following R code.

school.id=rep(1:20,each=20*2)
student.id=rep(rep(1:20,each=2),20)
type=rep(c("Score","Percentile"),20*20)
score2017=round(rnorm(20*20,mean=50,sd=10),0)
percentile2017=round(100*pnorm(score2017,mean=mean(score2017),sd=sd(score2017)),0)
score2018=round(rnorm(20*20,mean=75,sd=4),0)
percentile2018=round(100*pnorm(score2018,mean=mean(score2018),sd=sd(score2018)),0)
value2017=c(rbind(score2017,percentile2017))
value2018=c(rbind(score2018,percentile2018))

untidy.school = tibble(
                  school=school.id,
                  student=student.id,
                  type=type,
                  value2017=value2017,
                  value2018=value2018) %>% 
                filter(!(school==1 & student==4)) %>% filter(!(school==12 & student==18)) %>%
                mutate(value2018=ifelse((school==1 & student==3)|(school==15 & student==18)|
                                          (school==5 & student==12),NA,value2018))

Below is an HTML table generated using the xtable package in R. For more information regarding this package, see the xtable gallery. The R code in the code chunk converts an R data frame object to an HTML table. HTML table attributes can be specified within the function print(). The code chunk option echo=F prevents the code from showing and the option results="asis" ensures that the resulting HTML table is displayed when knitted to HTML. The table provides a preview of the first 10 rows of the simulated data. Knit the document to see the HTML table below.

school	student	type	value2017	value2018
1	1	Score	71	76
1	1	Percentile	98	60
1	2	Score	80	75
1	2	Percentile	100	50
1	3	Score	45
1	3	Percentile	34
1	5	Score	45	70
1	5	Percentile	34	11
1	6	Score	51	70
1	6	Percentile	57	11

Along with the schools’ inability to appropriately drop math knowledge on these kids is their inability to record data in a format that is immediately usable. Using our understanding of the tidyr package, we can easily convert this table into a form that is useful for data analysis.

Part 1: Creation of a Unique Student ID

The variable school uniquely identifies the school, but the variable student only uniquely identifies the student within the school. The problem is best illustrated by the filter() function in dplyr.

untidy.school %>% filter(student==1) %>% head(4)

## # A tibble: 4 × 5
##   school student type       value2017 value2018
##    <int>   <int> <chr>          <dbl>     <dbl>
## 1      1       1 Score             71        76
## 2      1       1 Percentile        98        60
## 3      2       1 Score             47        77
## 4      2       1 Percentile        42        69

1.1: Create a unique identifier (1.5 Points)

The subsetted table contains scores and percentiles for two completely different children identified by student==1. We need to create a unique identifier for each student in the Cleveland school district. The unite() function can be utilized to create a new variable called CID by concatenating the identifiers for school and student. We want CID to follow the general form SCHOOL.STUDENT. For example, the CID for the first student in the table above would be “1.1”.

Create a new tibble called untidy2.school that fixes this problem without dropping the original variables school or student. Read the documentation for unite() either by searching on google or using ?unite to prevent the loss of original variables in the creation of a new variable.

untidy2.school = untidy.school %>%
                    unite(COMPLETE)

glimpse(untidy2.school) #Do Not Change Lines with the glimpse Function

Part 2: Gather Variables With Yearly Values

2.1: Create a unique identifier (1.5 Points)

The variables value2017 and value2018 contain the scores and percentiles for two different years. In a new tibble called untidy3.school, based on untidy2.school, we want to create a new variable called Year and a new variable called Value that display the year and the result from that year, respectively. The variable Year should be a numeric vector containing either 2017 or 2018. The most efficient way to modify the data in this manner is to start by renaming value2017 and value2018 to nonsynctactic names 2017 and 2018. Remember that you need to surround nonsyncactic names with backticks to achieve this result.

untidy3.school = untidy2.school %>%
                    rename(NEWVAR=OLDVAR,NEWVAR=OLDVAR) %>%
                    gather(`2017`:`2018`,COMPLETE)
glimpse(untidy3.school)

Part 3: Spread Type of Value Into Multiple Columns

3.1: Create new variables using spread (1 point)

The variable type in untidy3.school indicates that two completely different variables are contained in the recently created variable called Value. Both the scores and percentiles of students are contained in Value. Using the function spread() we can create two new variables, Score and Percentile, that display the information contained in Value in separate columns. Using untidy3.school, create a new tibble called tidy.school that accomplishes these tasks.

tidy.school = untidy3.school %>%
                    spread(COMPLETE) 
glimpse(tidy.school)

Part 4: Missing Data Analysis

The original data contains explicitly missing and implicitly missing values. Instances of both can be visibly seen in the first ten observations. Below is a table showing the first 10 observations in the cleaned dataset we called tidy.school. To appropriately, view this we have to sort our observations by school and student as seen in the original dataset untidy.school. Knit the document to see the HTML table below.

Based on the table above, you can see that student 3 from school 1 has a missing score and percentile for the year 2018. This is an example of explicitly missing information. In logitudinal studies where measures are taken on individuals over a fixed time period, this is a common occurrence. Hypothesize a reason for this scenario to happen.

Based on the table above, you can see that student 4 from school 1 is clearly missing scores and percentiles from both years 2017 and 2018. This is an example of implicitly missing information. This is a less common occurrence, but try to hypothesize a reason for this scenario to happen.

4.1 Convert Implicitly Missing to Explicitly Missing (1 Point)

Use the complete() function to convert all implicitly missing to explicitly missing. Create a new table called tidy2.school that reports missing values as NA for all combinations of school, student, and year.

tidy2.school=tidy.school %>%
  complete(VARIABLES)

The first 10 rows of tidy2.school are displayed below.

tab.tidy2.school = tidy2.school %>%
  head(10) %>%
  xtable(digits=0,align="ccccccc")

print(tab.tidy2.school,type="html",include.rownames=F,
      html.table.attributes="align='center',
      rules='rows',
      width=50%,
      frame='hsides',
      border-spacing=5px"
)

4.2 Fixing the Data Using the Pipe (2 Points)

If you inspect the first 10 rows of tidy2.school, you should see that the variable CID is missing for student 4 from school 1 even though we know that this students unique district ID should be “1.4”. Using the pipe %>%, combine all previous statements in an order where this will not occur. Create a tibble named final.tidy.school using a chain of commands that begins with calling the original tibble untidy.school

final.tidy.school = untidy.school %>%
                      MORE %>%
                      MORE %>%
                      ...

glimpse(final.tidy.school)

Part 5: Summarizing Figures

5.1: Follow-up question about money distribution (1 Point)

The figure below uses boxplots to show the distribution of scores in the 20 schools for the years 2017 and 2018.

ggplot(final.tidy.school) +
  geom_boxplot(aes(x=as.factor(Year),y=Score,fill=as.factor(school))) + 
  guides(fill=F)+
  theme_minimal()

Question: If you were the district superintendent, how would you distribute the King’s ransom to these schools? In other words, how should Lebron James distribute the money to the schools. Write your answer in complete sentences below:

5.2: Follow-up question about inspecting the improvement (2 Points)

Using different colors for each student, the next two pictures show the change in test scores and percentiles for all students (without missing values) sampled from the district. Both of these pictures are necessary in understanding the improvement in mathematical knowledge on the student level. As you can see, they are very different from each other. Hypothesize an educational strategy the district might have taken that would have caused this phenomenon to occur.

ggplot(final.tidy.school) + 
  geom_line(aes(x=Year,y=Score,color=as.factor(CID))) +
  guides(color=F) +
  scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
  theme_minimal()

ggplot(final.tidy.school) + 
  geom_line(aes(x=Year,y=Percentile,color=as.factor(CID))) +
  guides(color=F) +
  scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
  theme_minimal()

Question: Do you believe the students actually improved in their math knowledge? Why or why not? The schools knew that they were going to get money based off the improvement of the math scores. What do we learn about the grades of the students when compared to each other. Think about this and write your answer in a well-written paragraph below:

Lab 4: Tidy Data

Mario Giacomazzo

February 11, 2025