Introduction

The main purpose of this lab is to practice data join skills from Chapter 10. The functions and their purposes are listed as follows:

You will need to modify the code chunks so that the code works within each of chunk (usually this means modifying anything in ALL CAPS). You will also need to modify the code outside the code chunk. When you get the desired result for each step, change Eval=F to Eval=T and knit the document to HTML to make sure it works. After you complete the lab, you should submit your HTML file of what you have completed to Canvas before the deadline.

Exercises

Part 1

In part 1, you will practice the skills using the datasets from the R package Lahman. This database includes data related to baseball teams. It includes summary statistics about how the players performed on offense and defense for several years. It also includes personal information about the players.

The Batting data frame contains the offensive statistics for all players for many years. You can see, for example, the top 10 hitters in 2016 by running this code: (For more details of the dataset run ?Batting in console.)

top <- Batting %>% 
  filter(yearID == 2016) %>%
  arrange(desc(HR)) %>%
  slice(1:10)

top #Do Not Remove
##     playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS BB
## 1  trumbma01   2016     1    BAL   AL 159 613  94 157  27   1 47 108  2  0 51
## 2   cruzne02   2016     1    SEA   AL 155 589  96 169  27   1 43 105  2  0 62
## 3  daviskh01   2016     1    OAK   AL 150 555  85 137  24   2 42 102  1  2 42
## 4  doziebr01   2016     1    MIN   AL 155 615 104 165  35   5 42  99 18  2 61
## 5  encared01   2016     1    TOR   AL 160 601  99 158  34   0 42 127  2  0 87
## 6  arenano01   2016     1    COL   NL 160 618 116 182  35   6 41 133  2  3 68
## 7  cartech02   2016     1    MIL   NL 160 549  84 122  27   1 41  94  3  1 76
## 8  frazito01   2016     1    CHA   AL 158 590  89 133  21   0 40  98 15  5 64
## 9  bryankr01   2016     1    CHN   NL 155 603 121 176  35   3 39 102  8  5 75
## 10  canoro01   2016     1    SEA   AL 161 655 107 195  33   2 39 103  0  1 47
##     SO IBB HBP SH SF GIDP
## 1  170   1   3  0  0   14
## 2  159   5   9  0  7   15
## 3  166   0   8  0  5   19
## 4  138   6   8  2  5   12
## 5  138   3   5  0  8   22
## 6  103  10   2  0  8   17
## 7  206   1   9  0 10   18
## 8  163   1   4  1  7   11
## 9  154   5  18  0  3    3
## 10 100   8   8  0  5   18

But who are these players? We see an ID, but not the names. The player names are in this table named People.

head(People,5)
##    playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
## 1 aardsda01      1981         12       27          USA         CO     Denver
## 2 aaronha01      1934          2        5          USA         AL     Mobile
## 3 aaronto01      1939          8        5          USA         AL     Mobile
## 4  aasedo01      1954          9        8          USA         CA     Orange
## 5  abadan01      1972          8       25          USA         FL Palm Beach
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1        NA         NA       NA         <NA>       <NA>      <NA>     David
## 2      2021          1       22          USA         GA   Atlanta      Hank
## 3      1984          8       16          USA         GA   Atlanta    Tommie
## 4        NA         NA       NA         <NA>       <NA>      <NA>       Don
## 5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
##   nameLast      nameGiven weight height bats throws      debut  finalGame
## 1  Aardsma    David Allan    215     75    R      R 2004-04-06 2015-08-23
## 2    Aaron    Henry Louis    180     72    R      R 1954-04-13 1976-10-03
## 3    Aaron     Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
## 4     Aase Donald William    190     75    R      R 1977-07-26 1990-10-03
## 5     Abad  Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
##    retroID   bbrefID  deathDate  birthDate
## 1 aardd001 aardsda01       <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001  aasedo01       <NA> 1954-09-08
## 5 abada001  abadan01       <NA> 1972-08-25

We can see column names nameFirst and nameLast in table People.

1.1 (1 Point)

Use the left_join function to create a data frame called top1, which contains information of the 10 top home run hitters. The table should have the following columns: playerID, nameFirst, nameLast, and number of home runs (HR).

top1 = top %>% 
  left_join(DATA, by = KEY) %>%
  select(VARIABLES)

top1 #Do Not Remove

1.2 (2 Points)

Data Salaries contains the baseball player salary data.

head(Salaries,5)
##   yearID teamID lgID  playerID salary
## 1   1985    ATL   NL barkele01 870000
## 2   1985    ATL   NL bedrost01 550000
## 3   1985    ATL   NL benedbr01 545000
## 4   1985    ATL   NL  campri01 633333
## 5   1985    ATL   NL ceronri01 625000

You may be curious about the salaries of the top 10 hitters in 2016:

  • Now create a new data frame called top2 by adding top 10 hitters’ salaries to top1 and including only nameFirst, nameLast, teamID, HR, and salary columns.
  • Rename the columns to FirstName, LastName, Team, Homeruns and Salary respectively.
  • Arrange the data frame by Salary in descending order.

Note that salaries are different every year so make sure to filter for the year 2016. This time, only use right_join to complete the exercise.

top2 = DATA %>%
  filter(CONDITION) %>%
  right_join(DATA,by=KEY) %>%
  select(SELECT_AND_RENAME) %>%
  arrange(desc(VAR))

top2   #Do Not Remove

Part 2

In this part, we will explore relational data from nycflights13, which contains four data frames related to the flights table that you used in previous assignments.

2.1 (2 Points)

Data airports gives information about each airport, such as latitude and longitude, identified by the faa airport code.

head(airports,5)
## # A tibble: 5 × 8
##   faa   name                            lat   lon   alt    tz dst   tzone       
##   <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr> <chr>       
## 1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A     America/New…
## 2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A     America/Chi…
## 3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A     America/Chi…
## 4 06N   Randall Airport                41.4 -74.4   523    -5 A     America/New…
## 5 09J   Jekyll Island Airport          31.1 -81.4    11    -5 A     America/New…

Based on flights, compute the average arrival delay by destination (dest) and ignore missing values, then join on the airports data frame. Only use an inner join.

delay = DATA %>%
  group_by(VAR) %>%
  summarise(avg_arr_delay=COMPUTATION,.groups='drop') %>%
  inner_join(DATA,by=KEY) %>%
  select(VARS)
delay #Do Not Remove

2.2 (1 Points)

Draw a scatterplot with dots representing destination locations and colors of dots representing average arrival delay on US map. We do this to look at the spatial distribution of average arrival delay.

delay %>%
  ggplot(aes(SPECIFY_AESTHETICS)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

2.3 (1 Points)

Data planes gives information about each plane, identified by its tailnum. Note that year column in planes represents the year a plane was manufactured, which is different from year column in flights.

head(planes,5)
## # A tibble: 5 × 9
##   tailnum  year type               manufacturer model engines seats speed engine
##   <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
## 1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
## 2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
## 3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
## 4 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
## 5 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…

Use the planes data to calculate the age of planes, assuming current year is 2013. Keep only tailnum and age in the output table plane_ages.

plane_ages <- 
  DATA %>%
  mutate(age = FUNCTION) %>%
  select(VARS)
plane_ages #Do Not Change

2.4 (2 Points)

Is there a relationship between the age of a plane and its delays?

  • Join the plane_ages with flights, keeping observations with matches in both datasets.
  • Summarize the average departure delay by plane age and ignore missing values.
  • Draw a scatterplot of plane age vs. average departure delay.
DATA %>%
  inner_join(DATA, by = KEY) %>%
  group_by(VAR) %>%
  summarise(avg_dep_delay = FUNCTION, .groups='drop') %>%
  ggplot(aes(SPECIFY_AESTHETICS)) +
  geom_point()

2.5 (0.5 Points)

What does it mean for a flight to have a missing tailnum?

flights %>%
  filter(is.na(VAR))

Answer: ANSWER_HERE

2.6 (0.5 Points)

What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems. Check the documentation of planes for help.)

DATA %>% 
  anti_join(DATA,by=KEY) %>%
  count(VAR) %>%
  arrange(desc(n))

Answer: ANSWER_HERE