The main purpose of this lab is to practice data join skills from Chapter 10. The functions and their purposes are listed as follows:
inner_join()
Keeps observations appear in both
datasets.
left_join()
Keeps all observations in left
dataset.
right_join()
Keeps all observations in right
dataset.
full_join()
Keeps all observations in both
datasets.
semi_join()
Keeps all observations in left dataset
that have a match in right dataset.
anti_join()1
Drops all observations in left dataset
that have a match in right dataset.
You will need to modify the code chunks so that the code works within
each of chunk (usually this means modifying anything in ALL CAPS). You
will also need to modify the code outside the code chunk. When you get
the desired result for each step, change Eval=F
to
Eval=T
and knit the document to HTML to make sure it works.
After you complete the lab, you should submit your HTML file of what you
have completed to Canvas before the deadline.
In part 1, you will practice the skills using the datasets from the R
package Lahman
. This database includes data related to
baseball teams. It includes summary statistics about how the players
performed on offense and defense for several years. It also includes
personal information about the players.
The Batting
data frame contains the offensive statistics
for all players for many years. You can see, for example, the top 10
hitters in 2016 by running this code: (For more details of the dataset
run ?Batting
in console.)
top <- Batting %>%
filter(yearID == 2016) %>%
arrange(desc(HR)) %>%
slice(1:10)
top #Do Not Remove
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 trumbma01 2016 1 BAL AL 159 613 94 157 27 1 47 108 2 0 51
## 2 cruzne02 2016 1 SEA AL 155 589 96 169 27 1 43 105 2 0 62
## 3 daviskh01 2016 1 OAK AL 150 555 85 137 24 2 42 102 1 2 42
## 4 doziebr01 2016 1 MIN AL 155 615 104 165 35 5 42 99 18 2 61
## 5 encared01 2016 1 TOR AL 160 601 99 158 34 0 42 127 2 0 87
## 6 arenano01 2016 1 COL NL 160 618 116 182 35 6 41 133 2 3 68
## 7 cartech02 2016 1 MIL NL 160 549 84 122 27 1 41 94 3 1 76
## 8 frazito01 2016 1 CHA AL 158 590 89 133 21 0 40 98 15 5 64
## 9 bryankr01 2016 1 CHN NL 155 603 121 176 35 3 39 102 8 5 75
## 10 canoro01 2016 1 SEA AL 161 655 107 195 33 2 39 103 0 1 47
## SO IBB HBP SH SF GIDP
## 1 170 1 3 0 0 14
## 2 159 5 9 0 7 15
## 3 166 0 8 0 5 19
## 4 138 6 8 2 5 12
## 5 138 3 5 0 8 22
## 6 103 10 2 0 8 17
## 7 206 1 9 0 10 18
## 8 163 1 4 1 7 11
## 9 154 5 18 0 3 3
## 10 100 8 8 0 5 18
But who are these players? We see an ID, but not the names. The
player names are in this table named People
.
head(People,5)
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 aardsda01 1981 12 27 USA CO Denver
## 2 aaronha01 1934 2 5 USA AL Mobile
## 3 aaronto01 1939 8 5 USA AL Mobile
## 4 aasedo01 1954 9 8 USA CA Orange
## 5 abadan01 1972 8 25 USA FL Palm Beach
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> David
## 2 2021 1 22 USA GA Atlanta Hank
## 3 1984 8 16 USA GA Atlanta Tommie
## 4 NA NA NA <NA> <NA> <NA> Don
## 5 NA NA NA <NA> <NA> <NA> Andy
## nameLast nameGiven weight height bats throws debut finalGame
## 1 Aardsma David Allan 215 75 R R 2004-04-06 2015-08-23
## 2 Aaron Henry Louis 180 72 R R 1954-04-13 1976-10-03
## 3 Aaron Tommie Lee 190 75 R R 1962-04-10 1971-09-26
## 4 Aase Donald William 190 75 R R 1977-07-26 1990-10-03
## 5 Abad Fausto Andres 184 73 L L 2001-09-10 2006-04-13
## retroID bbrefID deathDate birthDate
## 1 aardd001 aardsda01 <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001 aasedo01 <NA> 1954-09-08
## 5 abada001 abadan01 <NA> 1972-08-25
We can see column names nameFirst
and
nameLast
in table People
.
Use the left_join
function to create a data frame called
top1
, which contains information of the 10 top home run
hitters. The table should have the following columns:
playerID
, nameFirst
, nameLast
,
and number of home runs (HR
).
top1 = top %>%
left_join(DATA, by = KEY) %>%
select(VARIABLES)
top1 #Do Not Remove
Data Salaries
contains the baseball player salary
data.
head(Salaries,5)
## yearID teamID lgID playerID salary
## 1 1985 ATL NL barkele01 870000
## 2 1985 ATL NL bedrost01 550000
## 3 1985 ATL NL benedbr01 545000
## 4 1985 ATL NL campri01 633333
## 5 1985 ATL NL ceronri01 625000
You may be curious about the salaries of the top 10 hitters in 2016:
top2
by adding top
10 hitters’ salaries to top1
and including only
nameFirst
, nameLast
, teamID
,
HR
, and salary
columns.FirstName
, LastName
,
Team
, Homeruns
and Salary
respectively.Salary
in descending
order.Note that salaries are different every year so make sure to filter
for the year 2016. This time, only use right_join
to
complete the exercise.
top2 = DATA %>%
filter(CONDITION) %>%
right_join(DATA,by=KEY) %>%
select(SELECT_AND_RENAME) %>%
arrange(desc(VAR))
top2 #Do Not Remove
In this part, we will explore relational data from
nycflights13
, which contains four data frames related to
the flights
table that you used in previous
assignments.
Data airports
gives information about each airport, such
as latitude and longitude, identified by the faa
airport
code.
head(airports,5)
## # A tibble: 5 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chi…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chi…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New…
Based on flights
, compute the average arrival delay by
destination (dest
) and ignore missing values, then join on
the airports
data frame. Only use an inner join.
delay = DATA %>%
group_by(VAR) %>%
summarise(avg_arr_delay=COMPUTATION,.groups='drop') %>%
inner_join(DATA,by=KEY) %>%
select(VARS)
delay #Do Not Remove
Draw a scatterplot with dots representing destination locations and colors of dots representing average arrival delay on US map. We do this to look at the spatial distribution of average arrival delay.
delay %>%
ggplot(aes(SPECIFY_AESTHETICS)) +
borders("state") +
geom_point() +
coord_quickmap()
Data planes
gives information about each plane,
identified by its tailnum
. Note that year
column in planes
represents the year a plane was
manufactured, which is different from year
column in
flights
.
head(planes,5)
## # A tibble: 5 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
Use the planes
data to calculate the age
of
planes, assuming current year is 2013. Keep only tailnum
and age
in the output table plane_ages
.
plane_ages <-
DATA %>%
mutate(age = FUNCTION) %>%
select(VARS)
plane_ages #Do Not Change
Is there a relationship between the age of a plane and its delays?
plane_ages
with flights
, keeping
observations with matches in both datasets.age
and
ignore missing values.DATA %>%
inner_join(DATA, by = KEY) %>%
group_by(VAR) %>%
summarise(avg_dep_delay = FUNCTION, .groups='drop') %>%
ggplot(aes(SPECIFY_AESTHETICS)) +
geom_point()
What does it mean for a flight to have a missing
tailnum
?
flights %>%
filter(is.na(VAR))
Answer: ANSWER_HERE
What do the tail numbers that don’t have a matching record in planes
have in common? (Hint: one variable explains ~90% of the problems. Check
the documentation of planes
for help.)
DATA %>%
anti_join(DATA,by=KEY) %>%
count(VAR) %>%
arrange(desc(n))
Answer: ANSWER_HERE