Introduction

The main purpose of this tutorial is to put together all 5 key functions from dplyr and use them to create summary tables and delightful graphs in RMarkdown. The functions and their purposes are listed as follows:

filter() Selects Observations Based on Values
arrange() Sorts Observations Based on Criteria
select() or rename() Selects, Deselects, Renames, and Reorders Variables
mutate() or transmute() Creates New Variables Which Were Originally Nonexistant
summarize() Aggregates the Data Based on Helpful Summary Statistics

To further incite your anger towards the airline industry, we will continue to practice our skills using the dataset flights by loading the R package nycflights13. Until we learn how to fly, this dataset remains relevant.

I’ll spread my wings, and learn how to fly. I’ll do what it takes till I touch the sky.

– Kelly Clarkson

Part 1: Summarizing the Data (4 Points)

Using the pipe %>% to chain functions from dplyr, modify the code below to create a dataset named flight.summary that contains the following ordered modfications:

Starts with the raw dataset flights in the package nycflights13.
Transform delay variables so that they are measured in minutes since midnight, and then create a metric to measure accuracy based on those transformed delay variables. Use the geometric distance like seen in lecture.
Remove missing observations.
Group data based on combinations of “origin”, “dest”, and “carrier”.
Summarize the data using the number of observations, the accuracy mean, the accuracy variance, and the mean distance for each combination of “origin”, “dest”, and “carrier”.
Filters summarized data to remove all combinations of “origin”, “dest”, and “carrier” with less than or equal to 10 flights throughout 2013.
Release the grouping constraint using the function ungroup().
Create a variable called “proportion”” that represents the proportion of flights within each combination (with more than 10 flights).

flight.summary = 
  
  #1
  RAWDATA %>%
          
  #2
  mutate(
    dep_min       = (dep_time%/%100)*60+dep_time%%100,
    sched_dep_min = (sched_dep_time%/%100)*60+sched_dep_time%%100,
    arr_min       = (arr_time%/%100)*60+arr_time%%100,
    sched_arr_min = (sched_arr_time%/%100)*60+sched_arr_time%%100,
    dep_delay_min = dep_min-sched_dep_min,
    arr_delay_min = arr_min-sched_arr_min,
    accuracy      = ACCURACY_FORMULA
  ) %>%
  
  #3
  filter(CRITERIA) %>%
  
  #4
  group_by(VARIABLES) %>%

  #5
  summarize(
    count=FUNCTION1,
    mean.acc=FUNCTION2,
    var.acc=FUNCTION3,
    mean.dist=FUNCTION4
  ) %>%
  
  #6
  filter(CRITERIA) %>%
  
  #7
  ungroup() %>%
  
  #8
  mutate(proportion=FORMULA)

head(flight.summary) #DO NOT CHANGE THIS LINE OF CODE

Part 2: Building Charts From Summary Data

2.1 (1 Point)

The purpose of creating this plot was to investigate if flight accuracy gets worse or better for longer flights. The chart above displays the approximated linear relationship between flight accuracy and flight distance for each the 3 NYC airports. This first image was created using a combination of geom_point() and geom_smooth(). Recreate this image using ggplot() and remember to modify the defaults of geom_smooth() to get linear trend lines without standard error regions.

ggplot(data=flight.summary) +
  geom_point(MORE)+
  geom_smooth(MORE)

2.2 (1 Point)

The first thing we notice is that not many flights are over 3000 miles in distance. Design a new plot similar to the one above that ignores the scenarios where the distance exceeds 3000. This will prevent rare occurrences from effecting our linear trend comparison and present these relationships in a zoomed-in window where the majority of data exists.

ggplot(data=filter(flight.summary,CRITERIA)) +
  geom_point(MORE)+
  geom_smooth(MORE)

2.2 (1 Point)

Answer the following question using complete sentences:

Each trend line is doing different things. Which of the three originating airports is doing the worst regarding the relationship between flight accuracy and flight distance? Explain why.

REPLACE THIS WITH YOUR ANSWER

Part 3: Build a Nice Table to Be Displayed in RMarkdown (1 Point)

The dataset we created, flight.summary, summarizes the raw data but still contains 370 observations. Below I have provided code, that produces a dataset called flight.summary2 that only contains the top 5 and bottom 5 combinations of “origin”, “dest”, and “carrier” based on mean accuracy. Closely examine this code and ensure that you understand what is happening here. Set eval=T to create flight.summary2.

flight.summary2 = 
  flight.summary %>%
  mutate(rank=min_rank(mean.acc)) %>%
  filter(min_rank(mean.acc)<=5 | min_rank(desc(mean.acc))<=5) %>%
  arrange(rank)

The kable() function in the knitr package allows for creating HTML tables. Furthermore, the kableExtra package allows you to add beauty to those internet-ready tables. Click here for a helpful introduction. Click here for additional guidance on the options required to turn a tibble into HTML code using kable() and then HTML code into an integrated webpage table.

Once you understand the function of kable() and the modifications you can apply, use the R code chunk below to place the code required to produce an HTML table flight.summary2. After you knit the document, you will see the kable() function turns the flight.summary2 table in R into an HTML table.

If you have a problem with the kable() function, then check out the link here about the library DT and the datatable()function. In my opinion, the default output from the datatable() function is better than kable(). If you have problems with knitr or kable(), you can remove the comment in the first code chunk to run library(DT). You will need to install package DT before this works.

Part 4: Additional Questions (2 Points)

In this class you will be forced to analyze data you probably care very little about. This is the part in the class where you begin to act like you care. An important rule in this class is to “love thy data as thyself.” Consider the flights data from both the view of the airline executive and the customer. Given what you know about the data, write two questions that you would want to answer that would influence how you buy airline tickets.

WRITE A QUESTION HERE THAT YOU COULD ANSWER FROM THE DATA YOU HAVE
WRITE A QUESTION HERE THAT WOULD REQUIRE YOU TO GET ADDITIONAL INFORMATION AND HOW WOULD YOU GO ABOUT GETTING THAT INFORMATION OR WHERE COULD YOU GO TO GET THAT INFORMATION

Lab 3: Advanced Data Transformation

FIRSTNAME LASTNAME

January 31, 2025

Introduction

Part 1: Summarizing the Data (4 Points)

Part 2: Building Charts From Summary Data

2.1 (1 Point)

2.2 (1 Point)

2.2 (1 Point)

Part 3: Build a Nice Table to Be Displayed in RMarkdown (1 Point)

Part 4: Additional Questions (2 Points)