Introduction

In this tutorial, we will learn how to scrape the web for information. We will look at data spread across multiple websites that is worth joining and analyzing. The R package rvest, developed by Hadley Wickham, provides a user-friendly web scraping experience. Combining rvest with a CSS selector tool, such as SelectorGadget, or with your browser's built-in developer tools, makes the entire process more efficient and less painful. This tutorial, prescribed by Dr. Mario, will be guided and completed in stages during class. The resources listed below contain far more information that should be observed and cherished. The truths I will bestow upon you are inspired by many of the works below.

Part 1: Violent Crimes in US Cities

When I was a little boy, my teachers frowned on any information cited from online sources. In the early stages of Wikipedia, instructors considered it unreliable due to its openly editable format. In academia, Wikipedia is rarely found in research citations, but it is often a first stop when investigating almost any topic in any field. A search may not give you all the information you want, but you will often be provided with helpful links to the source of the information. Searching for Wikipedia on Wikipedia provides a historical overview, including its positives and negatives.

Many Wikipedia pages contain information organized in tables. Consider the US Crime Rates table, which provides crime rates per 100,000 people for major metropolitan areas in the United States. Statistics are given for violent crimes, property crimes, and arson. We want to convert the HTML table into an R data frame so we can use this information in future analysis.

Chunk 1: Initial Web Scraping Attempt

After loading the needed packages, we use rvest functions such as read_html() and html_table() to create a data frame called VIOLENT that contains all information read from the desired Wikipedia page.

URL.VIOLENT="https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate#Crime_rates_per_100.2C000_people_.282012.29"
VIOLENT = URL.VIOLENT %>%
  read_html() %>%
  html_table(fill=TRUE) %>%
  .[[1]]

str(VIOLENT)

Chunk 2: Initial Selection of Violent Crimes and Cleaning

You probably noticed an issue with the first row. Why do you think this occurred? Carefully examine the webpage table to identify the cause. Also, we only want the violent crime statistics for each city and state. Which columns do we need to keep?

VIOLENT2=VIOLENT[-(1:2),1:8]  # drop the leftover subheading rows; keep the first eight columns (violent crime info)
colnames(VIOLENT2)=c("State","City","Population","Total","Murder","Rape","Robbery","Assault")
str(VIOLENT2)

Chunk 3: Fixing Variable Types

The subheadings in the Wikipedia table create problems when the HTML is parsed into a data frame: variables that should be numeric are instead stored as character (categorical) variables. We need functions both inside and outside the tidyverse to fix this issue.
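
To see why Population needs extra care before conversion, note that as.numeric() cannot parse digits grouped with commas (the value below is only an illustration):

as.numeric("8,336,817")  # returns NA with a coercion warning
as.numeric("8336817")    # returns 8336817 as expected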

VIOLENT3=VIOLENT2 %>%
            mutate(Population=str_replace_all(Population,pattern=",",replacement="")) %>%  # strip thousands separators
            mutate_at(3:8,as.numeric)  # convert columns 3 through 8 (Population through Assault) to numeric
str(VIOLENT3)

Chunk 4: Identifying Problematic Cases in City and State

Using str_detect(), we can identify cases in City where unwanted characters (commas, parentheses, and the digits 0 to 9) make an appearance.

VIOLENT3[str_detect(VIOLENT3$City,"[,(0-9)]{1}"),]$City
VIOLENT3[str_detect(VIOLENT3$City,"[:digit:]"),]$City
VIOLENT3[str_detect(VIOLENT3$State,"[,(0-9)]{1}"),]$State
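
As a toy illustration (these strings are made up for demonstration), the pattern flags any value containing at least one of the unwanted characters:

str_detect(c("Mobile3", "New York", "Anchorage1,2"), "[,(0-9)]")
# [1]  TRUE FALSE  TRUE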

Chunk 5: Cleaning Problematic Cities

The unwanted characters (commas, parentheses, and digits) are special, but not our friends. In Wikipedia, they were used to indicate footnote information. We need to efficiently fix these occurrences by banishing them to the abyss. We see them in both cities and states. Also, there are cities with unusual names that may be problematic when joining with additional data.

VIOLENT4 = VIOLENT3 %>%
              mutate(City=str_replace_all(City,"[,(0-9)]","")) %>%   # banish the footnote markers from City
              mutate(State=str_replace_all(State,"[,(0-9)]",""))     # and from State

VIOLENT5 = VIOLENT4 %>% 
              mutate(City=ifelse(City=="Charlotte-Mecklenburg","Charlotte",City))  # use the common city name

VIOLENT6 = VIOLENT5 %>%
              mutate(City=str_replace(City," Metro",""))  # drop the " Metro" suffix

write_csv(VIOLENT6,"FINAL_VIOLENT.CSV")
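
As an optional sanity check, rerunning the detection from Chunk 4 on VIOLENT6 should now come back empty:

VIOLENT6[str_detect(VIOLENT6$City,"[,(0-9)]"),]$City    # expect character(0)
VIOLENT6[str_detect(VIOLENT6$State,"[,(0-9)]"),]$State  # expect character(0)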

Part 2: Geographical Locations of US Cities

The noncensus package in R contains preloaded datasets useful for population studies. Imagine if we wanted to plot the crime information from VIOLENT6 on a US map. What information would we need? A quick map sketch appears after Chunk 2 below.

Chunk 1: Read in ZIP Data

The zip_codes dataset provided in the noncensus package gives us important geographical information for all US cities.

library(noncensus)  # provides the zip_codes dataset

data(zip_codes)
ZIP=zip_codes
str(ZIP)

Chunk 2: Summarize for Joining

Many US metropolitan areas cover multiple zip codes. Using group_by() and summarize(), we can obtain the average latitude and longitude for each combination of city and state.

ZIP2 = ZIP %>%
        group_by(city,state) %>%                               # one group per city-state combination
        summarize(lat=mean(latitude),lon=mean(longitude)) %>%  # average position across its zip codes
        ungroup()

write_csv(ZIP2,"FINAL_ZIP.CSV")
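
To preview why we want these coordinates, here is a minimal map sketch. It assumes the maps package is installed (it supplies the state outlines that map_data() hands to ggplot2); the longitude/latitude filter simply restricts the view to the contiguous states:

library(maps)  # assumed installed; provides outlines for map_data()

US = map_data("state")  # polygons for the lower 48 states
ZIP2 %>%
  filter(lon > -125, lon < -66, lat > 24, lat < 50) %>%  # keep the contiguous US
  ggplot() +
  geom_polygon(data=US, aes(x=long, y=lat, group=group), fill="gray90", color="white") +
  geom_point(aes(x=lon, y=lat), size=0.1, alpha=0.2) +
  coord_quickmap()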

Chunk 3: Assessing Future Join Issues

Run the code below. What makes merging ZIP2 with VIOLENT6 problematic?

head(VIOLENT6[,1:4],3)
head(ZIP2,3)
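
If you want to confirm the problem directly, here is a quick sketch: joining on the raw columns matches nothing, because VIOLENT6$State holds full names ("Alabama") while ZIP2$state holds two-letter abbreviations ("AL"), and the column names differ in case as well.

VIOLENT6 %>%
  inner_join(ZIP2, by=c("State"="state","City"="city")) %>%  # full names vs. abbreviations
  nrow()  # expect 0 matching rows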

Part 3: Linking State Names to State Abbreviations

Using your favorite search engine, you can look up the government-established abbreviations for each state in the US of A. An HTML table found at https://state.1keydata.com/state-abbreviations.php makes this connection for us. However, the way this data is presented is not in the tidy spirit. Soap can't clean this, unless it is the Dove Beauty Bar.

Chunk 1: Web Scraping Table of State Abbreviations

URL.STATE_ABBREV = "https://state.1keydata.com/state-abbreviations.php"
STATE_ABBREV = URL.STATE_ABBREV %>%
                read_html() %>%
                html_table(fill=TRUE) %>%
                .[[3]] %>%  # the state table is the third table on the page
                .[-1,]      # drop the header row

head(STATE_ABBREV)

Chunk 2: Transforming the Table

STATE_ABBREV_TOP = STATE_ABBREV[,1:2]  # left two columns of the side-by-side table
names(STATE_ABBREV_TOP)=c("State","Abbrev")
STATE_ABBREV_BOT = STATE_ABBREV[,3:4]  # right two columns of the side-by-side table
names(STATE_ABBREV_BOT)=c("State","Abbrev")
STATE_ABBREV2=rbind(STATE_ABBREV_TOP,STATE_ABBREV_BOT) %>% arrange(State)  # stack the halves and sort
head(STATE_ABBREV2)
write_csv(STATE_ABBREV2,"FINAL_STATE_ABBREV.CSV")

Part 4: Inclusion of Expert Opinion

SelectorGadget is a point-and-click CSS selector tool useful for identifying the CSS selector or XPath for the specific portion of a webpage you want to scrape.

An article at https://www.securitysales.com/fire-intrusion/2018-safest-most-dangerous-states-us/ summarizes a study performed by WalletHub that compared all 50 states using 48 key safety indicators. The article provides a link to the source along with lists of the 10 safest and 10 most dangerous states.

Chunk 1: Web Scraping the 10 Safest and 10 Most Dangerous States

Using SelectorGadget, we point and click the text information we want on the webpage: the list of state names that make up the safest and most dangerous states. After deselecting the information we don't want, we can identify the CSS selector that links to the HTML elements we want to scrape. For this case, the CSS selector is #articleContentWrapper li. The html_nodes() function in rvest locates the places in the HTML code where this CSS selector applies. Since this information is text, we use html_text() to read the list of states from the subsetted HTML into a character vector named SAFE_VS_DANGEROUS.

URL.SAFE_VS_DANGEROUS = "https://www.securitysales.com/fire-intrusion/2018-safest-most-dangerous-states-us/"
SAFE_VS_DANGEROUS = URL.SAFE_VS_DANGEROUS %>%
                      read_html() %>%
                      html_nodes(css="#articleContentWrapper li") %>%
                      html_text() 


print(SAFE_VS_DANGEROUS)

Chunk 2: Create a Data Frame Useful for Joining

Upon inspection of SAFE_VS_DANGEROUS, we notice that this vector of length 20 contains the 10 safest states followed by the 10 most dangerous states, in the order they appear on the website. We create a new data frame combining these states with their classification, according to the website, indicating whether a state is considered safe or dangerous. By linking this information to the violent crimes dataset, we can determine analytically the extent to which violent crimes were considered in this expert opinion. This may lead to either agreement or disagreement with the expert opinion. The most important thing here is that we don't behave like the miserable bloggers who wave their flag of ignorance with anecdotal evidence.

CLASS=c(rep("S",10),rep("D", 10)) #S=SAFE and D=DANGEROUS
print(CLASS)
SAFE_VS_DANGEROUS2=as.tibble(cbind(SAFE_VS_DANGEROUS,CLASS))
names(SAFE_VS_DANGEROUS2)=c("STATE","CLASS")
write_csv(SAFE_VS_DANGEROUS2,"FINAL_SAFE_VS_DANGEROUS.CSV")
head(SAFE_VS_DANGEROUS2)
tail(SAFE_VS_DANGEROUS2)
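
To close the loop, here is a minimal sketch, not part of the graded chunks, of how the four datasets built in this tutorial could be linked for the analysis described earlier. The join keys are assumptions based on the column names created above:

FULL = VIOLENT6 %>%
  inner_join(STATE_ABBREV2, by="State") %>%                    # attach the two-letter Abbrev
  inner_join(ZIP2, by=c("Abbrev"="state","City"="city")) %>%   # attach lat/lon via abbreviation and city
  left_join(SAFE_VS_DANGEROUS2, by=c("State"="STATE"))         # attach CLASS; unlisted states get NA
head(FULL)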