In this tutorial, we will learn how to scrape the web for information. We will look at data spread across multiple websites that is worth joining and analyzing. The R package rvest was developed by Hadley Wickham to provide a user-friendly web scraping experience. Combining rvest with a CSS selector tool, such as SelectorGadget, or with the developer tools built into your browser, makes the entire process more efficient and less painful. This tutorial, prescribed by Dr. Mario, will be guided and completed in stages during class. The truths I will bestow upon you are inspired by the works of many others.
When I was a little boy, my teachers frowned on any information cited from online sources. In the early days of Wikipedia, instructors considered it unreliable because of its openly editable format. In academia, Wikipedia still does not appear in research citations, but it is often a first stop when investigating almost any topic in any field. A search may not give you all the information you want, but you will often be provided with helpful links to the sources of that information. Searching for Wikipedia on Wikipedia provides a historical overview, including the positives and negatives.
Many Wikipedia pages contain information organized in tables. Consider the US Crime Rates table, which provides the crime rates per 100,000 people for major metropolitan areas in the United States. Statistics are given for violent crimes, property crimes, and arson. We want to convert the HTML table into an R data frame object so we can use this information in future analysis.
Using rvest functions such as read_html() and html_table(), we create a data frame called VIOLENT that contains all information read from the desired Wikipedia page.
library(rvest)      # read_html(), html_table(), html_nodes(), html_text()
library(tidyverse)  # pipes, dplyr, stringr, readr

URL.VIOLENT="https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate#Crime_rates_per_100.2C000_people_.282012.29"
VIOLENT = URL.VIOLENT %>%
  read_html() %>%             # download and parse the page's HTML
  html_table(fill=TRUE) %>%   # extract every table as a list of data frames
  .[[1]]                      # keep the first table on the page
str(VIOLENT)
You probably noticed an issue with the first row. Why do you think this occurred? Carefully examine the webpage table to identify the cause. Also, we only want the violent crime statistics for each city and state. Which columns do we need to keep?
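One quick way to see the problem (an illustrative check, not part of the original walkthrough) is to print the first few rows of the raw scrape; the subheader text shows up as data rows:
head(VIOLENT[,1:8],3)   # first rows repeat header/subheader text instead of data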
VIOLENT2=VIOLENT[-(1:2),1:8]   # drop the two leftover header rows; keep the violent crime columns
colnames(VIOLENT2)=c("State","City","Population","Total","Murder","Rape","Robbery","Assault")
str(VIOLENT2)
The subheadings in the Wikipedia table create problems when the table is parsed into a data frame: variables we know to be numeric are instead treated as categorical (character) variables. We need to use functions from inside and outside the tidyverse to fix this issue.
VIOLENT3=VIOLENT2 %>%
  mutate(Population=str_replace_all(Population,pattern=",",replacement="")) %>%  # strip thousands separators
  mutate_at(3:8,as.numeric)  # convert columns 3 through 8 to numeric
str(VIOLENT3)
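To see why the commas matter, here is a toy illustration (not from the original tutorial): as.numeric() cannot parse digits grouped with commas, so they must be stripped first.
as.numeric("1,234")                           # NA, with a coercion warning
as.numeric(str_replace_all("1,234",",",""))   # 1234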
City and State

Using str_detect(), we can identify cases in City where unwanted characters {",", "0" to "9"} make an appearance.
VIOLENT3[str_detect(VIOLENT3$City,"[,(0-9)]{1}"),]$City     # cities containing a comma, parenthesis, or digit
VIOLENT3[str_detect(VIOLENT3$City,"[[:digit:]]"),]$City     # same idea using the POSIX digit class
VIOLENT3[str_detect(VIOLENT3$State,"[,(0-9)]{1}"),]$State   # states with footnote characters
The unwanted characters {",", "0" to "9"} are special, but not our friends. Wikipedia uses them to indicate footnotes. We need to efficiently fix these occurrences by banishing them to the abyss. We see this in both cities and states. There are also cities with unusual names that may be problematic when merging with additional data.
VIOLENT4 = VIOLENT3 %>%
  mutate(City=str_replace_all(City,"[,(0-9)]{1}","")) %>%   # remove footnote characters from City
  mutate(State=str_replace_all(State,"[,(0-9)]{1}",""))     # and from State
VIOLENT5 = VIOLENT4 %>%
  mutate(City=ifelse(City=="Charlotte-Mecklenburg","Charlotte",City))  # standardize an unusual city name
VIOLENT6 = VIOLENT5 %>%
  mutate(City=str_replace(City," Metro",""))   # drop " Metro" suffixes
write_csv(VIOLENT6,"FINAL_VIOLENT.CSV")
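As a hypothetical sanity check (not in the original code), we can confirm that no footnote characters survived the cleaning:
sum(str_detect(VIOLENT6$City,"[,(0-9)]"))    # expect 0
sum(str_detect(VIOLENT6$State,"[,(0-9)]"))   # expect 0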
The noncensus package in R contains preloaded datasets useful for population studies. Imagine if we wanted to plot the crime information from VIOLENT6 on a US map. What information would we need?

The zip_codes dataset provided in the noncensus package gives us important geographical information for all US cities.
library(noncensus)  # preloaded demographic and geographic datasets
data(zip_codes)     # load the zip_codes data frame
ZIP=zip_codes
str(ZIP)
Many US metropolitan areas cover multiple zip codes. Using group_by() and summarize(), we can obtain the average latitude and longitude for each combination of city and state.
ZIP2 = ZIP %>%
  group_by(city,state) %>%
  summarize(lat=mean(latitude),lon=mean(longitude)) %>%  # average coordinates across a city's zip codes
  ungroup()
write_csv(ZIP2,"FINAL_ZIP.CSV")
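A quick spot check (illustrative only) confirms that ZIP2 now has exactly one averaged coordinate per city/state pair:
nrow(ZIP2)                          # total rows
n_distinct(ZIP2$city,ZIP2$state)    # distinct city/state pairs; should equal nrow(ZIP2)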
Run the code below. What makes merging ZIP2 with VIOLENT6 problematic?
head(VIOLENT6[,1:4],3)
head(ZIP2,3)
Using your favorite search engine, you can look up the government-established abbreviations for each state in the US of A. An HTML table found at https://state.1keydata.com/state-abbreviations.php makes this connection for us. However, the way this data is presented is not in the tidy spirit. Soap can't clean this, unless it is the Dove Beauty Bar.
URL.STATE_ABBREV = "https://state.1keydata.com/state-abbreviations.php"
STATE_ABBREV = URL.STATE_ABBREV %>%
  read_html() %>%
  html_table(fill=TRUE) %>%
  .[[3]] %>%   # the abbreviations live in the third table on the page
  .[-1,]       # drop the header row
head(STATE_ABBREV)
STATE_ABBREV_TOP = STATE_ABBREV[,1:2]   # left half of the two-column layout
names(STATE_ABBREV_TOP)=c("State","Abbrev")
STATE_ABBREV_BOT = STATE_ABBREV[,3:4]   # right half
names(STATE_ABBREV_BOT)=c("State","Abbrev")
STATE_ABBREV2=rbind(STATE_ABBREV_TOP,STATE_ABBREV_BOT) %>% arrange(State)  # stack the halves and sort
head(STATE_ABBREV2)
write_csv(STATE_ABBREV2,"FINAL_STATE_ABBREV.CSV")
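With the abbreviation lookup saved, here is a minimal join sketch (hypothetical, not the prescribed workflow; it assumes ZIP2 stores lowercase city names and two-letter state codes) showing how the three tables could be combined for mapping:
MAP_DATA = VIOLENT6 %>%
  inner_join(STATE_ABBREV2,by="State") %>%           # attach two-letter state codes
  mutate(city=str_to_lower(City),state=Abbrev) %>%   # align names and case with ZIP2
  inner_join(ZIP2 %>% mutate(city=str_to_lower(city)),
             by=c("city","state"))                   # attach averaged coordinates
head(MAP_DATA)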
SelectorGadget is a point-and-click CSS selector tool useful for identifying the CSS selector or XPath for the specific portion of the webpage you want to scrape.
An article at https://www.securitysales.com/fire-intrusion/2018-safest-most-dangerous-states-us/ summarizes a study performed by WalletHub, which compared all 50 states using 48 key safety indicators. The article provides a link to the source along with lists of the 10 safest and 10 most dangerous states.
Using SelectorGadget, we point and click on the text information we want from the webpage: the list of state names comprising the safest and most dangerous states. After deselecting the information we don't want, we can identify the CSS selector that links to the HTML elements we want to scrape. In this case, the CSS selector is #articleContentWrapper li. The html_nodes() function in rvest locates where in the HTML code this CSS selector is used to present the information visually on the webpage. Since this information is text, we use html_text() to read the list of states from the subsetted HTML into a vector named SAFE_VS_DANGEROUS.
URL.SAFE_VS_DANGEROUS = "https://www.securitysales.com/fire-intrusion/2018-safest-most-dangerous-states-us/"
SAFE_VS_DANGEROUS = URL.SAFE_VS_DANGEROUS %>%
  read_html() %>%
  html_nodes(css="#articleContentWrapper li") %>%  # grab the list items holding state names
  html_text()                                      # extract the visible text
print(SAFE_VS_DANGEROUS)
Upon inspection of SAFE_VS_DANGEROUS, we notice that this vector of length 20 contains the 10 safest and 10 most dangerous states in the order they appear on the website. We create a new data frame combining these states with their classification, according to the website, indicating whether a state is considered safe or dangerous. By linking this information to the violent crimes dataset, we can determine analytically the extent to which violent crimes were considered in this expert opinion. This may lead us to agree or disagree with the expert opinion. The most important thing here is that we don't behave like the miserable bloggers who wave their flag of ignorance with anecdotal evidence.
CLASS=c(rep("S",10),rep("D", 10)) #S=SAFE and D=DANGEROUS
print(CLASS)
SAFE_VS_DANGEROUS2=tibble(STATE=SAFE_VS_DANGEROUS,CLASS=CLASS)  # as.tibble() is deprecated; build the tibble directly
write_csv(SAFE_VS_DANGEROUS2,"FINAL_SAFE_VS_DANGEROUS.CSV")
head(SAFE_VS_DANGEROUS2)
tail(SAFE_VS_DANGEROUS2)
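Finally, to close the loop described above, here is one possible comparison (a sketch, assuming the scraped state names match the cleaned names in VIOLENT6 exactly): join the classification to the violent crime data and compare the average violent crime totals between the two groups.
VIOLENT6 %>%
  inner_join(SAFE_VS_DANGEROUS2,by=c("State"="STATE")) %>%  # keep cities in classified states
  group_by(CLASS) %>%
  summarize(MeanTotal=mean(Total,na.rm=TRUE),Cities=n())    # compare S vs D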