Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis. Do not work with other students on this assignment. You are to complete this assignment by yourself.
Helpful: Make sure you knit the document as you go
through the assignment. Check all your results in the created HTML file.
Change eval=FALSE
to eval=TRUE
as you answer
the questions.
Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.
Many times in data science, your data will be split between many different sources, some of which may be online. In this analysis assignment, we will webscrape country level data from multiple websites, clean the data individually, and merge the data. The website Worldometers contains very interesting country level data that when connected may allow us to learn interesting things about the wonderful world in which we exist:
Follow the steps to accomplish specific tasks, and do not modify code
on lines with the comment #DO NOT CHANGE. Add code in R
code chunks wherever you see COMPLETE. After completing each code chunk,
change eval=FALSE
to eval=TRUE
and
knit the document. If you don’t change,
eval=FALSE
to eval=TRUE
, there will be a large
penalty since none of your output will show in the HTML.
Information at Worldometer GDP contains GDP data from 2023 published by the world bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.
Webscrape the data from https://www.worldometers.info/gdp/gdp-by-country/ into a
data frame in R called GDP
. If done correctly, you should
have a new object in R called GDP
containing all the data
in the table on the website.
Each time you run this code chunk, you may have to run the first line of this code every time to get it to work. You may get some error related to “invalid connection” or “open connection”. You can also try knitting the document every time you make any change to this code chunk to see the output in the HTML file.
URL.GDP="https://www.worldometers.info/gdp/gdp-by-country/" #DO NOT CHANGE
GDP = COMPLETE
head(GDP) #DO NOT CHANGE
Remove the first and fourth variables from GDP
and
create a new data frame named GDP2
based on this
change.
GDP2 = COMPLETE
head(GDP2) #DO NOT CHANGE
Create a new data frame named GDP3
based off
GDP2
where the variables
GDP (nominal, 2023)
,GDP growth
,
Population (2023)
, GDP per capita
, and
Share of World GDP
become GDP
,
Growth
, Population
, PerCapita
,
and Share
, respectively. Be careful!! In the original
variable names, there are multiple spaces between “GDP” and “growth” in
GDP growth
and multiple spaces between “GDP” and “per
capita” in GDP per capita
.
GDP3 = COMPLETE
names(GDP3) #DO NOT CHANGE
Next, we must clean the data so there are no dollar signs or percent
signs in the data using str_replace()
. The dollar sign is a
special character and must be referenced as \\$
. Create a
new data frame named GDP4
where the dollar signs and
percent signs are removed from all necessary variables.
COMPLETE
str(GDP4) #DO NOT CHANGE
Next, create a new data frame named GDP5
where all
commas are removed from potentially numeric variables.
COMPLETE
str(GDP5) #DO NOT CHANGE
Create a new data frame called GDP6
where all the
variables except Country
are changed to numeric
variables.
COMPLETE
str(GDP6) #DO NOT CHANGE
Rewrite over the original GDP
variable with a new
variable called GDP
that is in trillions of
dollars rather than in actual dollars. Rewrite over the
original Population
variable with a new variable of the
same name that is in millions of people rather than in
actual people. You are scaling the original variables to change the
units without changing the variable names. Save your changes in a new
data frame called GDP7
.
COMPLETE
str(GDP7) #DO NOT CHANGE
Check out the Wikipedia page (https://en.wikipedia.org/wiki/Education_Index) which contains the education index for many countries from 1990 to 2019.
Webscrape the data from (https://en.wikipedia.org/wiki/Education_Index) into a
data frame in R, then clean the data so that you have three variables:
Country
, Year
, and Ed.Index
. The
Year
variable should be numeric. Also, sort the data by
Country
.
The final data frame should be called EDU
. I want you to
use the pipe so all of this is done sequentially starting with the URL.
If you don’t use the pipe as requested, then you will lose points.
URL.EDU="https://en.wikipedia.org/wiki/Education_Index" #DO NOT CHANGE
EDU = COMPLETE
head(EDU) #DO NOT CHANGE
Create a new dataset called EDU2
that summarizes the
data in EDU
by country. Calculate the average of the
education index and the standard deviation of the education index for
each country using the yearly education index from 1995 to 2019 only.
These statistics should be named AVG.EDU
and
SD.EDU
, respectively. Also, I only want EDU2
to contain the average and standard deviation for countries that have no
missing values during the years 1995 to 2019. Just like
EDU
, this data frame EDU2
should be sorted
alphabetically from A to Z.
EDU2 = COMPLETE
head(EDU2,20) #DO NOT CHANGE
The nrow()
function in R counts the number of rows in a
tibble or data frame. I want you to write code that uses the
nrow()
function to count the number of countries from
EDU
that have at least 1 missing value over the years 1995
to 2019. Your output should just be 1 number.
COMPLETE
I want you to write code that shows all the data in EDU2
only for countries that are not present in GDP7
. I want you
to use the anti_join()
function in your code and your
output should be a tibble or data frame.
COMPLETE
Some of the countries in EDU2
don’t have a match in
GDP7
because their names don’t match exactly. The list
below contains the countries in EDU2
that I want you to
relabel so that they are identical to what they are called in
GDP7
. Create a data frame called EDU3
that
fixes these problems from EDU2
.
EDU3 = COMPLETE
filter(EDU3, Country %in% c("Bolivia","DR Congo","Czech Republic (Czechia)", #DO NOT CHANGE
"Iran","South Korea","Russia","Vietnam")) #DO NOT CHANGE
Check out the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). On this webpage there is a table that shows the total health spending per capita in PPP international dollars (not inflation-adjusted) according to the World Health Organization.
Webscrape the appropriate table discussed above from (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita),
create a new variable called Health.Change
that is the 2020
value minus the 2018 value, then remove the variables named
2018
, 2019
, 2020
, and
2021
, then remove countries that having missing values due
to the fact that we don’t know the health care spending amount for
either 2020 or 2018. The final data frame should be called
HEALTH
and should only contain two variables
Location
and Health.Change
. You will need to
do some data cleaning to perform the calculation.
URL.HEALTH="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita" #DO NOT CHANGE
HEALTH = COMPLETE
str(HEALTH) #DO NOT CHANGE
In HEALTH
, each location has an * symbol that needs to
be removed if we are going to merge this dataset with the other
datasets. Also, it would be helpful if the location variable was renamed
to Country
. Create a data frame called HEALTH2
that fixes these 2 problems from HEALTH
. The asterisk is a
special character in regular expressions and cannot easily be removed.
Use Google to figure out the appropriate regular expression to identify
an asterisk in a string and use the regular expression to remove it.
HEALTH2 = COMPLETE
str(HEALTH2) #DO NOT CHANGE
Create a data frame called GDP7.EDU3
that performs an
inner join of the GDP7
dataset with EDU3
.
GDP7.EDU3 = COMPLETE
str(GDP7.EDU3) #DO NOT CHANGE
If you do an inner join between GDP7
and
HEALTH2
, there is clearly a problem. The country Ireland is
in both datasets. Why are there so many countries, like Ireland, that
are clearly represented in both data frames not showing up in the merged
data frame? Answer the question in complete sentences in the appropriate
space below. I don’t want you to fix the problem, I just want you to
identify why there is a problem.
#Use this code chunk to inspect why there is a problem
Answer REPLACE THIS WITH YOUR ANSWER IN COMPLETE SENTENCES