Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis. Do not work with other students on this assignment. You are to complete this assignment by yourself.
Helpful: Make sure you knit the document as you go
through the assignment. Check all your results in the created HTML file.
Change eval=FALSE to eval=TRUE as you answer
the questions.
Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.
Many times in data science, your data will be split between many different sources, some of which may be online. In this analysis assignment, we will webscrape country level data from multiple websites, clean the data individually, and merge the data. The website Worldometers contains very interesting country level data that when connected may allow us to learn interesting things about the wonderful world in which we exist:
Follow the steps to accomplish specific tasks, and do not modify code
on lines with the comment #DO NOT CHANGE. Add code in R
code chunks wherever you see COMPLETE. After completing each code chunk,
change eval=FALSE to eval=TRUE and
knit the document. If you don’t change,
eval=FALSE to eval=TRUE, there will be a large
penalty since none of your output will show in the HTML.
Information at Worldometer GDP contains GDP data from 2025 published by the world bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.
Webscrape the data from https://www.worldometers.info/gdp/gdp-by-country/ into a
data frame in R called GDP. If done correctly, you should
have a new object in R called GDP which is a data frame
with 218 observations and 6 variables.
Each time you run this code chunk, you may have to run the first line of this code every time to get it to work. You may get some error related to “invalid connection” or “open connection”. You can also try knitting the document every time you make any change to this code chunk to see the output in the HTML file.
URL.GDP="https://www.worldometers.info/gdp/gdp-by-country/" #DO NOT CHANGE
GDP = COMPLETE
head(GDP) #DO NOT CHANGE
Remove the first and third variables from GDP and create
a new data frame named GDP2 based on this change.
GDP2 = COMPLETE
head(GDP2) #DO NOT CHANGE
Create a new data frame named GDP3 based off
GDP2 where the variables GDP (Full Value),
GDP Growth and GDP per capita become
GDP, Growth and PerCapita,
respectively.
GDP3 = COMPLETE
names(GDP3) #DO NOT CHANGE
Next, we must clean the data so there are no dollar signs or percent
signs in the data using str_replace(). The dollar sign is a
special character and must be referenced as \\$. Create a
new data frame named GDP4 where the dollar signs and
percent signs are removed from all necessary variables.
COMPLETE
str(GDP4) #DO NOT CHANGE
Next, create a new data frame named GDP5 where all
commas are removed from potentially numeric variables. Also, there is a
problem with the Growth variable. This variable contains a
“minus sign” which is distinct from a hyphen. You will need to identify
the unicode of the minus sign and use it with str_replace()
to change it to the classic hyphen.
COMPLETE
str(GDP5) #DO NOT CHANGE
Create a new data frame called GDP6 where all the
variables except Country are changed to numeric
variables.
COMPLETE
str(GDP6) #DO NOT CHANGE
Rewrite over the original GDP variable with a new
variable called GDP that is in trillions of
dollars rather than in actual dollars. Rewrite over the
original Population variable with a new variable of the
same name that is in millions of people rather than in
actual people. You are scaling the original variables to change the
units without changing the variable names. Save your changes in a new
data frame called GDP7.
COMPLETE
str(GDP7) #DO NOT CHANGE
Check out the Wikipedia page (https://en.wikipedia.org/wiki/Education_Index) which contains the education index for many countries from 1990 to 2019.
Webscrape the data from (https://en.wikipedia.org/wiki/Education_Index) into a
data frame in R, then clean the data so that you have three variables:
Country, Year, and Ed.Index. The
Year variable should be numeric. Also, sort the data by
Country.
The final data frame should be called EDU. I want you to
use the pipe so all of this is done sequentially starting with the URL.
If you don’t use the pipe as requested, then you will lose points.
URL.EDU="https://en.wikipedia.org/wiki/Education_Index" #DO NOT CHANGE
EDU = COMPLETE
head(EDU) #DO NOT CHANGE
Create a new dataset called EDU2 that summarizes the
data in EDU by country. Calculate the average of the
education index and the standard deviation of the education index for
each country only for the years 2000 to 2019. These statistics should be
named AVG.EDU and SD.EDU, respectively. Also,
I only want to EDU2 to contain the average and standard
deviation for countries that don’t have any missing values during the
years 2000 to 2019. Just like EDU, this data frame
EDU2 should be sorted alphabetically from A to Z.
EDU2 = COMPLETE
head(EDU2,20) #DO NOT CHANGE
The nrow() function in R counts the number of rows in a
tibble or data frame. I want you to write code that uses the
nrow() function to count the number of countries from
EDU that have at least 1 missing value over the years 2000
to 2019. Your output should just be 1 number.
COMPLETE
I want you to write code that shows all the data in EDU2
only for countries that are not present in GDP7. I want you
to use the anti_join() function in your code and your
output should be a tibble or data frame.
COMPLETE
Some of the countries in EDU2 don’t have a match in
GDP7 because their names don’t match exactly. The list
below contains the countries in EDU2 that I want you to
rename (relabel) so that they are identical to what they are called in
GDP7. Create a data frame called EDU3 that
fixes these problems from EDU2.
EDU3 = COMPLETE
filter(EDU3, Country %in% c("Bolivia","DR Congo","Czech Republic (Czechia)", #DO NOT CHANGE
"Iran","South Korea","Russia","Tanzania","Venezuela","Vietnam")) #DO NOT CHANGE
Check out the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). On this webpage there is a table that uses OECD data and shows each country’s total spending (public and private) on health per capita in PPP international dollars.
Webscrape the appropriate table discussed above from (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita),
create a new variable called Health.Change that is the 2024
value minus the 2020 value, then remove the variables named
2020, 2021, 2022,
2023, and 2024, then remove countries that
having missing values due to the fact that we don’t know the health care
spending amount for either 2024 or 2020. The final data frame should be
called HEALTH and should only contain two variables
Locations and Health.Change. You will need to
do some data cleaning to perform the calculation.
URL.HEALTH="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita" #DO NOT CHANGE
HEALTH = COMPLETE
str(HEALTH) #DO NOT CHANGE
Create a data frame called GDP7.EDU3 that performs an
inner join of the GDP7 dataset with EDU3.
GDP7.EDU3 = COMPLETE
str(GDP7.EDU3) #DO NOT CHANGE
I want you to write code to fix two country names in
HEALTH so that when you merge the data from
GDP7 into HEALTH you don’t have any missing
values. After fixing the two country names in HEALTH, use a
right join appropriately to merge the data from GDP7 into
HEALTH so you have the three GDP-related variables with the
Health.Change variable, but only for OECD countries. I want
the merged dataset to be called GDP7.HEALTH. This can be
done in multiple lines of code. I don’t care if you do all this with
string of code connected via the pipe. I do care that
GDP7.HEALTH is created using a right
join.
GDP7.HEALTH = COMPLETE
filter(GDP7.HEALTH,Country %in% c("South Korea","Czech Republic (Czechia)")) #DO NOT CHANGE