Instructions

Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis. Do not work with other students on this assignment. You are to complete this assignment by yourself.

Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file. Change eval=FALSE to eval=TRUE as you answer the questions.

Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.

Introduction

Many times in data science, your data will be split between many different sources, some of which may be online. In this analysis assignment, we will webscrape country level data from multiple websites, clean the data individually, and merge the data. The website Worldometers contains very interesting country level data that when connected may allow us to learn interesting things about the wonderful world in which we exist:

Follow the steps to accomplish specific tasks, and do not modify code on lines with the comment #DO NOT CHANGE. Add code in R code chunks wherever you see COMPLETE. After completing each code chunk, change eval=FALSE to eval=TRUE and knit the document. If you don’t change, eval=FALSE to eval=TRUE, there will be a large penalty since none of your output will show in the HTML.

Assignment

Part 1: GDP by Country

Information at Worldometer GDP contains GDP data from 2025 published by the world bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

Q1 (2 Points)

Webscrape the data from https://www.worldometers.info/gdp/gdp-by-country/ into a data frame in R called GDP. If done correctly, you should have a new object in R called GDP which is a data frame with 218 observations and 6 variables.

Each time you run this code chunk, you may have to run the first line of this code every time to get it to work. You may get some error related to “invalid connection” or “open connection”. You can also try knitting the document every time you make any change to this code chunk to see the output in the HTML file.

URL.GDP="https://www.worldometers.info/gdp/gdp-by-country/" #DO NOT CHANGE

GDP = COMPLETE

head(GDP) #DO NOT CHANGE

Q2 (2 Points)

Remove the first and third variables from GDP and create a new data frame named GDP2 based on this change.

GDP2 = COMPLETE

head(GDP2) #DO NOT CHANGE

Q3 (3 Points)

Create a new data frame named GDP3 based off GDP2 where the variables GDP (Full Value), GDP Growth and GDP per capita become GDP, Growth and PerCapita, respectively.

GDP3 = COMPLETE

names(GDP3) #DO NOT CHANGE

Q4 (3 Points)

Next, we must clean the data so there are no dollar signs or percent signs in the data using str_replace(). The dollar sign is a special character and must be referenced as \\$. Create a new data frame named GDP4 where the dollar signs and percent signs are removed from all necessary variables.

COMPLETE

str(GDP4) #DO NOT CHANGE

Q5 (3 Points)

Next, create a new data frame named GDP5 where all commas are removed from potentially numeric variables. Also, there is a problem with the Growth variable. This variable contains a “minus sign” which is distinct from a hyphen. You will need to identify the unicode of the minus sign and use it with str_replace() to change it to the classic hyphen.

COMPLETE

str(GDP5) #DO NOT CHANGE

Q6 (2 Points)

Create a new data frame called GDP6 where all the variables except Country are changed to numeric variables.

COMPLETE

str(GDP6) #DO NOT CHANGE

Q7 (2 Points)

Rewrite over the original GDP variable with a new variable called GDP that is in trillions of dollars rather than in actual dollars. Rewrite over the original Population variable with a new variable of the same name that is in millions of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names. Save your changes in a new data frame called GDP7.

COMPLETE

str(GDP7)  #DO NOT CHANGE

Part 2: Education Index by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/Education_Index) which contains the education index for many countries from 1990 to 2019.

Q1 (4 Points)

Webscrape the data from (https://en.wikipedia.org/wiki/Education_Index) into a data frame in R, then clean the data so that you have three variables: Country, Year, and Ed.Index. The Year variable should be numeric. Also, sort the data by Country.

The final data frame should be called EDU. I want you to use the pipe so all of this is done sequentially starting with the URL. If you don’t use the pipe as requested, then you will lose points.

URL.EDU="https://en.wikipedia.org/wiki/Education_Index" #DO NOT CHANGE

EDU = COMPLETE

head(EDU) #DO NOT CHANGE

Q2 (4 Points)

Create a new dataset called EDU2 that summarizes the data in EDU by country. Calculate the average of the education index and the standard deviation of the education index for each country only for the years 2000 to 2019. These statistics should be named AVG.EDU and SD.EDU, respectively. Also, I only want to EDU2 to contain the average and standard deviation for countries that don’t have any missing values during the years 2000 to 2019. Just like EDU, this data frame EDU2 should be sorted alphabetically from A to Z.

EDU2 = COMPLETE
  
head(EDU2,20) #DO NOT CHANGE

Q3 (2 Points)

The nrow() function in R counts the number of rows in a tibble or data frame. I want you to write code that uses the nrow() function to count the number of countries from EDU that have at least 1 missing value over the years 2000 to 2019. Your output should just be 1 number.

COMPLETE

Q4 (2 Points)

I want you to write code that shows all the data in EDU2 only for countries that are not present in GDP7. I want you to use the anti_join() function in your code and your output should be a tibble or data frame.

COMPLETE

Q5 (2 Points)

Some of the countries in EDU2 don’t have a match in GDP7 because their names don’t match exactly. The list below contains the countries in EDU2 that I want you to rename (relabel) so that they are identical to what they are called in GDP7. Create a data frame called EDU3 that fixes these problems from EDU2.

  • Bolivia
  • Democratic Republic of Congo
  • Czech Republic
  • Iran
  • South Korea
  • Russia
  • Tanzania
  • Venezuela
  • Vietnam
EDU3 = COMPLETE

filter(EDU3, Country %in% c("Bolivia","DR Congo","Czech Republic (Czechia)", #DO NOT CHANGE
                            "Iran","South Korea","Russia","Tanzania","Venezuela","Vietnam")) #DO NOT CHANGE

Part 3: World Health Organization Data by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). On this webpage there is a table that uses OECD data and shows each country’s total spending (public and private) on health per capita in PPP international dollars.

Q1 (4 Points)

Webscrape the appropriate table discussed above from (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita), create a new variable called Health.Change that is the 2024 value minus the 2020 value, then remove the variables named 2020, 2021, 2022, 2023, and 2024, then remove countries that having missing values due to the fact that we don’t know the health care spending amount for either 2024 or 2020. The final data frame should be called HEALTH and should only contain two variables Locations and Health.Change. You will need to do some data cleaning to perform the calculation.

URL.HEALTH="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita" #DO NOT CHANGE

HEALTH = COMPLETE

str(HEALTH) #DO NOT CHANGE

Part 3: Merging the Datasets

Q1 (2 Points)

Create a data frame called GDP7.EDU3 that performs an inner join of the GDP7 dataset with EDU3.

GDP7.EDU3 = COMPLETE

str(GDP7.EDU3) #DO NOT CHANGE

Q2 (4 Points)

I want you to write code to fix two country names in HEALTH so that when you merge the data from GDP7 into HEALTH you don’t have any missing values. After fixing the two country names in HEALTH, use a right join appropriately to merge the data from GDP7 into HEALTH so you have the three GDP-related variables with the Health.Change variable, but only for OECD countries. I want the merged dataset to be called GDP7.HEALTH. This can be done in multiple lines of code. I don’t care if you do all this with string of code connected via the pipe. I do care that GDP7.HEALTH is created using a right join.

GDP7.HEALTH = COMPLETE

filter(GDP7.HEALTH,Country %in% c("South Korea","Czech Republic (Czechia)"))  #DO NOT CHANGE