Instructions

Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis. Do not work with other students on this assignment. You are to complete this assignment by yourself.

Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file. Change eval=FALSE to eval=TRUE as you answer the questions.

Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.

Introduction

Many times in data science, your data will be split between many different sources, some of which may be online. In this analysis assignment, we will webscrape country level data from multiple websites, clean the data individually, and merge the data. The website Worldometers contains very interesting country level data that when connected may allow us to learn interesting things about the wonderful world in which we exist:

Follow the steps to accomplish specific tasks, and do not modify code on lines with the comment #DO NOT CHANGE. Add code in R code chunks wherever you see COMPLETE. After completing each code chunk, change eval=FALSE to eval=TRUE and knit the document. If you don’t change, eval=FALSE to eval=TRUE, there will be a large penalty since none of your output will show in the HTML.

Assignment

Part 1: GDP by Country

Information at Worldometer GDP contains GDP data from 2022 published by the world bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

Q1 (2 Points)

Webscrape the data from https://www.worldometers.info/gdp/gdp-by-country/ into a data frame in R called GDP. If done correctly, you should have a new object in R called GDP which is a data frame with 190 observations and 8 variables.

URL.GDP="https://www.worldometers.info/gdp/gdp-by-country/" #DO NOT CHANGE

GDP = COMPLETE

head(GDP) #DO NOT CHANGE

Q2 (2 Points)

Remove the first and fourth variables from GDP and create a new data frame named GDP2 based on this change.

GDP2 = COMPLETE

head(GDP2) #DO NOT CHANGE

Q3 (3 Points)

Create a new data frame named GDP3 based off GDP2 where the variables GDP (nominal, 2022),GDP growth, Population (2022), GDP per capita, and Share of World GDP become GDP, Growth, Population, PerCapita, and Share, respectively. Be careful!! In the original variable names, there are multiple spaces between “GDP” and “growth” in GDP growth and multiple spaces between “GDP” and “per capita” in GDP per capita.

GDP3 = COMPLETE

names(GDP3) #DO NOT CHANGE

Q4 (3 Points)

Next, we must clean the data so there are no dollar signs or percent signs in the data using str_replace(). The dollar sign is a special character and must be referenced as \\$. Create a new data frame named GDP4 where the dollar signs and percent signs are removed from all necessary variables.

COMPLETE

str(GDP4) #DO NOT CHANGE

Q5 (3 Points)

Next, create a new data frame named GDP5 where all commas are removed from potentially numeric variables.

COMPLETE

str(GDP5) #DO NOT CHANGE

Q6 (2 Points)

Create a new data frame called GDP6 where all the variables except Country are changed to numeric variables.

COMPLETE

str(GDP6) #DO NOT CHANGE

Q7 (2 Points)

Rewrite over the original GDP variable with a new variable called GDP that is in trillions of dollars rather than in actual dollars. Rewrite over the original Population variable with a new variable of the same name that is in millions of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names. Save your changes in a new data frame called GDP7.

COMPLETE

str(GDP7)  #DO NOT CHANGE

Part 2: Education Index by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/Education_Index) which contains the education index for many countries from 1990 to 2019.

Q1 (4 Points)

Webscrape the data from (https://en.wikipedia.org/wiki/Education_Index) into a data frame in R, then clean the data so that you have three variables: Country, Year, and Ed.Index. The Year variable should be numeric. Also, sort the data by Country.

The final data frame should be called EDU. I want you to use the pipe so all of this is done sequentially starting with the URL. If you don’t use the pipe as requested, then you will lose points.

URL.EDU="https://en.wikipedia.org/wiki/Education_Index" #DO NOT CHANGE

EDU = COMPLETE

head(EDU) #DO NOT CHANGE

Q2 (4 Points)

Create a new dataset called EDU2 that summarizes the data in EDU by country. Calculate the average of the education index and the standard deviation of the education index for each country only for the years 2000 to 2019. These statistics should be named AVG.EDU and SD.EDU, respectively. Also, I only want to EDU2 to contain the average and standard deviation for countries that don’t have any missing values during the years 2000 to 2019. Just like EDU, this data frame EDU2 should be sorted alphabetically from A to Z.

EDU2 = COMPLETE
  
head(EDU2,20) #DO NOT CHANGE

Q3 (2 Points)

The nrow() function in R counts the number of rows in a tibble or data frame. I want you to write code that uses the nrow() function to count the number of countries from EDU that have at least 1 missing value over the years 2000 to 2019. Your output should just be 1 number.

COMPLETE

Q4 (2 Points)

I want you to write code that shows all the data in EDU2 only for countries that are not present in GDP7. I want you to use the anti_join() function in your code and your output should be a tibble or data frame.

COMPLETE

Q5 (2 Points)

Some of the countries in EDU2 don’t have a match in GDP7 because their names don’t match exactly. The list below contains the countries in EDU2 that I want you to rename (relabel) so that they are identical to what they are called in GDP7. Create a data frame called EDU3 that fixes these problems from EDU2.

  • Bolivia
  • Democratic Republic of Congo
  • Czech Republic
  • Iran
  • South Korea
  • Russia
  • Vietnam
EDU3 = COMPLETE

filter(EDU3, Country %in% c("Bolivia","DR Congo","Czech Republic (Czechia)", #DO NOT CHANGE
                            "Iran","South Korea","Russia","Vietnam")) #DO NOT CHANGE

Part 3: World Health Organization Data by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). On this webpage there is a table that shows the total health spending per capita in PPP international dollars (not inflation-adjusted) according to the World Health Organization.

Q1 (4 Points)

Webscrape the appropriate table discussed above from (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita), create a new variable called Health.Change that is the 2020 value minus the 2018 value, then remove the variables named 2018, 2019, 2020, and 2021, then remove countries that having missing values due to the fact that we don’t know the health care spending amount for either 2020 or 2018. The final data frame should be called HEALTH and should only contain two variables Location and Health.Change. You will need to do some data cleaning to perform the calculation.

URL.HEALTH="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita" #DO NOT CHANGE

HEALTH = COMPLETE

str(HEALTH) #DO NOT CHANGE

Q2 (2 Points)

In HEALTH, each location has an * symbol that needs to be removed if we are going to merge this dataset with the other datasets. Create a data frame called HEALTH2 that fixes this problem from HEALTH. The asterisk is a special character and must be referenced as \\*.

HEALTH2 = COMPLETE

str(HEALTH2) #DO NOT CHANGE

Part 3: Merging the Datasets

Q1 (2 Points)

Create a data frame called GDP7.EDU3 that performs an inner join of the GDP7 dataset with EDU3.

GDP7.EDU3 = COMPLETE

str(GDP7.EDU3) #DO NOT CHANGE

Q2 (2 Points)

If you do an inner join between GDP7 and HEALTH2, there is clearly a problem. The country Ireland is in both datasets. Why are there so many countries, like Ireland, that are clearly represented in both data frames not showing up in the merged data frame? Answer the question in complete sentences in the appropriate space below. I don’t want you to fix the problem, I just want you to identify why there is a problem.

#Use this code chunk to inspect why there is a problem

Answer REPLACE THIS WITH YOUR ANSWER IN COMPLETE SENTENCES