Assignment

Part 1: GDP by Country

The Worldometer GDP page contains 2023 GDP data published by the World Bank. GDP is the monetary value of the goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

Q1 (2 Points)

Webscrape the data from https://www.worldometers.info/gdp/gdp-by-country/ into a data frame in R called GDP. If done correctly, you should have a new object in R called GDP containing all the data in the table on the website.

You may have to rerun the first line of this code chunk each time you run the chunk to get it to work. You may get an error related to “invalid connection” or “open connection”. You can also try knitting the document after each change to this code chunk to see the output in the HTML file.

URL.GDP="https://www.worldometers.info/gdp/gdp-by-country/" #DO NOT CHANGE

GDP = COMPLETE

head(GDP) #DO NOT CHANGE
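
As a rough illustration of how an HTML table becomes a data frame, here is a minimal sketch on a toy page. It assumes the rvest package is installed; the object names page and toy and the toy table itself are made up for illustration and are not the Worldometer data.

```r
library(rvest)

# Toy HTML standing in for a web page; nothing is downloaded here.
page <- minimal_html('
  <table>
    <tr><th>Country</th><th>Value</th></tr>
    <tr><td>A</td><td>$1,000</td></tr>
    <tr><td>B</td><td>$2,500</td></tr>
  </table>')

# html_element() selects the first <table>; html_table() converts it to a tibble.
toy <- page %>%
  html_element("table") %>%
  html_table()

toy
```

With a live page, read_html() applied to the URL would take the place of minimal_html(); the rest of the pipeline would look the same, assuming the table you want is the first one on the page.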

Q2 (2 Points)

Remove the first and fourth variables from GDP and create a new data frame named GDP2 based on this change.

GDP2 = COMPLETE

head(GDP2) #DO NOT CHANGE
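
One possible way to drop variables by position is a negative index inside select(). This is only a sketch on a made-up data frame (toy), assuming the dplyr package is loaded.

```r
library(dplyr)

# Hypothetical five-column data frame used only for illustration.
toy <- data.frame(
  id      = 1:2,
  Country = c("A", "B"),
  Value   = c(10, 20),
  Year    = c(2023, 2023),
  Share   = c(0.4, 0.6)
)

# A negative vector of positions drops the first and fourth columns.
toy2 <- toy %>% select(-c(1, 4))

names(toy2)
```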

Q3 (3 Points)

Create a new data frame named GDP3 based on GDP2 where the variables GDP (nominal, 2023), GDP growth, Population (2023), GDP per capita, and Share of World GDP are renamed GDP, Growth, Population, PerCapita, and Share, respectively. Be careful! In the original variable names, there are multiple spaces between “GDP” and “growth” in GDP growth and between “GDP” and “per capita” in GDP per capita.

GDP3 = COMPLETE

names(GDP3) #DO NOT CHANGE
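
Variable names that contain spaces or punctuation have to be wrapped in backticks when you refer to them. The sketch below shows this with rename() on a made-up data frame (toy), assuming dplyr is loaded; the column names are invented and only mimic the kind of names described above.

```r
library(dplyr)

# check.names = FALSE keeps the awkward names exactly as typed,
# including the two spaces in "Value  growth".
toy <- data.frame(
  `Value (nominal, 2023)` = c(1, 2),
  `Value  growth`         = c(0.1, 0.2),
  check.names = FALSE
)

# rename() uses new_name = old_name; backticks protect the odd names.
toy3 <- toy %>%
  rename(Value = `Value (nominal, 2023)`,
         Growth = `Value  growth`)

names(toy3)
```

Printing names() of the data frame first and copying the names from that output is one way to get the exact spacing right.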

Q4 (3 Points)

Next, we must use str_replace() to clean the data so that it contains no dollar signs or percent signs. The dollar sign is a special character in regular expressions and must be referenced as \\$. Create a new data frame named GDP4 where the dollar signs and percent signs are removed from all necessary variables.

COMPLETE

str(GDP4) #DO NOT CHANGE
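
Here is a small sketch of the escaping described above, on a made-up data frame (toy). It assumes the dplyr and stringr packages are loaded; the column names are hypothetical.

```r
library(dplyr)
library(stringr)

toy <- data.frame(
  Country = c("A", "B"),
  Value   = c("$1,000", "$2,500"),
  Growth  = c("2.5%", "-1.0%")
)

# "\\$" matches a literal dollar sign; "%" needs no escaping.
toy4 <- toy %>%
  mutate(Value  = str_replace(Value, "\\$", ""),
         Growth = str_replace(Growth, "%", ""))

toy4
```

Note that the cleaned columns are still character variables at this point; only the symbols have been removed.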

Q5 (3 Points)

Next, create a new data frame named GDP5 where all commas are removed from potentially numeric variables.

COMPLETE

str(GDP5) #DO NOT CHANGE
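
One thing to watch for: str_replace() only replaces the first match in each string, while str_replace_all() replaces every match. The sketch below shows the difference on a single made-up value, assuming stringr is loaded.

```r
library(stringr)

x <- "1,234,567"

# Only the first comma is removed.
first_only <- str_replace(x, ",", "")

# Every comma is removed.
all_commas <- str_replace_all(x, ",", "")

first_only
all_commas
```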

Q6 (2 Points)

Create a new data frame called GDP6 where all the variables except Country are changed to numeric variables.

COMPLETE

str(GDP6) #DO NOT CHANGE
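
One way to convert several variables at once is across() inside mutate(). This is a sketch on a made-up data frame (toy), assuming a version of dplyr that provides across() (1.0.0 or later).

```r
library(dplyr)

toy <- data.frame(
  Country = c("A", "B"),
  Value   = c("1000", "2500"),
  Growth  = c("2.5", "-1.0")
)

# -Country selects every column except Country.
toy6 <- toy %>% mutate(across(-Country, as.numeric))

str(toy6)
```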

Q7 (2 Points)

Overwrite the original GDP variable with a new variable, also called GDP, that is in trillions of dollars rather than in actual dollars. Likewise, overwrite the original Population variable with a new variable of the same name that is in millions of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names. Save your changes in a new data frame called GDP7.

COMPLETE

str(GDP7)  #DO NOT CHANGE
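
Assigning to an existing name inside mutate() overwrites that variable. The sketch below rescales two columns of a made-up data frame (toy), assuming dplyr is loaded; the values are invented.

```r
library(dplyr)

toy <- data.frame(
  Country    = c("A", "B"),
  Value      = c(2.5e12, 5e11),   # in single units
  Population = c(3e8, 1.2e6)      # in single people
)

# Same names on the left, so the originals are replaced.
toy7 <- toy %>%
  mutate(Value      = Value / 1e12,       # now in trillions
         Population = Population / 1e6)   # now in millions

toy7
```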

Part 2: Education Index by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/Education_Index) which contains the education index for many countries from 1990 to 2019.

Q1 (4 Points)

Webscrape the data from (https://en.wikipedia.org/wiki/Education_Index) into a data frame in R, then clean the data so that you have three variables: Country, Year, and Ed.Index. The Year variable should be numeric. Also, sort the data by Country.

The final data frame should be called EDU. I want you to use the pipe so all of this is done sequentially starting with the URL. If you don’t use the pipe as requested, then you will lose points.

URL.EDU="https://en.wikipedia.org/wiki/Education_Index" #DO NOT CHANGE

EDU = COMPLETE

head(EDU) #DO NOT CHANGE
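
If a scraped table has one column per year, it needs to be reshaped from wide to long to get a Year variable. The sketch below assumes that layout and uses a made-up data frame (wide) with the dplyr and tidyr packages loaded; check the layout of the real table before relying on it.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
wide <- data.frame(
  Country = c("B", "A"),
  `1990`  = c(0.5, 0.7),
  `1991`  = c(0.6, NA),
  check.names = FALSE
)

long <- wide %>%
  pivot_longer(cols = -Country, names_to = "Year", values_to = "Index") %>%
  mutate(Year = as.numeric(Year)) %>%   # year labels start out as character
  arrange(Country)

long
```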

Q2 (4 Points)

Create a new dataset called EDU2 that summarizes the data in EDU by country. Calculate the average of the education index and the standard deviation of the education index for each country using the yearly education index from 1995 to 2019 only. These statistics should be named AVG.EDU and SD.EDU, respectively. Also, I only want EDU2 to contain the average and standard deviation for countries that have no missing values during the years 1995 to 2019. Just like EDU, this data frame EDU2 should be sorted alphabetically by country from A to Z.

EDU2 = COMPLETE
  
head(EDU2,20) #DO NOT CHANGE
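
A grouped filter can drop whole groups that contain a missing value before you summarize. Here is a sketch on a made-up long data frame (long), assuming dplyr is loaded; the years and names are invented and much shorter than the real range.

```r
library(dplyr)

long <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year    = rep(2000:2002, times = 2),
  Index   = c(0.5, 0.6, 0.7, 0.4, NA, 0.6)
)

stats <- long %>%
  filter(Year >= 2000, Year <= 2002) %>%   # keep only the years of interest
  group_by(Country) %>%
  filter(!any(is.na(Index))) %>%           # drop countries with any NA
  summarize(AVG = mean(Index), SD = sd(Index)) %>%
  arrange(Country)

stats
```

Country B is dropped because one of its yearly values is missing, so only country A appears in the result.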

Q3 (2 Points)

The nrow() function in R counts the number of rows in a tibble or data frame. I want you to write code that uses the nrow() function to count the number of countries from EDU that have at least 1 missing value over the years 1995 to 2019. Your output should just be 1 number.

COMPLETE
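
One way to end a pipeline with a single number is to reduce the data to one row per group and pipe the result into nrow(). This sketch uses a made-up data frame (long), assuming dplyr is loaded.

```r
library(dplyr)

long <- data.frame(
  Country = rep(c("A", "B", "C"), each = 2),
  Index   = c(0.5, 0.6, NA, 0.4, 0.7, 0.8)
)

n_with_missing <- long %>%
  group_by(Country) %>%
  summarize(n_na = sum(is.na(Index))) %>%   # count NAs per country
  filter(n_na >= 1) %>%                     # keep countries with at least one
  nrow()

n_with_missing
```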

Q4 (2 Points)

I want you to write code that shows all the data in EDU2 only for countries that are not present in GDP7. I want you to use the anti_join() function in your code and your output should be a tibble or data frame.

COMPLETE
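
anti_join(x, y) keeps the rows of x that have no match in y. A minimal sketch on two made-up data frames (x and y), assuming dplyr is loaded:

```r
library(dplyr)

x <- data.frame(Country = c("A", "B", "C"), AVG = c(0.5, 0.6, 0.7))
y <- data.frame(Country = c("A", "C"), Value = c(1.2, 3.4))

# Rows of x whose Country does not appear in y.
unmatched <- anti_join(x, y, by = "Country")

unmatched
```

Only columns from x are returned, so the output shows the unmatched rows exactly as they appear in x.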

Q5 (2 Points)

Some of the countries in EDU2 don’t have a match in GDP7 because their names don’t match exactly. The list below contains the countries in EDU2 that I want you to relabel so that they are identical to what they are called in GDP7. Create a data frame called EDU3 that fixes these problems from EDU2.

Bolivia
Democratic Republic of Congo
Czech Republic
Iran
South Korea
Russia
Vietnam

EDU3 = COMPLETE

filter(EDU3, Country %in% c("Bolivia","DR Congo","Czech Republic (Czechia)", #DO NOT CHANGE
                            "Iran","South Korea","Russia","Vietnam")) #DO NOT CHANGE

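Relabeling a few values while leaving the rest alone can be sketched with case_when() inside mutate(). The country names below are invented for illustration, and dplyr is assumed to be loaded; other functions could be used for the same job.

```r
library(dplyr)

toy <- data.frame(
  Country = c("Northland", "Republic of Southland", "Westland"),
  AVG     = c(0.5, 0.6, 0.7)
)

toy_fixed <- toy %>%
  mutate(Country = case_when(
    Country == "Republic of Southland" ~ "Southland",  # relabel this one
    TRUE ~ Country                                     # keep all others
  ))

toy_fixed
```
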
Part 3: World Health Organization Data by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). On this webpage there is a table that shows the total health spending per capita in PPP international dollars (not inflation-adjusted) according to the World Health Organization.

Q1 (4 Points)

Webscrape the appropriate table discussed above from (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). Create a new variable called Health.Change that is the 2020 value minus the 2018 value, then remove the variables named 2018, 2019, 2020, and 2021. Finally, remove countries that have missing values because we don’t know the health care spending amount for either 2020 or 2018. The final data frame should be called HEALTH and should only contain two variables: Location and Health.Change. You will need to do some data cleaning to perform the calculation.

URL.HEALTH="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita" #DO NOT CHANGE

HEALTH = COMPLETE

str(HEALTH) #DO NOT CHANGE
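
Columns whose names are numbers, such as 2018, must be wrapped in backticks, and text values have to be cleaned before you can subtract them. The sketch below assumes the values arrive as character strings with comma separators; that is an assumption about the scraped table, and the data frame (toy) is made up. It assumes dplyr and stringr are loaded.

```r
library(dplyr)
library(stringr)

toy <- data.frame(
  Location = c("A", "B", "C"),
  `2018`   = c("1,000", "2,000", NA),
  `2020`   = c("1,500", "1,900", "3,000"),
  check.names = FALSE
)

toy_change <- toy %>%
  mutate(across(c(`2018`, `2020`),
                ~ as.numeric(str_replace_all(.x, ",", ""))),  # text to numbers
         Change = `2020` - `2018`) %>%
  select(Location, Change) %>%
  filter(!is.na(Change))   # drop rows where either year was missing

toy_change
```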

Q2 (2 Points)

In HEALTH, each location has an * symbol that needs to be removed if we are going to merge this dataset with the other datasets. Also, it would be helpful if the location variable was renamed to Country. Create a data frame called HEALTH2 that fixes these 2 problems from HEALTH. The asterisk is a special character in regular expressions and cannot easily be removed. Use Google to figure out the appropriate regular expression to identify an asterisk in a string and use the regular expression to remove it.

HEALTH2 = COMPLETE

str(HEALTH2) #DO NOT CHANGE
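
A special regular-expression character can be matched literally either by escaping it with a double backslash or by wrapping the pattern in fixed(). The sketch below shows both on made-up values, assuming dplyr and stringr are loaded; the names are invented.

```r
library(dplyr)
library(stringr)

toy <- data.frame(
  Location = c("Northland*", "Southland*"),
  Change   = c(500, -100)
)

toy_clean <- toy %>%
  mutate(Location = str_replace(Location, "\\*", "")) %>%  # escaped asterisk
  rename(Country = Location)

# fixed() treats the pattern as plain text, so no escaping is needed.
same_result <- str_replace(toy$Location, fixed("*"), "")

toy_clean
```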

Part 4: Merging the Datasets

Q1 (2 Points)

Create a data frame called GDP7.EDU3 that performs an inner join of the GDP7 dataset with EDU3.

GDP7.EDU3 = COMPLETE

str(GDP7.EDU3) #DO NOT CHANGE
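
inner_join(x, y) keeps only the rows whose key appears in both data frames and combines their columns. A minimal sketch on two made-up data frames (x and y), assuming dplyr is loaded:

```r
library(dplyr)

x <- data.frame(Country = c("A", "B", "C"), Value = c(1.0, 2.0, 3.0))
y <- data.frame(Country = c("B", "C", "D"), AVG = c(0.6, 0.7, 0.8))

# Only B and C appear in both, so only those rows survive.
joined <- inner_join(x, y, by = "Country")

joined
```

The match is on exact equality of the key values, so two names must be identical character for character to be joined.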

Q2 (2 Points)

If you do an inner join between GDP7 and HEALTH2, there is clearly a problem. The country Ireland is in both datasets. Why are there so many countries, like Ireland, that are clearly represented in both data frames not showing up in the merged data frame? Answer the question in complete sentences in the appropriate space below. I don’t want you to fix the problem, I just want you to identify why there is a problem.

#Use this code chunk to inspect why there is a problem

Answer REPLACE THIS WITH YOUR ANSWER IN COMPLETE SENTENCES

Analysis 2: Connecting Country Level Data

FIRSTNAME LASTNAME

May 14, 2025
