Assignment

Part 1: GDP by Country

Information at Worldometer GDP contains GDP data from 2025 published by the world bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

Q1 (2 Points)

Webscrape the data from https://www.worldometers.info/gdp/gdp-by-country/ into a data frame in R called GDP. If done correctly, you should have a new object in R called GDP which is a data frame with 218 observations and 6 variables.

Each time you run this code chunk, you may have to run the first line of this code every time to get it to work. You may get some error related to “invalid connection” or “open connection”. You can also try knitting the document every time you make any change to this code chunk to see the output in the HTML file.

URL.GDP="https://www.worldometers.info/gdp/gdp-by-country/" #DO NOT CHANGE

GDP = COMPLETE

head(GDP) #DO NOT CHANGE

Q2 (2 Points)

Remove the first and third variables from GDP and create a new data frame named GDP2 based on this change.

GDP2 = COMPLETE

head(GDP2) #DO NOT CHANGE

Q3 (3 Points)

Create a new data frame named GDP3 based off GDP2 where the variables GDP (Full Value), GDP Growth and GDP per capita become GDP, Growth and PerCapita, respectively.

GDP3 = COMPLETE

names(GDP3) #DO NOT CHANGE

Q4 (3 Points)

Next, we must clean the data so there are no dollar signs or percent signs in the data using str_replace(). The dollar sign is a special character and must be referenced as \\$. Create a new data frame named GDP4 where the dollar signs and percent signs are removed from all necessary variables.

COMPLETE

str(GDP4) #DO NOT CHANGE

Q5 (3 Points)

Next, create a new data frame named GDP5 where all commas are removed from potentially numeric variables. Also, there is a problem with the Growth variable. This variable contains a “minus sign” which is distinct from a hyphen. You will need to identify the unicode of the minus sign and use it with str_replace() to change it to the classic hyphen.

COMPLETE

str(GDP5) #DO NOT CHANGE

Q6 (2 Points)

Create a new data frame called GDP6 where all the variables except Country are changed to numeric variables.

COMPLETE

str(GDP6) #DO NOT CHANGE

Q7 (2 Points)

Rewrite over the original GDP variable with a new variable called GDP that is in trillions of dollars rather than in actual dollars. Rewrite over the original Population variable with a new variable of the same name that is in millions of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names. Save your changes in a new data frame called GDP7.

COMPLETE

str(GDP7)  #DO NOT CHANGE

Part 2: Education Index by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/Education_Index) which contains the education index for many countries from 1990 to 2019.

Q1 (4 Points)

Webscrape the data from (https://en.wikipedia.org/wiki/Education_Index) into a data frame in R, then clean the data so that you have three variables: Country, Year, and Ed.Index. The Year variable should be numeric. Also, sort the data by Country.

The final data frame should be called EDU. I want you to use the pipe so all of this is done sequentially starting with the URL. If you don’t use the pipe as requested, then you will lose points.

URL.EDU="https://en.wikipedia.org/wiki/Education_Index" #DO NOT CHANGE

EDU = COMPLETE

head(EDU) #DO NOT CHANGE

Q2 (4 Points)

Create a new dataset called EDU2 that summarizes the data in EDU by country. Calculate the average of the education index and the standard deviation of the education index for each country only for the years 2000 to 2019. These statistics should be named AVG.EDU and SD.EDU, respectively. Also, I only want to EDU2 to contain the average and standard deviation for countries that don’t have any missing values during the years 2000 to 2019. Just like EDU, this data frame EDU2 should be sorted alphabetically from A to Z.

EDU2 = COMPLETE
  
head(EDU2,20) #DO NOT CHANGE

Q3 (2 Points)

The nrow() function in R counts the number of rows in a tibble or data frame. I want you to write code that uses the nrow() function to count the number of countries from EDU that have at least 1 missing value over the years 2000 to 2019. Your output should just be 1 number.

COMPLETE

Q4 (2 Points)

I want you to write code that shows all the data in EDU2 only for countries that are not present in GDP7. I want you to use the anti_join() function in your code and your output should be a tibble or data frame.

COMPLETE

Q5 (2 Points)

Some of the countries in EDU2 don’t have a match in GDP7 because their names don’t match exactly. The list below contains the countries in EDU2 that I want you to rename (relabel) so that they are identical to what they are called in GDP7. Create a data frame called EDU3 that fixes these problems from EDU2.

Bolivia
Democratic Republic of Congo
Czech Republic
Iran
South Korea
Russia
Tanzania
Venezuela
Vietnam

EDU3 = COMPLETE

filter(EDU3, Country %in% c("Bolivia","DR Congo","Czech Republic (Czechia)", #DO NOT CHANGE
                            "Iran","South Korea","Russia","Tanzania","Venezuela","Vietnam")) #DO NOT CHANGE

Part 3: World Health Organization Data by Country

Check out the Wikipedia page (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita). On this webpage there is a table that uses OECD data and shows each country’s total spending (public and private) on health per capita in PPP international dollars.

Q1 (4 Points)

Webscrape the appropriate table discussed above from (https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita), create a new variable called Health.Change that is the 2024 value minus the 2020 value, then remove the variables named 2020, 2021, 2022, 2023, and 2024, then remove countries that having missing values due to the fact that we don’t know the health care spending amount for either 2024 or 2020. The final data frame should be called HEALTH and should only contain two variables Locations and Health.Change. You will need to do some data cleaning to perform the calculation.

URL.HEALTH="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita" #DO NOT CHANGE

HEALTH = COMPLETE

str(HEALTH) #DO NOT CHANGE

Part 3: Merging the Datasets

Q1 (2 Points)

Create a data frame called GDP7.EDU3 that performs an inner join of the GDP7 dataset with EDU3.

GDP7.EDU3 = COMPLETE

str(GDP7.EDU3) #DO NOT CHANGE

Q2 (4 Points)

I want you to write code to fix two country names in HEALTH so that when you merge the data from GDP7 into HEALTH you don’t have any missing values. After fixing the two country names in HEALTH, use a right join appropriately to merge the data from GDP7 into HEALTH so you have the three GDP-related variables with the Health.Change variable, but only for OECD countries. I want the merged dataset to be called GDP7.HEALTH. This can be done in multiple lines of code. I don’t care if you do all this with string of code connected via the pipe. I do care that GDP7.HEALTH is created using a right join.

GDP7.HEALTH = COMPLETE

filter(GDP7.HEALTH,Country %in% c("South Korea","Czech Republic (Czechia)"))  #DO NOT CHANGE

Mini Project 5: Connecting Country Level Data

FIRSTNAME LASTNAME

February 04, 2026

Instructions

Introduction

Assignment

Part 1: GDP by Country

Q1 (2 Points)

Q2 (2 Points)

Q3 (3 Points)

Q4 (3 Points)

Q5 (3 Points)

Q6 (2 Points)

Q7 (2 Points)

Part 2: Education Index by Country

Q1 (4 Points)

Q2 (4 Points)

Q3 (2 Points)

Q4 (2 Points)

Q5 (2 Points)

Part 3: World Health Organization Data by Country

Q1 (4 Points)

Part 3: Merging the Datasets

Q1 (2 Points)

Q2 (4 Points)