Instructions

Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis. Do not work with other students on this assignment. You are to complete this assignment by yourself.

Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.

Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.

Introduction

Universities are typically opaque, bureaucratic institutions. To be transparent to tax payers, many public schools, such as the University of North Carolina, openly report salary information. In this assignment, we will analyze this information from 2022 to answer pivotal questions that have endured over the course of time. This salary data for UNC-Chapel Hill faculty and staff was downloaded in CSV format and titled “Salary Data Export.csv”.

To answer all the questions, you will need the R package tidyverse to make figures and utilize dplyr functions.

Data Information

Make sure the CSV data file is contained in the folder of your RMarkdown file. First, we start by using the read_csv function from the readr package found within the tidyverse. The code below executes this process by creating a tibble in your R environment named “salary”.

salary=read_csv("Salary Data Export.csv")

Now, we will explore the information that is contained in this dataset. The code below provides the names of the variables contained in the dataset.

names(salary)
##  [1] "University" "Last_Name"  "First_Name" "INIT"       "Age"       
##  [6] "Hire_Date"  "Category"   "Salary"     "Department" "Title"

Next, we will examine the type of data contains in these different variables.

str(salary,give.attr=F)
## spc_tbl_ [13,001 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ University: chr [1:13001] "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
##  $ Last_Name : chr [1:13001] "Jin" "Tien" "McManus" "Rizvi" ...
##  $ First_Name: chr [1:13001] "Guifeng" "Hsiao-Chuan" "James" "Imran" ...
##  $ INIT      : chr [1:13001] NA NA "M" NA ...
##  $ Age       : num [1:13001] 61 58 60 47 53 35 66 46 44 40 ...
##  $ Hire_Date : chr [1:13001] "1/27/1997" "11/1/2008" "1/13/2020" "7/1/2018" ...
##  $ Category  : chr [1:13001] "IT Database Professional, Database Administrator" "IT Database Professional, Data Administrator" "Research Professional, Mathematics / Computing / Statistics" "Assistant Professor" ...
##  $ Salary    : num [1:13001] 99998 99998 99988 99988 99975 ...
##  $ Department: chr [1:13001] "Carolina Population Center" "Carolina Population Center" "Renaissance Computing Inst" "Biomedical Engineering-UG" ...
##  $ Title     : chr [1:13001] "Applications Analyst" "Applications Analyst" "Research Scientist" "Assistant Professor" ...

You will notice that the variable “Hire_Date” is recorded as a character. The following code will first modify the original dataset to change this to a date variable with the format mm/dd/yyyy. Then, we will remove the hyphens to create a numeric variable as yyyymmdd. Finally, in the spirit of tidyverse, we will convert this data frame to a tibble.

salary$Hire_Date_Pre=as.Date(salary$Hire_Date, format="%m/%d/%Y")
salary$Hire_Date=as.numeric(gsub("-","",salary$Hire_Date_Pre))
salary=as_tibble(salary)

Now, we will use head() to view of first five rows and the modifications made to the original data. The rest of the assignment will extend off this modified dataset named salary which by now should be in your global environment.

head(salary,5)
## # A tibble: 5 × 11
##   University Last_Name First_Name  INIT    Age Hire_Date Category         Salary
##   <chr>      <chr>     <chr>       <chr> <dbl>     <dbl> <chr>             <dbl>
## 1 UNC-CH     Jin       Guifeng     <NA>     61  19970127 IT Database Pro… 99998 
## 2 UNC-CH     Tien      Hsiao-Chuan <NA>     58  20081101 IT Database Pro… 99998 
## 3 UNC-CH     McManus   James       M        60  20200113 Research Profes… 99988 
## 4 UNC-CH     Rizvi     Imran       <NA>     47  20180701 Assistant Profe… 99988.
## 5 UNC-CH     O'HARA    RICHARD     J        53  19931213 Research Profes… 99975 
## # ℹ 3 more variables: Department <chr>, Title <chr>, Hire_Date_Pre <date>

Assignment

Part 1: Reducing the Data to a Smaller Set of Interest

Q1 (2 Points)

Create a new dataset named salary2 that only contains the following variables:

  • “Last_Name”

  • “First_Name”

  • “Age”

  • “Hire_Date”

  • “Category”

  • “Salary”

  • “Department”

Then, use the names() function to display the variable names of salary2.

#

Q2 (2 Points)

Now, we modify salary2. Rename the variables “Last_Name”,“First_Name”, and “Hire_Date” to “Last”,“First”, and “Hire”, respectively. Do this for a new dataset called salary3 and use names() to display the variable names of salary3.

#

Q3 (2 Points)

Now, we modify salary3. Create a new variable called “HireYear” that only contains the first four digits of the variable “Hire” in a new dataset named salary4. Use the function str() to ensure that your new variable “HireYear” reports the year of the date that the employee was hired. The variable “HireYear” should be numeric.

#

Q4 (2 points)

Now, we modify salary4. Create a new variable called “YrsEmployed” which reports the number of full years the employee has worked at UNC. Use the current year and assume that all employees are hired January 1. Create a new dataset named salary5 and again use str() to display the variables in salary5.

#

Q5 (3 points)

Now, we modify salary5 to create our final dataset named salary.final. Use the pipe %>% to make the following changes:

  • Drop the variables “Hire”, “HireYear”, “First”, and “Last”.

  • Sort the observations first by “YrsEmployed” and then by “Salary”, both in ascending order from smallest to largest.

  • Rearrange the variables so that “YrsEmployed” and “Salary” are the first two variables in the dataset, in that order, without removing any of the other variables.

After you have used the %>% to make these changes, use the function head() to display the first 12 rows of salary.final.

#

Part 2: Answering Questions Based on All Data

In these questions,find the answer using code, and then answer the question using complete sentences below the code. I want to see your answer calculated/presented in your code output and in your writing.

Q6 (3 Points)

What is the difference between the mean salary of employees in my department and the mean salary of employees in Economics?

Code (2 Points):

#

Answer (1 Point): (Place Answer Here in Complete Sentences)

Q7 (3 Points)

What proportion of all assistant professors make more than $90,000?

Code (2 Points):

#

Answer (1 Points): (Place Answer Here in Complete Sentences)

Q8 (5 Points)

The categories “Instructor” and “Lecturer” represent employees who are primarily hired for teaching. For all the departments that have at least 3 employees who are primarily hired for teaching, I want you to calculate the total departmental spending for employees who are primarily hired for teaching. Of the departments with at least 3 instructors or lecturers, which department spends the most money for these teaching position and which department spends the least money for these teaching positions. I also want to know how many employees are primarily hired for teaching in each of these two departments. Your code output should show a table that shows the two departments, the number of employees hired for teaching, and the total spending.

Code (4 Points):

#

Answer (1 Points): (Place Answer Here in Complete Sentences)

Part 3: Answering Questions Based on Summarized Data

In general, it is not good to build an analysis off summarized data. Typically, you would want to analyze relationships and build models off your raw data where each observation is an individual. The future questions are based on summarized data. By doing this, I am unable to make inference or draw conclusions about the population of employees since each observation (row) is no longer a person. Think about this as you answer the following questions.

Q9 (4 Points)

Based off the data in salary.final, create a grouped summary based off combinations of “Department” and “Category”. Call the new tibble dept_cat_summary. Your summarized tibble, dept_cat_summary, should report all of the following statistics with corresponding variable names in the following order.

  • “n” = number of employees for each combination

  • “salary” = average salary for each combination

  • “yrs” = average years employed for each combination

  • “age_vs_salary” = correlation between the age and salary for each combination

In the process, make sure you use ungroup() with the pipe %>% to release the grouping so future work is no longer group specific. Also, the cor() can be used to calculate the correlation between two variables.

Following the creation of dept_cat_summary, prove that your code showing the first 12 rows.

#

Q10 (4 Points)

Using the summarized data in dept_cat_summary, which 5 “Department” and “Category” combinations have the weakest correlation between age and salary. Only consider “Department” and “Category” combinations that have at least 10 total employees. You can write as much code as you want, but your output should show a table (tibble or data frame) that only lists the “Department”, “Category”, and the “age_vs_salary” correlation for the 5 combinations you identified.

Code:

#

Q11 (4 points)

Create a scatter plot using geom_point() along with fitted linear regression lines using geom_smooth showing the linear relationship between average salary of Professors and the average years employed of Professors. Put the average years employed on the x-axis of your plot. For this plot, use the summarized data in dept_cat_summary and exclude Departments where there are less than 5 professors.

Following the plot, please explain what this plot suggests about the relationship between the average years employed and the average salary of professors. Make reference to the figure and use descriptive adjectives (i.e. “strong”, “weak”, etc.) and terms (i.e. “positive”, “negative”, etc.) that are appropriate for discussing linear relationships. When consulting the graphic, think about or research online what the shaded region around the regression line indicates.

Code and Figure (2 Points):

#

Answer (2 Points): (Place Answer Here in Complete Sentences)

Q12 (4 Points)

In dept_cat_summary, there are 750 unique departments. You can verify this by using length(unique(dept_title_summary$Department)). I want you to select only 3 academic departments not previously discussed or outputted in the assignment, and display the scatter plots and fitted linear regression lines representing the relationship between the average years employed and the average salary. Typically, the variable we would want to predict is on the y-axis, so assume that we want to predict the average salary. Use facet_wrap so that each department has its own figure. Remove the shading of the confidence intervals for the regression lines. For this plot, use the summarized data in dept_cat_summary.

Code and Figure:

#

Q13 (3 Points)

Most academic departments have at least one employee in each of the three following categories: “Professor”,“Associate Professor”, and “Assistant Professor”. I want you to choose 3 academic departments that were not previously discussed and not used in your previous question (Q12). These 3 academic departments have to have at least one employee in each of the three categories mentioned. Then, I want you to construct a tile plot where “Department” is on the x-axis and “Category” is on the y-axis. The color of the tiles should be based off the average salary of all employees in each combination of “Department” and “Category”. In the tile, plot I only want to see the three departments you chose on the x-axis and the three categories I chose on the y-axis. I don’t want to see every job category. If you construct the plot and discover one or more of the tiles are blank since one or more of your departments is missing representation in one of the three job categories, then choose a different academic department. The summarized data in dept_cat_summary will be helpful here.

#