Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis. Do not work with other students on this assignment. You are to complete this assignment by yourself.
Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.
Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.
Universities are typically opaque, bureaucratic institutions. To be transparent to tax payers, many public schools, such as the University of North Carolina, openly report salary information. In this assignment, we will analyze this information from 2022 to answer pivotal questions that have endured over the course of time. This salary data for UNC-Chapel Hill faculty and staff was downloaded in CSV format and titled “Salary Data Export.csv”.
To answer all the questions, you will need the R package
tidyverse
to make figures and utilize dplyr
functions.
Make sure the CSV data file is contained in the folder of your
RMarkdown file. First, we start by using the read_csv
function from the readr
package found within the tidyverse.
The code below executes this process by creating a tibble in your R
environment named “salary”.
salary=read_csv("Salary Data Export.csv")
Now, we will explore the information that is contained in this dataset. The code below provides the names of the variables contained in the dataset.
names(salary)
## [1] "University" "Last_Name" "First_Name" "INIT" "Age"
## [6] "Hire_Date" "Category" "Salary" "Department" "Title"
Next, we will examine the type of data contains in these different variables.
str(salary,give.attr=F)
## spc_tbl_ [13,001 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ University: chr [1:13001] "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
## $ Last_Name : chr [1:13001] "Jin" "Tien" "McManus" "Rizvi" ...
## $ First_Name: chr [1:13001] "Guifeng" "Hsiao-Chuan" "James" "Imran" ...
## $ INIT : chr [1:13001] NA NA "M" NA ...
## $ Age : num [1:13001] 61 58 60 47 53 35 66 46 44 40 ...
## $ Hire_Date : chr [1:13001] "1/27/1997" "11/1/2008" "1/13/2020" "7/1/2018" ...
## $ Category : chr [1:13001] "IT Database Professional, Database Administrator" "IT Database Professional, Data Administrator" "Research Professional, Mathematics / Computing / Statistics" "Assistant Professor" ...
## $ Salary : num [1:13001] 99998 99998 99988 99988 99975 ...
## $ Department: chr [1:13001] "Carolina Population Center" "Carolina Population Center" "Renaissance Computing Inst" "Biomedical Engineering-UG" ...
## $ Title : chr [1:13001] "Applications Analyst" "Applications Analyst" "Research Scientist" "Assistant Professor" ...
You will notice that the variable “Hire_Date” is recorded as a character. The following code will first modify the original dataset to change this to a date variable with the format mm/dd/yyyy. Then, we will remove the hyphens to create a numeric variable as yyyymmdd. Finally, in the spirit of tidyverse, we will convert this data frame to a tibble.
salary$Hire_Date_Pre=as.Date(salary$Hire_Date, format="%m/%d/%Y")
salary$Hire_Date=as.numeric(gsub("-","",salary$Hire_Date_Pre))
salary=as_tibble(salary)
Now, we will use head()
to view of first five rows and
the modifications made to the original data. The rest of the assignment
will extend off this modified dataset named salary
which by
now should be in your global environment.
head(salary,5)
## # A tibble: 5 × 11
## University Last_Name First_Name INIT Age Hire_Date Category Salary
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 UNC-CH Jin Guifeng <NA> 61 19970127 IT Database Pro… 99998
## 2 UNC-CH Tien Hsiao-Chuan <NA> 58 20081101 IT Database Pro… 99998
## 3 UNC-CH McManus James M 60 20200113 Research Profes… 99988
## 4 UNC-CH Rizvi Imran <NA> 47 20180701 Assistant Profe… 99988.
## 5 UNC-CH O'HARA RICHARD J 53 19931213 Research Profes… 99975
## # ℹ 3 more variables: Department <chr>, Title <chr>, Hire_Date_Pre <date>
Create a new dataset named salary2
that only contains
the following variables:
“Last_Name”
“First_Name”
“Age”
“Hire_Date”
“Category”
“Salary”
“Department”
Then, use the names()
function to display the variable
names of salary2
.
#
Now, we modify salary2
. Rename the variables
“Last_Name”,“First_Name”, and “Hire_Date” to “Last”,“First”, and “Hire”,
respectively. Do this for a new dataset called salary3
and
use names()
to display the variable names of
salary3
.
#
Now, we modify salary3
. Create a new variable called
“HireYear” that only contains the first four digits of the variable
“Hire” in a new dataset named salary4
. Use the function
str()
to ensure that your new variable “HireYear” reports
the year of the date that the employee was hired. The variable
“HireYear” should be numeric.
#
Now, we modify salary4
. Create a new variable called
“YrsEmployed” which reports the number of full years the employee has
worked at UNC. Use the current year and assume that all employees are
hired January 1. Create a new dataset named salary5
and
again use str()
to display the variables in
salary5
.
#
Now, we modify salary5
to create our final dataset named
salary.final
. Use the pipe %>%
to make the
following changes:
Drop the variables “Hire”, “HireYear”, “First”, and “Last”.
Sort the observations first by “YrsEmployed” and then by “Salary”, both in ascending order from smallest to largest.
Rearrange the variables so that “YrsEmployed” and “Salary” are the first two variables in the dataset, in that order, without removing any of the other variables.
After you have used the %>%
to make these changes,
use the function head()
to display the first 12 rows of
salary.final
.
#
In these questions,find the answer using code, and then answer the question using complete sentences below the code. I want to see your answer calculated/presented in your code output and in your writing.
What is the difference between the mean salary of employees in my department and the mean salary of employees in Economics?
Code (2 Points):
#
Answer (1 Point): (Place Answer Here in Complete Sentences)
What proportion of all assistant professors make more than $90,000?
Code (2 Points):
#
Answer (1 Points): (Place Answer Here in Complete Sentences)
The categories “Instructor” and “Lecturer” represent employees who are primarily hired for teaching. For all the departments that have at least 3 employees who are primarily hired for teaching, I want you to calculate the total departmental spending for employees who are primarily hired for teaching. Of the departments with at least 3 instructors or lecturers, which department spends the most money for these teaching position and which department spends the least money for these teaching positions. I also want to know how many employees are primarily hired for teaching in each of these two departments. Your code output should show a table that shows the two departments, the number of employees hired for teaching, and the total spending.
Code (4 Points):
#
Answer (1 Points): (Place Answer Here in Complete Sentences)
In general, it is not good to build an analysis off summarized data. Typically, you would want to analyze relationships and build models off your raw data where each observation is an individual. The future questions are based on summarized data. By doing this, I am unable to make inference or draw conclusions about the population of employees since each observation (row) is no longer a person. Think about this as you answer the following questions.
Based off the data in salary.final
, create a grouped
summary based off combinations of “Department” and “Category”. Call the
new tibble dept_cat_summary
. Your summarized tibble,
dept_cat_summary
, should report all of the following
statistics with corresponding variable names in the following order.
“n” = number of employees for each combination
“salary” = average salary for each combination
“yrs” = average years employed for each combination
“age_vs_salary” = correlation between the age and salary for each combination
In the process, make sure you use ungroup()
with the
pipe %>%
to release the grouping so future work is no
longer group specific. Also, the cor()
can be used to
calculate the correlation between two variables.
Following the creation of dept_cat_summary
, prove that
your code showing the first 12 rows.
#
Using the summarized data in dept_cat_summary
, which 5
“Department” and “Category” combinations have the strongest correlation
between age and salary. Only consider “Department” and “Category”
combinations that have at least 10 total employees. You can write as
much code as you want, but your output should show a table (tibble or
data frame) that only lists the “Department”, “Category”, and the
“age_vs_salary” correlation for the 5 combinations you identified.
Code:
#
Create a scatter plot using geom_point()
along with
fitted linear regression lines using geom_smooth
showing
the linear relationship between average salary of Professors and the
average years employed of Professors. Put the average years employed on
the x-axis of your plot. For this plot, use the summarized data
in dept_cat_summary
and exclude Departments where there are
less than 5 professors.
Following the plot, please explain what this plot suggests about the relationship between the average years employed and the average salary of professors. Make reference to the figure and use descriptive adjectives (i.e. “strong”, “weak”, etc.) and terms (i.e. “positive”, “negative”, etc.) that are appropriate for discussing linear relationships. When consulting the graphic, think about or research online what the shaded region around the regression line indicates.
Code and Figure (2 Points):
#
Answer (2 Points): (Place Answer Here in Complete Sentences)
In dept_cat_summary
, there are 750 unique departments.
You can verify this by using
length(unique(dept_title_summary$Department))
. I want you
to select only 3 academic departments not previously discussed or
outputted in the assignment, and display the scatter plots and fitted
linear regression lines representing the relationship between the
average years employed and the average salary. Typically, the variable
we would want to predict is on the y-axis, so assume that we want to
predict the average salary. Use facet_wrap
so that each
department has its own figure. Remove the shading of the confidence
intervals for the regression lines. For this plot, use the
summarized data in dept_cat_summary
.
Code and Figure:
#
Most academic departments have at least one employee in each of the
three following categories: “Professor”,“Associate Professor”, and
“Assistant Professor”. I want you to choose 3 academic departments that
were not previously discussed and not used in your previous question
(Q12). These 3 academic departments have to have at least one employee
in each of the three categories mentioned. Then, I want you to construct
a tile plot where “Department” is on the x-axis and “Category” is on the
y-axis. The color of the tiles should be based off the average salary of
all employees in each combination of “Department” and “Category”. In the
tile, plot I only want to see the three departments you chose on the
x-axis and the three categories I chose on the y-axis. I don’t want to
see every job category. If you construct the plot and discover one or
more of the tiles are blank since one or more of your departments is
missing representation in one of the three job categories, then choose a
different academic department. The summarized data in
dept_cat_summary
will be helpful here.
#