Instructions

Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.

Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.

Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.

Introduction

The rivers of the world are home to numerous fish species whose existence is dependent on the temperature of the water. Specifically for salmonid, read this article by Katharine Carter, an environmental scientist and lover of fish from sunny California. Salmonid varieties all thrive under different temperature ranges and issues arise when river temperatures are outside these ranges.

It’s a cool place and they say it gets colder. You’re bundled up now wait ’til you get older.

As we have been notified, it’s getting hot in here. Global warming is happening, and these fish are getting heatstroke. You can take my snow, but hands off my fish. I need that high quality protein and omega-3 fatty acids for mad gains. Because of these “fake” facts, protectors of the water have become interested in developing predictive models for maximum water temperature.

But the meteor men beg to differ. Judging by the hole in the satellite picture.

Below is a preview of a dataset containing close to a full year of daily observed maximum air and maximum water temperatures for 31 different rivers in Spain. The variable D represents the Julian day and takes values from 1 to 365. The factor variable L identifies 31 different measurement station locations. Variables W and A are the maximum water and air temperatures, respectively. Finally, the variable named T for time maintains the order of the data according to when it was observed. For the sake of our sanity, all days missing important information have been removed using na.omit().

DATA=na.omit(read_csv("AirWaterTemp.csv"))
DATA$L = as.factor(DATA$L)
glimpse(DATA)

## Rows: 9,857
## Columns: 5
## $ D <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…
## $ L <fct> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103…
## $ W <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10.5, 10.5, 9.9…
## $ A <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5, 5.3, 6.2, …
## $ T <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…

The ice we skate is getting pretty thin. The water’s getting warm so you might as well swim.

What is the point of stealing your tuition, if I cannot grab some fish? The model below is an expression of a family of polynomial regression models that use A and D to explain the variation in W. Every choice of \(I\) and \(J\) leads to a different model. For our purpose, we would like to consider all possible subset models where \(I\leq 9\) and \(J\leq 9\). To choose the best model, we rely on cross-validation (CV) to estimate out-of-sample mean absolute error (MAE). The best choice of \(I\) and \(J\) minimize prediction error.

\[W = a+\sum\limits_{i=1}^I b_i A^i + \sum\limits_{j=1}^J c_j D^j + \epsilon\]

My world’s on fire. How about yours? That’s the way I like it and I’ll never get bored.

Where every you see COMPLETE, there is code required for you to write. Lines of code where you see #Do Not Change are meant to not be touched. The output that results from these lines is what will be graded. As you go through the assignment, knit the document to HTML. Make sure you change eval=F to eval=T. Check your final HTML document before you submit.

Assignment

Part 1: Cross-Validated MAE for Each Choice of \(I\) and \(J\)

Q1 (2 Points)

In a previous lecture, I demonstrated how to use 10-Fold CV to obtain an out-of-sample RMSE when \(I=4\) and \(J=3\). In this assignment, we will use 25 folds. Using crossv_kfold(), create a new data frame called DATA2 that contains two new variables train and test that are list-columns.

Code and Output:

set.seed(216) #Do Not Change

COMPLETE

head(DATA2) #Do Not Change

Q2 (2 Points)

Create a function called MAE.func() that takes two vector arguments called actual and predict representing actual responses in the data and predicted responses from a model, respectively. This function should output the Mean Absolute Error, which measures the overall error between actual and predicted values.

Code and Output:

x=c(1,3,4) #Do Not Change
y=c(0,0,0) #Do Not Change

COMPLETE

MAE.func(actual=x,predict=y) #Do Not Change

Q3 (8 Points)

For a specific \(I\) and \(J\), the following function fits the desired polynomial model to a given set of data. This function can be utilized to fit polynomial regression models of varying degrees.

train.model.func=function(data,I,J){
  mod=lm(W~poly(A,I)+poly(D,J),data=data)
  return(mod)
}

In the code chunk below, we begin by initiating an empty \(9 \times 9\) matrix of missing values called OUT.MAE. Each row corresponds to a different choice of \(I\) and each column corresponds to a different choice of \(J\). Use a double loop that performs 25-Fold CV to estimate out-of-sample MAE under each polynomial model where \(I \in \{1,2,3, \cdots,9\}\) and \(J \in \{1,2,3, \cdots,9\}\) and then saves the MAE in the \((I,J)\)-cell of the matrix OUT.MAE. Make sure you use the 25-folds created in DATA2. Do not worry about the warning/error message associated with unnest().

Code and Output:

OUT.MAE=matrix(NA,9,9) #Do Not Change

COMPLETE

print(OUT.MAE) #Do Not Change

Part 2: Comparing Top 3 Models

In the code chunk below, we start by making the summarized information in OUT.MAE tidy. There are three columns in OUT.MAE2 that links the cross-validated MAE to all considered combinations of \(I\) and \(J\). Change eval=F to eval=T before knitting to HTML.

OUT.MAE2=as.tibble(OUT.MAE) %>% 
  mutate(I=1:9) %>% 
  rename(`1`=V1,`2`=V2,`3`=V3,`4`=V4,`5`=V5,`6`=V6,`7`=V7, `8`=V8, `9`=V9) %>%
  select(I,everything()) %>%
  gather(`1`:`9`,key="J",value="MAE",convert=T) %>%
  mutate(I=as.factor(I),J=as.factor(J))
head(OUT.MAE2)

Q1 (2 Points)

Create a tibble called BEST3.MAE which contains the rows in OUT.MAE2 corresponding to the best three models according to the MAE and sorted from best to worst.

Code and Output:

COMPLETE

BEST3.MAE #Do Not Change

Q2 (3 Points)

Now, observe the figure below that shows the change in maximum water temperature (\(W\)) across Julian days (\(D\)). Change eval=F to eval=T before knitting to HTML.

ggplot(DATA) +
  geom_point(aes(x=D,y=W),alpha=0.05,stroke=0) +
  theme_minimal() +
  xlab("Julian Day")+
  ylab("Max Water Temperature")

Using mutate(), we create a tibble BEST3.DATA based off DATA. The new object BEST3.DATA contains 3 columns of predictions under the top 3 models based on the values of I and J in BEST3.MAE. The three columns of predictions should be given names First, Second, and Third in order from best to worst. Change eval=F to eval=T before knitting to HTML.

BEST3.DATA=DATA %>%
            mutate(First=predict(lm(W~poly(A,as.numeric(BEST3.MAE$I[1]))+poly(D,as.numeric(BEST3.MAE$J[1])),data=DATA)),
                   Second=predict(lm(W~poly(A,as.numeric(BEST3.MAE$I[2]))+poly(D,as.numeric(BEST3.MAE$J[2])),data=DATA)),
                   Third=predict(lm(W~poly(A,as.numeric(BEST3.MAE$I[3]))+poly(D,as.numeric(BEST3.MAE$J[3])),data=DATA)))

Then, I want you to use the pipe %>% with gather() to create a new tibble called BEST3.DATA2 that gathers all the predictions. A variable named Model should contain the values “First”,“Second”, and “Third”. A variable named Predict should contain the predictions corresponding to the appropriate models. In gather(), be sure to set factor_key=T to ensure that the new variable Model is a factor variable with ordered levels logically from “First” to “Third”.

Code and Output:

COMPLETE

head(BEST3.DATA2) #Do Not Change

Q3 (5 Points)

Create a figure for location 920 that overlays the raw and predicted maximum water temperatures for the top 3 models given the Julian Day. The raw data needs to be shown in a scatter plot using geom_point() with alpha=0.05 and stroke=0 . The predictions should be created using geom_line() with different colors for each of the Models. Label the x-axis “Julian Day” and the y-axis “Max Water Temperature”. In a complete sentence, explain how the predicted maximum water temperatures differ for the top 3 models when applied to location 920.

Code and Output (3 Points):

ggplot(BEST3.DATA2) +
  COMPLETE

Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE

Q4 (2 Points)

The following two figures show the marginal change in the average out-of-sample MAE as \(I\) and \(J\) increase. Based on these figures, what \(I\) and \(J\) would you recommend going forward? Critically, think about what these figures are telling us. Give a reason for your answer based on these graphics. Answer the question below the two figures in complete sentences. Change eval=F to eval=T before knitting to HTML.

Answer: ANSWER IN COMPLETE SENTENCES HERE

Part 3: Additional Models for Comparison

Q1 (5 Points)

If we choose the best \(I\) and \(J\) based off minimization of cross-validated MAE, we would choose \(I=2\) and \(J=6\). The MAE of this model can be found in the matrix OUT.MAE.

We want to see if the inclusion of an interaction variable \(AD\) in a new model improves the MAE. When fitting the model, you have to use the I(A*D) function in your statement of the model to tell the model to ONLY add the interaction between \(A\) and \(D\). The model I want you to assess is below:

\[W = a+\sum\limits_{i=1}^2 b_i A^i + \sum\limits_{j=1}^6 c_j D^j + d(AD) + \epsilon\]

Now, I want you to use the folds created in DATA2to perform 25-fold CV to estimate the out-of-sample MAE of the above model. We want to see if the inclusion of the interaction between maximum air temperature and Julian day improves the polynomial model. The only output from your code should be the cross-validated MAE from 25-fold CV. Finally, in a complete sentence, do you think the interaction variable improved the polynomial model where \(I=2\) and \(J=6\). Provide actual numerical evidence in your answer when you make your point.

Code and Output (3 Points):

Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE

Q2 (5 Points)

Up until this point, we have ignored the location variable \(L\). This variable is a categorical variable currently represented in the data as a factor variable with 31 levels. The 25 folds created in DATA2 each contain the variable \(L\) which we have ignored. We want to see if the inclusion of the location variable \(L\) in the polynomial model with \(I=2\) and \(J=6\) improves the model.

Now, I want you to use the folds created in DATA2to perform 25-fold CV to estimate the out-of-sample MAE of the polynomial model with the location variable \(L\). The only output from your code should be the cross-validated MAE from 25-fold CV. Finally, in a complete sentence, do you think the variable \(L\) improved the polynomial model where \(I=2\) and \(J=6\). Provide actual numerical evidence in your answer when you make your point.

Code and Output (3 Points):

Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE

Q3 (6 Points):

Now, I want you to create a visual comparing the “best” polynomial model to the last two models we evaluated which made small modifications to the “best” polynomial model. You will need to fit all three of these models to the full dataset and use the predict function to obtain fitted/predicted values on the full dataset. Add columns named “Best Poly”, “Best Poly + Interact”, and “Best Poly + L” to save these fitted/predicted values of the three mentioned models in the order we evaluated them. Then, use the gather function to gather the predictions of these three models similar to how we did it before.

Finally, I want you to create a figure for location 920 that overlays the raw and predicted maximum water temperatures for these 3 models given the Julian Day. The figure should look very similar to a previous figure except the legend will look different since we gave our models here different names. The raw data needs to be shown in a scatter plot using geom_point() with alpha=0.05 and stroke=0 . The predictions should be created using geom_line() with different colors for each of the Models. Label the x-axis “Julian Day” and the y-axis “Max Water Temperature”.

The only output from this code should be the visual.

Code and Output:

Analysis 4: Model Selection Via K-Fold CV

FIRSTNAME LASTNAME

May 14, 2025

Instructions

Introduction

Assignment

Part 1: Cross-Validated MAE for Each Choice of \(I\) and \(J\)

Q1 (2 Points)

Q2 (2 Points)

Q3 (8 Points)

Part 2: Comparing Top 3 Models

Q1 (2 Points)

Q2 (3 Points)

Q3 (5 Points)

Q4 (2 Points)

Part 3: Additional Models for Comparison

Q1 (5 Points)

Q2 (5 Points)

Q3 (6 Points):