Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.
Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.
Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.
The rivers of the world are home to numerous fish species whose existence is dependent on the temperature of the water. Specifically for salmonid, read this article by Katharine Carter, an environmental scientist and lover of fish from sunny California. Salmonid varieties all thrive under different temperature ranges and issues arise when river temperatures are outside these ranges.
It’s a cool place and they say it gets colder. You’re bundled up now wait ’til you get older.
As we have been notified, it’s getting hot in here. Global warming is happening, and these fish are getting heatstroke. You can take my snow, but hands off my fish. I need that high quality protein and omega-3 fatty acids for mad gains. Because of these “fake” facts, protectors of the water have become interested in developing predictive models for maximum water temperature.
But the meteor men beg to differ. Judging by the hole in the satellite picture.
Below is a preview of a dataset containing close to a full year of
daily observed maximum air and maximum water temperatures for 31
different rivers in Spain. The variable D
represents the
Julian day and takes values from 1 to 365. The factor
variable L
identifies 31 different measurement station
locations. Variables W
and A
are the maximum
water and air temperatures, respectively. Finally, the variable named
T
for time maintains the order of the data according to
when it was observed. For the sake of our sanity, all days missing
important information have been removed using
na.omit()
.
DATA=na.omit(read_csv("AirWaterTemp.csv"))
DATA$L = as.factor(DATA$L)
glimpse(DATA)
## Rows: 9,857
## Columns: 5
## $ D <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…
## $ L <fct> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103…
## $ W <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10.5, 10.5, 9.9…
## $ A <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5, 5.3, 6.2, …
## $ T <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…
The ice we skate is getting pretty thin. The water’s getting warm so you might as well swim.
What is the point of stealing your tuition, if I cannot grab some
fish? The model below is an expression of a family of polynomial
regression models that use A
and D
to explain
the variation in W
. Every choice of \(I\) and \(J\) leads to a different model. For our
purpose, we would like to consider all possible subset models where
\(I\leq 9\) and \(J\leq 9\). To choose the best model, we
rely on cross-validation (CV) to estimate out-of-sample mean absolute
error (MAE). The best choice of \(I\)
and \(J\) minimize prediction
error.
\[W = a+\sum\limits_{i=1}^I b_i A^i + \sum\limits_{j=1}^J c_j D^j + \epsilon\]
My world’s on fire. How about yours? That’s the way I like it and I’ll never get bored.
Where every you see COMPLETE
, there is code required for
you to write. Lines of code where you see #Do Not Change
are meant to not be touched. The output that results from these lines is
what will be graded. As you go through the assignment, knit the document
to HTML. Make sure you change eval=F
to
eval=T
. Check your final HTML document before you
submit.
In a previous lecture, I demonstrated how to use 10-Fold CV to obtain
an out-of-sample RMSE when \(I=4\) and
\(J=3\). In this assignment, we will
use 25 folds. Using crossv_kfold()
, create a new data frame
called DATA2
that contains two new variables
train
and test
that are list-columns.
Code and Output:
set.seed(216) #Do Not Change
COMPLETE
head(DATA2) #Do Not Change
Create a function called MAE.func()
that takes two
vector arguments called actual and predict
representing actual responses in the data and predicted responses from a
model, respectively. This function should output the Mean Absolute
Error, which measures the overall error between actual and predicted
values.
Code and Output:
x=c(1,3,4) #Do Not Change
y=c(0,0,0) #Do Not Change
COMPLETE
MAE.func(actual=x,predict=y) #Do Not Change
For a specific \(I\) and \(J\), the following function fits the desired polynomial model to a given set of data. This function can be utilized to fit polynomial regression models of varying degrees.
train.model.func=function(data,I,J){
mod=lm(W~poly(A,I)+poly(D,J),data=data)
return(mod)
}
In the code chunk below, we begin by initiating an empty \(9 \times 9\) matrix of missing values
called OUT.MAE
. Each row corresponds to a different choice
of \(I\) and each column corresponds to
a different choice of \(J\). Use a
double loop that performs 25-Fold CV to estimate out-of-sample MAE under
each polynomial model where \(I \in \{1,2,3,
\cdots,9\}\) and \(J \in \{1,2,3,
\cdots,9\}\) and then saves the MAE in the \((I,J)\)-cell of the matrix
OUT.MAE
. Make sure you use the 25-folds created in
DATA2
. Do not worry about the warning/error message
associated with unnest()
.
Code and Output:
OUT.MAE=matrix(NA,9,9) #Do Not Change
COMPLETE
print(OUT.MAE) #Do Not Change
In the code chunk below, we start by making the summarized
information in OUT.MAE
tidy. There are three columns in
OUT.MAE2
that links the cross-validated MAE to all
considered combinations of \(I\) and
\(J\). Change eval=F
to
eval=T
before knitting to HTML.
OUT.MAE2=as.tibble(OUT.MAE) %>%
mutate(I=1:9) %>%
rename(`1`=V1,`2`=V2,`3`=V3,`4`=V4,`5`=V5,`6`=V6,`7`=V7, `8`=V8, `9`=V9) %>%
select(I,everything()) %>%
gather(`1`:`9`,key="J",value="MAE",convert=T) %>%
mutate(I=as.factor(I),J=as.factor(J))
head(OUT.MAE2)
Create a tibble called BEST3.MAE
which contains the rows
in OUT.MAE2
corresponding to the best three models
according to the MAE and sorted from best to worst.
Code and Output:
COMPLETE
BEST3.MAE #Do Not Change
Now, observe the figure below that shows the change in maximum water
temperature (\(W\)) across Julian days
(\(D\)). Change eval=F
to
eval=T
before knitting to HTML.
ggplot(DATA) +
geom_point(aes(x=D,y=W),alpha=0.05,stroke=0) +
theme_minimal() +
xlab("Julian Day")+
ylab("Max Water Temperature")
Using mutate()
, we create a tibble
BEST3.DATA
based off DATA
. The new object
BEST3.DATA
contains 3 columns of predictions under the top
3 models based on the values of I
and J
in
BEST3.MAE
. The three columns of predictions should be given
names First
, Second
, and Third
in
order from best to worst. Change eval=F
to
eval=T
before knitting to HTML.
BEST3.DATA=DATA %>%
mutate(First=predict(lm(W~poly(A,as.numeric(BEST3.MAE$I[1]))+poly(D,as.numeric(BEST3.MAE$J[1])),data=DATA)),
Second=predict(lm(W~poly(A,as.numeric(BEST3.MAE$I[2]))+poly(D,as.numeric(BEST3.MAE$J[2])),data=DATA)),
Third=predict(lm(W~poly(A,as.numeric(BEST3.MAE$I[3]))+poly(D,as.numeric(BEST3.MAE$J[3])),data=DATA)))
Then, I want you to use the pipe %>%
with
gather()
to create a new tibble called
BEST3.DATA2
that gathers all the predictions. A variable
named Model
should contain the values “First”,“Second”, and
“Third”. A variable named Predict
should contain the
predictions corresponding to the appropriate models. In
gather()
, be sure to set factor_key=T
to
ensure that the new variable Model
is a factor variable
with ordered levels logically from “First” to “Third”.
Code and Output:
COMPLETE
head(BEST3.DATA2) #Do Not Change
Create a figure for location 920 that overlays the raw and predicted
maximum water temperatures for the top 3 models given the Julian Day.
The raw data needs to be shown in a scatter plot using
geom_point()
with alpha=0.05
and
stroke=0
. The predictions should be created using
geom_line()
with different colors for each of the Models.
Label the x-axis “Julian Day” and the y-axis “Max Water Temperature”. In
a complete sentence, explain how the predicted maximum water
temperatures differ for the top 3 models when applied to location
920.
Code and Output (3 Points):
ggplot(BEST3.DATA2) +
COMPLETE
Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE
The following two figures show the marginal change in the average
out-of-sample MAE as \(I\) and \(J\) increase. Based on these figures, what
\(I\) and \(J\) would you recommend going forward?
Critically, think about what these figures are telling us. Give a reason
for your answer based on these graphics. Answer the question below the
two figures in complete sentences. Change eval=F
to
eval=T
before knitting to HTML.
Answer: ANSWER IN COMPLETE SENTENCES HERE
If we choose the best \(I\) and
\(J\) based off minimization of
cross-validated MAE, we would choose \(I=2\) and \(J=6\). The MAE of this model can be found
in the matrix OUT.MAE
.
We want to see if the inclusion of an interaction variable \(AD\) in a new model improves the MAE. When
fitting the model, you have to use the I(A*D)
function in
your statement of the model to tell the model to ONLY add the
interaction between \(A\) and \(D\). The model I want you to assess is
below:
\[W = a+\sum\limits_{i=1}^2 b_i A^i + \sum\limits_{j=1}^6 c_j D^j + d(AD) + \epsilon\]
Now, I want you to use the folds created in DATA2
to
perform 25-fold CV to estimate the out-of-sample MAE of the above model.
We want to see if the inclusion of the interaction between maximum air
temperature and Julian day improves the polynomial model. The only
output from your code should be the cross-validated MAE from 25-fold CV.
Finally, in a complete sentence, do you think the interaction variable
improved the polynomial model where \(I=2\) and \(J=6\). Provide actual numerical evidence in
your answer when you make your point.
Code and Output (3 Points):
#
Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE
Up until this point, we have ignored the location variable \(L\). This variable is a categorical
variable currently represented in the data as a factor variable with 31
levels. The 25 folds created in DATA2
each contain the
variable \(L\) which we have ignored.
We want to see if the inclusion of the location variable \(L\) in the polynomial model with \(I=2\) and \(J=6\) improves the model.
Now, I want you to use the folds created in DATA2
to
perform 25-fold CV to estimate the out-of-sample MAE of the polynomial
model with the location variable \(L\).
The only output from your code should be the cross-validated MAE from
25-fold CV. Finally, in a complete sentence, do you think the variable
\(L\) improved the polynomial model
where \(I=2\) and \(J=6\). Provide actual numerical evidence in
your answer when you make your point.
Code and Output (3 Points):
#
Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE
Now, I want you to create a visual comparing the “best” polynomial model to the last two models we evaluated which made small modifications to the “best” polynomial model. You will need to fit all three of these models to the full dataset and use the predict function to obtain fitted/predicted values on the full dataset. Add columns named “Best Poly”, “Best Poly + Interact”, and “Best Poly + L” to save these fitted/predicted values of the three mentioned models in the order we evaluated them. Then, use the gather function to gather the predictions of these three models similar to how we did it before.
Finally, I want you to create a figure for location 920 that overlays
the raw and predicted maximum water temperatures for these 3 models
given the Julian Day. The figure should look very similar to a previous
figure except the legend will look different since we gave our models
here different names. The raw data needs to be shown in a scatter plot
using geom_point()
with alpha=0.05
and
stroke=0
. The predictions should be created using
geom_line()
with different colors for each of the Models.
Label the x-axis “Julian Day” and the y-axis “Max Water
Temperature”.
The only output from this code should be the visual.
Code and Output:
#