Introduction

We will continue our work with daily water temperature and air temperature data observed for 31 rivers in Spain. In the preview below, W represents the daily maximum water temperature and A represents the daily maximum air temperature. The data contain roughly a full year of daily observations for each of the 31 rivers.

## # A tibble: 6 x 8
##   JULIAN_DAY  YEAR     L     W     A  TIME MONTH   DAY
##        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1          1  2003   103  14.2  21.2     1     1     1
## 2          2  2003   103  14.4  16.8     2     1     2
## 3          3  2003   103  14.4  15.4     3     1     3
## 4          4  2003   103  10.9  10.8     4     1     4
## 5          5  2003   103  10.8  11.7     5     1     5
## 6          6  2003   103  10.7  12.4     6     1     6

Using the data, we seek to identify the best model for predicting the maximum water temperature given the maximum air temperature. Previously, we randomly selected 3 rivers to act as a test set and evaluated all models on that test set alone. In this tutorial, we explore approaches that ensure all data are used for both model training and model testing.

Here we apply helpful functions from the purrr, modelr, and broom packages. See the following links for helpful articles on performing cross-validation within the tidyverse: Link 1, Link 2, and Link 3.
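Before running the chunks below, the packages can be loaded in one setup chunk (this assumes tidyverse, modelr, and broom are already installed):

```r
library(tidyverse)  # dplyr, tidyr, purrr, ggplot2 (group_by, nest, map, ...)
library(modelr)     # crossv_kfold() for k-fold splits
library(broom)      # tidy(), glance(), augment() for model summaries
```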

Part 1: Intelligent Use of Locations for Cross-Validation

Chunk 1: List-Column of Data Split By Location

NEST.DATA = DATA %>% group_by(L) %>% nest()
head(NEST.DATA)
## # A tibble: 6 x 2
## # Groups:   L [6]
##       L data              
##   <dbl> <list>            
## 1   103 <tibble [365 x 7]>
## 2   105 <tibble [366 x 7]>
## 3   107 <tibble [365 x 7]>
## 4   111 <tibble [365 x 7]>
## 5   112 <tibble [366 x 7]>
## 6   118 <tibble [366 x 7]>

Chunk 2: Combining filter() with unnest() To Split Data

NEST.DATA %>% filter(L==103) %>% unnest() %>% glimpse()
## Rows: 365
## Columns: 8
## Groups: L [1]
## $ L          <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 1...
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ YEAR       <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
## $ W          <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10....
## $ A          <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5...
## $ TIME       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ MONTH      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DAY        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
NEST.DATA %>% filter(L!=103) %>% unnest() %>% glimpse()
## Rows: 10,972
## Columns: 8
## Groups: L [30]
## $ L          <dbl> 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 1...
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ YEAR       <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 20...
## $ W          <dbl> 7.9, 7.8, 8.7, 9.0, 9.4, 10.4, 9.9, 9.8, 10.3, NA, 9.5, ...
## $ A          <dbl> 14.0, 15.0, 12.4, 14.6, 18.0, 18.2, 12.0, 14.2, 16.3, NA...
## $ TIME       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ MONTH      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DAY        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...

Chunk 3: Fit Train Data, Predict Test Data, and Save Results

DATA2=DATA
DATA2$linpred=NA

TEST = NEST.DATA %>% filter(L==103) %>% unnest()
TRAIN  = NEST.DATA %>% filter(L!=103) %>% unnest()

linmod=lm(W~A,data=TRAIN)
linmodpred=predict(linmod,newdata=TEST)

DATA2$linpred[which(DATA2$L==103)]=linmodpred
head(DATA2)
## # A tibble: 6 x 9
##   JULIAN_DAY  YEAR     L     W     A  TIME MONTH   DAY linpred
##        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1          1  2003   103  14.2  21.2     1     1     1    17.3
## 2          2  2003   103  14.4  16.8     2     1     2    14.4
## 3          3  2003   103  14.4  15.4     3     1     3    13.5
## 4          4  2003   103  10.9  10.8     4     1     4    10.5
## 5          5  2003   103  10.8  11.7     5     1     5    11.1
## 6          6  2003   103  10.7  12.4     6     1     6    11.5

Chunk 4: Create a Loop to Iterate Process for Each Location

# Leave-one-location-out: train on all other rivers, predict the held-out river
for(k in unique(DATA2$L)){
  TEST  = NEST.DATA %>% filter(L==k) %>% unnest()
  TRAIN = NEST.DATA %>% filter(L!=k) %>% unnest()
  linmod = lm(W~A, data=TRAIN)
  DATA2$linpred[which(DATA2$L==k)] = predict(linmod, newdata=TEST)
}

Chunk 5: Calculate Cross-Validated RMSE

RMSE.func=function(actual,predict){
  sqrt(mean((actual-predict)^2, na.rm=TRUE))  # na.rm drops days with missing values
}
RMSE.func(actual=DATA2$W,predict=DATA2$linpred)
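As a quick sanity check of the RMSE formula sqrt(mean((actual - predict)^2)), here is a toy base-R example with hand-picked values (not the river data):

```r
# One prediction off by 2, the rest exact:
actual  = c(1, 2, 3, 4)
predict = c(1, 2, 3, 6)
sqrt(mean((actual - predict)^2, na.rm = TRUE))  # sqrt(4/4) = 1
```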

Part 2: K-Fold CV for Polynomial Model Evaluation

Chunk 1: Exploratory Figures

Chunk 2: Polynomial Fitting

polymodel=lm(W~poly(A,4)+poly(JULIAN_DAY,3),data=na.omit(DATA))
tidy(polymodel)
glance(polymodel)
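For intuition about what poly() is doing, here is a self-contained base-R sketch on simulated data (all names and values below are illustrative, not from the river data):

```r
# Fit a quadratic via orthogonal polynomial terms:
set.seed(1)
x   = seq(0, 10, length.out = 100)
y   = 2 + 0.5*x - 0.1*x^2 + rnorm(100, sd = 0.2)
fit = lm(y ~ poly(x, 2))
length(coef(fit))  # intercept + 2 polynomial terms = 3
```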

Chunk 3: Splitting Data for Cross-Validation

DATA3=na.omit(DATA) %>% crossv_kfold(10)
head(DATA3)
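Conceptually, crossv_kfold(10) assigns each row to one of 10 folds; each fold serves once as the test set while the other nine form the training set. A base-R sketch of that fold assignment (n is illustrative):

```r
# Balanced random fold labels: each row gets exactly one of 10 folds.
set.seed(1)
n    = 100
fold = sample(rep(1:10, length.out = n))
table(fold)  # 10 rows per fold when n = 100
```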

Chunk 4: Fitted Models and Predictions

train.model.func=function(data,i,j){
  # Fit W as a degree-i polynomial in A plus a degree-j polynomial in JULIAN_DAY
  mod=lm(W~poly(A,i)+poly(JULIAN_DAY,j),data=data)
  return(mod)
}

i=4
j=3

DATA4=DATA3 %>% 
       mutate(tr.model=map(train,train.model.func,i=i,j=j))
head(DATA4)

Chunk 5: Predicted Values and Cross-Validated RMSE

DATA4.PREDICT = DATA4 %>% 
          mutate(predict=map2(test,tr.model,~augment(.y,newdata=.x))) %>%
          select(predict) %>%
          unnest()
head(DATA4.PREDICT)
RMSE.func(actual=DATA4.PREDICT$W,predict=DATA4.PREDICT$.fitted)
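The same end-to-end idea can be exercised on simulated data. This self-contained base-R sketch (toy data, a single predictor, hypothetical degree range 1-6) runs 10-fold CV for each candidate degree and compares cross-validated RMSE:

```r
set.seed(1)
n    = 200
x    = runif(n, 0, 10)
y    = sin(x) + rnorm(n, sd = 0.3)
fold = sample(rep(1:10, length.out = n))
dat  = data.frame(x, y)

# Cross-validated RMSE for polynomial degrees 1 through 6:
cv.rmse = sapply(1:6, function(deg){
  pred = rep(NA_real_, n)
  for(k in 1:10){
    fit = lm(y ~ poly(x, deg), data = dat[fold != k, ])
    pred[fold == k] = predict(fit, newdata = dat[fold == k, ])
  }
  sqrt(mean((y - pred)^2))
})
which.min(cv.rmse)  # degree with the lowest cross-validated RMSE
```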