We will continue our work with daily water temperature and air temperature data observed for 31 rivers in Spain. In the preview below, W
represents the daily maximum water temperature and A
represents the daily maximum air temperature. The data contains almost a full year of data for each of the 31 different rivers.
## # A tibble: 6 x 8
## JULIAN_DAY YEAR L W A TIME MONTH DAY
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2003 103 14.2 21.2 1 1 1
## 2 2 2003 103 14.4 16.8 2 1 2
## 3 3 2003 103 14.4 15.4 3 1 3
## 4 4 2003 103 10.9 10.8 4 1 4
## 5 5 2003 103 10.8 11.7 5 1 5
## 6 6 2003 103 10.7 12.4 6 1 6
Using the data, we seek to identify the best model for predicting the maximum water temperature given the maximum air temperature. Previously, we randomly selected 3 rivers to act as a test set. All models were evaluated based on the randomly selected test set. In this tutorial, we explore approaches that ensure that all data is used for both model training and model testing.
In this tutorial, we apply helpful functions in the purrr
, modelr
, and broom
packages. See the following links for helpful articles on performing cross-validation within the tidyverse: Link 1, Link 2, and Link 3.
NEST.DATA = DATA %>% group_by(L) %>% nest()
head(NEST.DATA)
## # A tibble: 6 x 2
## # Groups: L [6]
## L data
## <dbl> <list>
## 1 103 <tibble [365 x 7]>
## 2 105 <tibble [366 x 7]>
## 3 107 <tibble [365 x 7]>
## 4 111 <tibble [365 x 7]>
## 5 112 <tibble [366 x 7]>
## 6 118 <tibble [366 x 7]>
filter()
with unnest()
To Split DataNEST.DATA %>% filter(L==103) %>% unnest() %>% glimpse()
## Rows: 365
## Columns: 8
## Groups: L [1]
## $ L <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 1...
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ YEAR <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
## $ W <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10....
## $ A <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5...
## $ TIME <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
NEST.DATA %>% filter(L!=103) %>% unnest() %>% glimpse()
## Rows: 10,972
## Columns: 8
## Groups: L [30]
## $ L <dbl> 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 1...
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ YEAR <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 20...
## $ W <dbl> 7.9, 7.8, 8.7, 9.0, 9.4, 10.4, 9.9, 9.8, 10.3, NA, 9.5, ...
## $ A <dbl> 14.0, 15.0, 12.4, 14.6, 18.0, 18.2, 12.0, 14.2, 16.3, NA...
## $ TIME <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
DATA2=DATA
DATA2$linpred=NA
TEST = NEST.DATA %>% filter(L==103) %>% unnest()
TRAIN = NEST.DATA %>% filter(L!=103) %>% unnest()
linmod=lm(W~A,data=TRAIN)
linmodpred=predict(linmod,newdata=TEST)
DATA2$linpred[which(DATA2$L==103)]=linmodpred
head(DATA2)
## # A tibble: 6 x 9
## JULIAN_DAY YEAR L W A TIME MONTH DAY linpred
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2003 103 14.2 21.2 1 1 1 17.3
## 2 2 2003 103 14.4 16.8 2 1 2 14.4
## 3 3 2003 103 14.4 15.4 3 1 3 13.5
## 4 4 2003 103 10.9 10.8 4 1 4 10.5
## 5 5 2003 103 10.8 11.7 5 1 5 11.1
## 6 6 2003 103 10.7 12.4 6 1 6 11.5
#
RMSE.func=function(actual,predict){
}
RMSE.func(actual=DATA2$W,predict=DATA2$linpred)
#Part 2: K-Fold CV for Polynomial Model Evaluation
##Chunk 1: Exploratory Figures
##Chunk 2: Polynomial Fitting
polymodel=lm(W~poly(A,4)+poly(JULIAN_DAY,3),data=na.omit(DATA))
tidy(polymodel)
glance(polymodel)
DATA3=na.omit(DATA) %>% crossv_kfold(10)
head(DATA3)
train.model.func=function(data,i,j){
mod=lm(W~poly(A,i)+poly(JULIAN_DAY,j),data=data)
return(mod)
}
i=4
j=3
DATA4=DATA3 %>%
mutate(tr.model=map(train,train.model.func,i=i,j=j))
head(DATA4)
DATA4.PREDICT = DATA4 %>%
mutate(predict=map2(test,tr.model,~augment(.y,newdata=.x))) %>%
select(predict) %>%
unnest()
head(DATA4.PREDICT)
RMSE.func(actual=DATA4.PREDICT$W,predict=DATA4.PREDICT$.fitted)