Today, we will work with daily water temperature and air temperature data observed for 31 rivers in Spain. The goal of this tutorial is to identify the best model for predicting the daily maximum water temperature from the daily maximum air temperature. In the preview below, `W` represents the daily maximum water temperature and `A` represents the daily maximum air temperature. The dataset contains nearly a full year of observations for each of the 31 rivers.
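The original tutorial does not show which packages were loaded, but the code below relies on the tidyverse (dplyr, ggplot2, tibble, the pipe) and on modelr for `add_predictions()` and `add_residuals()`. A minimal setup sketch, assuming the river data have already been read into an object named `DATA`:

library(tidyverse)  # dplyr, ggplot2, tibble, and the %>% pipe
library(modelr)     # add_predictions() and add_residuals()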
## # A tibble: 6 × 8
## JULIAN_DAY YEAR L W A TIME MONTH DAY
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2003 103 14.2 21.2 1 1 1
## 2 2 2003 103 14.4 16.8 2 1 2
## 3 3 2003 103 14.4 15.4 3 1 3
## 4 4 2003 103 10.9 10.8 4 1 4
## 5 5 2003 103 10.8 11.7 5 1 5
## 6 6 2003 103 10.7 12.4 6 1 6
set.seed(216)
# Randomly hold out 3 of the 31 river locations for testing
TEST.LOCATIONS = sample(x=unique(DATA$L), size=3, replace=F)
# TRAIN keeps rows whose location was NOT held out; TEST keeps the held-out locations
TRAIN = anti_join(DATA, tibble(L=TEST.LOCATIONS), by="L")
TEST = semi_join(DATA, tibble(L=TEST.LOCATIONS), by="L")
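As a quick sanity check (not part of the original code), the location-based split can be verified by counting distinct rivers in each set:

n_distinct(TRAIN$L)  # expected: 28 training rivers
n_distinct(TEST$L)   # expected: 3 test rivers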
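The first candidate is a simple linear regression of `W` on `A`. The fitting code is not shown in this section, but based on the `Call` in the summary output below it was presumably created along these lines:

linmod = lm(W ~ A, data=TRAIN)  # simple linear regression of water temp on air temp
summary(linmod)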
##
## Call:
## lm(formula = W ~ A, data = TRAIN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1495 -2.1024 -0.1857 1.8851 16.8637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.394990 0.087161 38.95 <0.0000000000000002 ***
## A 0.649422 0.004101 158.36 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.075 on 8866 degrees of freedom
## (1371 observations deleted due to missingness)
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7388
## F-statistic: 2.508e+04 on 1 and 8866 DF, p-value: < 0.00000000000000022
# Store linear-model predictions and residuals for both the training and test sets
TRAIN2 = TRAIN %>% add_predictions(linmod, var="linpred")
TEST2 = TEST %>% add_predictions(linmod, var="linpred")
TRAIN3 = TRAIN2 %>% add_residuals(linmod, var="linres")
TEST3 = TEST2 %>% add_residuals(linmod, var="linres")
# Polynomial models of increasing degree in A
poly2mod = lm(W ~ A + I(A^2), data=TRAIN)
poly3mod = lm(W ~ A + I(A^2) + I(A^3), data=TRAIN)
poly4mod = lm(W ~ A + I(A^2) + I(A^3) + I(A^4), data=TRAIN)
# Compare the nested models sequentially
anova(linmod, poly2mod, poly3mod, poly4mod, test="Chisq")
## Analysis of Variance Table
##
## Model 1: W ~ A
## Model 2: W ~ A + I(A^2)
## Model 3: W ~ A + I(A^2) + I(A^3)
## Model 4: W ~ A + I(A^2) + I(A^3) + I(A^4)
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 8866 83855
## 2 8865 83553 1 302.01 0.00000001284 ***
## 3 8864 82840 1 713.09 < 0.00000000000000022 ***
## 4 8863 82730 1 109.48 0.0006152 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Store polynomial-model predictions for both sets
TRAIN4 = TRAIN3 %>%
  add_predictions(poly2mod, var="poly2pred") %>%
  add_predictions(poly3mod, var="poly3pred") %>%
  add_predictions(poly4mod, var="poly4pred")
TEST4 = TEST3 %>%
  add_predictions(poly2mod, var="poly2pred") %>%
  add_predictions(poly3mod, var="poly3pred") %>%
  add_predictions(poly4mod, var="poly4pred")
# Store polynomial-model residuals for both sets
TRAIN5 = TRAIN4 %>%
  add_residuals(poly2mod, var="poly2res") %>%
  add_residuals(poly3mod, var="poly3res") %>%
  add_residuals(poly4mod, var="poly4res")
TEST5 = TEST4 %>%
  add_residuals(poly2mod, var="poly2res") %>%
  add_residuals(poly3mod, var="poly3res") %>%
  add_residuals(poly4mod, var="poly4res")
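The last candidate is a four-parameter logistic curve, fit by directly minimizing the mean squared error with `optim()`. Writing out the model implemented in `logistic.model()` below, with $\theta_1,\dots,\theta_4$ standing in for `COEF[1]` through `COEF[4]` (the notation is mine, not the original tutorial's):

$$\widehat{W} = \theta_1 + \frac{\theta_2}{1 + \exp(\theta_3 - \theta_4 A)}$$

Here $\theta_1$ acts as a lower asymptote and $\theta_1 + \theta_2$ as an upper asymptote, which is why the starting values below use the minimum and range of `W`.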
# Four-parameter logistic curve: predicted W as a function of A
logistic.model = function(COEF, DATA){
  pred = COEF[1] + COEF[2]/(1 + exp(COEF[3] - COEF[4]*DATA$A))
  return(pred)
}
# Mean squared error between observed W and the logistic predictions
MSE.logistic = function(COEF, DATA){
  error = DATA$W - logistic.model(DATA=DATA, COEF=COEF)
  sq.error = error^2
  mse = mean(sq.error, na.rm=T)
  return(mse)
}
logistic.mod=optim(
par=c(min(TRAIN$W,na.rm=T),
max(TRAIN$W,na.rm=T)-min(TRAIN$W,na.rm=T),
mean(TRAIN$A,na.rm=T),
1), #Smart Starting Values
fn=MSE.logistic, #Function to Minimize
DATA=TRAIN #Required Argument
)
print(logistic.mod)
## $par
## [1] 2.97856812 47.85522690 2.34811569 0.06839438
##
## $value
## [1] 9.665992
##
## $counts
## function gradient
## 501 NA
##
## $convergence
## [1] 1
##
## $message
## NULL
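Note that `$convergence` equals 1, which for `optim()` means the iteration limit (`maxit`, 500 by default for the Nelder-Mead method) was reached before convergence. One possible follow-up, not part of the original tutorial, is to restart from the returned parameters with a larger limit (the name `logistic.mod2` is hypothetical):

# Hypothetical refit: warm-start from the previous solution and allow more iterations
logistic.mod2 = optim(
  par=logistic.mod$par,      # start from the earlier parameter estimates
  fn=MSE.logistic,
  DATA=TRAIN,
  control=list(maxit=5000)   # raise the Nelder-Mead default of 500
)
logistic.mod2$convergence    # 0 would indicate successful convergence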
# Predictions and residuals for the fitted logistic curve
TRAIN6 = TRAIN5 %>% mutate(logpred=logistic.model(COEF=logistic.mod$par, DATA=TRAIN5),
                           logres=W-logpred)
TEST6 = TEST5 %>% mutate(logpred=logistic.model(COEF=logistic.mod$par, DATA=TEST5),
                         logres=W-logpred)
The function `save.image()` in R can be used to save all objects in the global environment. This is very helpful when you want to work off of your results without rerunning all of the previous R code. The name of the exported file should contain the extension `.Rdata`. These files can be extremely large, depending on how much RAM was utilized in your R session. The function `load()` can be used to import a previously saved workspace. For more information on `.Rdata` file types, see https://fileinfo.com/extension/rdata.
save.image("Tutorial.Rdata")
# Mean bias of the residuals (positive values mean the model under-predicts on average)
bias.func = function(res){
  bias = mean(res, na.rm=T)
  return(bias)
}
# Mean absolute error
mae.func = function(res){
  mae = mean(abs(res), na.rm=T)
  return(mae)
}
# Root mean squared error
rmse.func = function(res){
  mse = mean(res^2, na.rm=T)
  rmse = sqrt(mse)
  return(rmse)
}
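For reference, with test-set residuals $e_i = W_i - \widehat{W}_i$ these three functions compute (missing values are dropped before averaging):

$$\text{MB} = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e_i|, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}$$

MB (mean bias) is the label used in the summary table at the end of this tutorial.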
ex.res=TEST6$linres
c(bias.func(ex.res),mae.func(ex.res),rmse.func(ex.res))
## [1] 0.9534126 2.7503233 3.3515944
# Collect the residual columns for all five candidate models
ex.res.mat = TEST6 %>% select(linres, poly2res, poly3res, poly4res, logres)
apply(ex.res.mat,2,bias.func)
## linres poly2res poly3res poly4res logres
## 0.9534126 0.9742415 0.9903951 0.9920042 0.7856188
apply(ex.res.mat,2,mae.func)
## linres poly2res poly3res poly4res logres
## 2.750323 2.732399 2.706833 2.715366 2.676407
apply(ex.res.mat,2,rmse.func)
## linres poly2res poly3res poly4res logres
## 3.351594 3.344867 3.328889 3.338710 3.291719
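The summary tibble printed next gathers these metrics into one table. The assembly code is not shown in the original; a sketch of how it was presumably built (the object name `RESULTS` is my own):

# Gather the test-set metrics for all five models into one summary table
RESULTS = tibble(
  Model = factor(c("Linear","Poly(2)","Poly(3)","Poly(4)","Logistic"),
                 levels=c("Linear","Poly(2)","Poly(3)","Poly(4)","Logistic")),
  MB = apply(ex.res.mat, 2, bias.func),
  MAE = apply(ex.res.mat, 2, mae.func),
  RMSE = apply(ex.res.mat, 2, rmse.func)
)
RESULTS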
## # A tibble: 5 × 4
## Model MB MAE RMSE
## <fct> <dbl> <dbl> <dbl>
## 1 Linear 0.953 2.75 3.35
## 2 Poly(2) 0.974 2.73 3.34
## 3 Poly(3) 0.990 2.71 3.33
## 4 Poly(4) 0.992 2.72 3.34
## 5 Logistic 0.786 2.68 3.29
| Model    | MB     | MAE    | RMSE   |
|----------|--------|--------|--------|
| Linear   | 0.9534 | 2.7503 | 3.3516 |
| Poly(2)  | 0.9742 | 2.7324 | 3.3449 |
| Poly(3)  | 0.9904 | 2.7068 | 3.3289 |
| Poly(4)  | 0.9920 | 2.7154 | 3.3387 |
| Logistic | 0.7856 | 2.6764 | 3.2917 |