In this lab, you will build predictive models for board game ratings. The dataset was scraped from boardgamegeek.com and contains information on the top 4,999 board games. A preview of the data is shown below.
# Read the scraped data and keep columns 4-13 and 15-20 (16 variables)
bgg <- read.csv("bgg.csv")
bgg2 <- bgg[, c(4:13, 15:20)]
head(bgg2)
## names min_players max_players
## 1 Gloomhaven 1 4
## 2 Pandemic Legacy: Season 1 2 4
## 3 Through the Ages: A New Story of Civilization 2 4
## 4 Terraforming Mars 1 5
## 5 Twilight Struggle 2 2
## 6 Star Wars: Rebellion 2 4
## avg_time min_time max_time year avg_rating geek_rating num_votes age
## 1 120 60 120 2017 8.98893 8.61858 15376 12
## 2 60 60 60 2015 8.66140 8.50163 26063 13
## 3 240 180 240 2015 8.60673 8.30183 12352 14
## 4 120 120 120 2016 8.38461 8.19914 26004 12
## 5 180 120 180 2005 8.33954 8.19787 31301 13
## 6 240 180 240 2016 8.47439 8.16545 13336 14
## mechanic
## 1 Action / Movement Programming, Co-operative Play, Grid Movement, Hand Management, Modular Board, Role Playing, Simultaneous Action Selection, Storytelling, Variable Player Powers
## 2 Action Point Allowance System, Co-operative Play, Hand Management, Point to Point Movement, Set Collection, Trading, Variable Player Powers
## 3 Action Point Allowance System, Auction/Bidding, Card Drafting
## 4 Card Drafting, Hand Management, Set Collection, Tile Placement, Variable Player Powers
## 5 Area Control / Area Influence, Campaign / Battle Card Driven, Dice Rolling, Hand Management, Simultaneous Action Selection
## 6 Area Control / Area Influence, Area Movement, Dice Rolling, Hand Management, Partnerships, Variable Player Powers
## owned
## 1 25928
## 2 41605
## 3 15848
## 4 33340
## 5 42952
## 6 20682
## category
## 1 Adventure, Exploration, Fantasy, Fighting, Miniatures
## 2 Environmental, Medical
## 3 Card Game, Civilization, Economic
## 4 Economic, Environmental, Industry / Manufacturing, Science Fiction, Territory Building
## 5 Modern Warfare, Political, Wargame
## 6 Fighting, Miniatures, Movies / TV / Radio theme, Science Fiction, Wargame
## designer weight
## 1 Isaac Childres 3.7543
## 2 Rob Daviau, Matt Leacock 2.8210
## 3 Vlaada Chvátil 4.3678
## 4 Jacob Fryxelius 3.2456
## 5 Ananda Gupta, Jason Matthews 3.5518
## 6 Corey Konieczka 3.6311
There are 16 variables, and we want to create some more. Create a new data frame called bgg3, using the mutate function to create the following variables:
head(bgg3)
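The general mutate() pattern can be sketched as follows, assuming the dplyr package is installed. The toy data frame reuses a few values from the preview above, and the new column names (log_votes, log_owned) are hypothetical stand-ins, not the variables this lab requires.

```r
library(dplyr)

# Toy data frame standing in for bgg2 (values copied from the preview above)
toy <- data.frame(
  num_votes = c(15376, 26063, 12352),
  owned     = c(25928, 41605, 15848)
)

# mutate() appends new columns computed from existing ones.
# The column names below are hypothetical, chosen only for illustration.
toy2 <- toy %>%
  mutate(
    log_votes = log(num_votes + 1),
    log_owned = log(owned + 1)
  )

head(toy2)
```

The original columns are kept, so toy2 has the two input columns plus the two new ones.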
Question: In complete sentences, what is the purpose of adding 1 for the log transformed variables?
YOUR ANSWER IN COMPLETE SENTENCES
Question: In complete sentences, what is the purpose of adding 1 in the creation of the year variable?
YOUR ANSWER IN COMPLETE SENTENCES
We hypothesize that the geek rating increases when the number of votes increases and/or the ownership increases. Create four scatter plots showing the association between geek_rating and each of the following variables:
Question: In complete sentences, describe how the relationship changes when you take the log of the independent variable.
YOUR ANSWER IN COMPLETE SENTENCES
Randomly sample approximately 80% of the data in bgg3 for a training dataset; the remaining rows will act as a test set. Call the training dataset train.bgg and the testing dataset test.bgg.
set.seed(COMPLETE)
bgg4= bgg3 %>%
mutate(Set=sample(COMPLETE))
train.bgg<-filter(bgg4,Set=="Train")
test.bgg<-filter(bgg4,Set=="Test")
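One common way to get an approximately 80/20 split is to draw a label for every row with sample() and unequal probabilities. The sketch below uses the built-in mtcars data rather than bgg3, assumes dplyr is installed, and uses an arbitrary seed; it is one possible pattern, not necessarily the intended completion of the template above.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible

# Draw one label per row; each row lands in "Train" with probability 0.8,
# so the split is approximately (not exactly) 80/20.
cars2 <- mtcars %>%
  mutate(Set = sample(c("Train", "Test"), size = n(), replace = TRUE,
                      prob = c(0.8, 0.2)))

train.cars <- filter(cars2, Set == "Train")
test.cars  <- filter(cars2, Set == "Test")

# Every row ends up in exactly one of the two sets
nrow(train.cars) + nrow(test.cars) == nrow(mtcars)
```

Because the labels are drawn independently for each row, the training share varies from seed to seed, which is why the instructions say "approximately".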
Now, we want to fit models to the training dataset. Use the lm() function to create 3 model objects in R called lm1, lm2, lm3 based on the following linear models, respectively:
lm1 = lm(COMPLETE,data=train.bgg)
lm2 = lm(COMPLETE,data=train.bgg)
lm3 = lm(COMPLETE,data=train.bgg)
Add predictions and residuals for all 3 models to the test set. Create a new data frame called test.bgg2 and give all your predictions and residuals different names. Use the str() function to show these variables were created.
str(test.bgg2)
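As a minimal base R sketch of this step, the example below fits one model on a training subset of the built-in mtcars data and attaches its predictions and residuals to the test subset. The formula, the split, and the column names pred1 and resid1 are all hypothetical; if your course uses the modelr package, its add_predictions() and add_residuals() helpers serve the same purpose.

```r
# Deterministic split of the built-in mtcars data, used only for illustration
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Hypothetical model; the lab's own formulas are not reproduced here
fit <- lm(mpg ~ wt, data = train)

# Predictions come from predict(); residual = observed - predicted
test2 <- test
test2$pred1  <- predict(fit, newdata = test)
test2$resid1 <- test2$mpg - test2$pred1

str(test2[, c("mpg", "pred1", "resid1")])
```

Repeating the last two assignments with a different suffix for each model keeps the prediction and residual columns distinct.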
Create a function called MAE.func() that returns the mean absolute error based on a vector of the residuals, and test your function on the vector called test.
Solution 1:
test=c(-5,-2,0,3,5)
MAE.func(test)
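The mean absolute error is the average of the absolute values of the residuals. One possible sketch of such a function is shown below; the argument name resids and the na.rm = TRUE choice are assumptions, not requirements of the lab.

```r
# Mean absolute error: average of the absolute values of the residuals.
# na.rm = TRUE is an added assumption so missing residuals are ignored.
MAE.func <- function(resids) {
  mean(abs(resids), na.rm = TRUE)
}

test <- c(-5, -2, 0, 3, 5)
MAE.func(test)  # (5 + 2 + 0 + 3 + 5) / 5 = 3
```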
Use your function on test.bgg2 to calculate the out-of-sample MAE of all three models based on the associated residuals. Make sure you display the mean absolute error from these different models in your output.
Question: Which model does the best job at predicting the geek rating of these board games?
YOUR ANSWER IN COMPLETE SENTENCES
For the third model only, use 10-fold cross-validation and measure the out-of-sample mean absolute error. Print out the final cross-validated mean absolute error.
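A hand-rolled version of 10-fold cross-validation can be sketched in base R as follows. The example uses the built-in mtcars data, an arbitrary seed, and a hypothetical formula; substitute your own data frame and third model.

```r
set.seed(1)  # arbitrary seed for reproducibility

k <- 10
n <- nrow(mtcars)

# Assign each row to one of k folds, shuffled so the folds are random
fold <- sample(rep(1:k, length.out = n))

fold_mae <- numeric(k)
for (i in 1:k) {
  train_i <- mtcars[fold != i, ]
  test_i  <- mtcars[fold == i, ]

  # Hypothetical formula; substitute the lab's third model here
  fit_i <- lm(mpg ~ wt + hp, data = train_i)

  # Out-of-sample residuals for the held-out fold
  resid_i <- test_i$mpg - predict(fit_i, newdata = test_i)
  fold_mae[i] <- mean(abs(resid_i))
}

# One common summary: the average of the per-fold MAEs
cv_mae <- mean(fold_mae)
cv_mae
```

Averaging the per-fold MAEs is one convention; pooling all held-out residuals and taking a single mean absolute value gives a slightly different number when the folds are unequal in size.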
Question: What is the absolute difference between the out-of-sample mean absolute error measured using a test set and the mean absolute error measured using cross-validation? When you type your answer in complete sentences, use inline R code to calculate the absolute difference and insert it directly into your sentence.
YOUR ANSWER IN COMPLETE SENTENCES