Introduction

In this lab, you will build predictive models for board game ratings. The dataset was scraped from boardgamegeek.com and contains information on the top 4,999 board games. A preview of the data is shown below.

# Read the scraped data and keep columns 4-13 and 15-20 (16 variables)
bgg <- read.csv("bgg.csv")
bgg2 <- bgg[, c(4:13, 15:20)]
head(bgg2)
##                                           names min_players max_players
## 1                                    Gloomhaven           1           4
## 2                     Pandemic Legacy: Season 1           2           4
## 3 Through the Ages: A New Story of Civilization           2           4
## 4                             Terraforming Mars           1           5
## 5                             Twilight Struggle           2           2
## 6                          Star Wars: Rebellion           2           4
##   avg_time min_time max_time year avg_rating geek_rating num_votes age
## 1      120       60      120 2017    8.98893     8.61858     15376  12
## 2       60       60       60 2015    8.66140     8.50163     26063  13
## 3      240      180      240 2015    8.60673     8.30183     12352  14
## 4      120      120      120 2016    8.38461     8.19914     26004  12
## 5      180      120      180 2005    8.33954     8.19787     31301  13
## 6      240      180      240 2016    8.47439     8.16545     13336  14
##                                                                                                                                                                             mechanic
## 1 Action / Movement Programming, Co-operative Play, Grid Movement, Hand Management, Modular Board, Role Playing, Simultaneous Action Selection, Storytelling, Variable Player Powers
## 2                                        Action Point Allowance System, Co-operative Play, Hand Management, Point to Point Movement, Set Collection, Trading, Variable Player Powers
## 3                                                                                                                      Action Point Allowance System, Auction/Bidding, Card Drafting
## 4                                                                                             Card Drafting, Hand Management, Set Collection, Tile Placement, Variable Player Powers
## 5                                                         Area Control / Area Influence, Campaign / Battle Card Driven, Dice Rolling, Hand Management, Simultaneous Action Selection
## 6                                                                  Area Control / Area Influence, Area Movement, Dice Rolling, Hand Management, Partnerships, Variable Player Powers
##   owned
## 1 25928
## 2 41605
## 3 15848
## 4 33340
## 5 42952
## 6 20682
##                                                                                 category
## 1                                  Adventure, Exploration, Fantasy, Fighting, Miniatures
## 2                                                                 Environmental, Medical
## 3                                                      Card Game, Civilization, Economic
## 4 Economic, Environmental, Industry / Manufacturing, Science Fiction, Territory Building
## 5                                                     Modern Warfare, Political, Wargame
## 6              Fighting, Miniatures, Movies / TV / Radio theme, Science Fiction, Wargame
##                       designer weight
## 1               Isaac Childres 3.7543
## 2     Rob Daviau, Matt Leacock 2.8210
## 3               Vlaada Chvátil 4.3678
## 4              Jacob Fryxelius 3.2456
## 5 Ananda Gupta, Jason Matthews 3.5518
## 6              Corey Konieczka 3.6311

Board Game Analysis

Q1 (1.5 Points)

The data frame bgg2 contains 16 variables, and we want to create several more. Create a new data frame called bgg3, using the mutate() function to create the following variables:

  • duration=2018-year+1
  • vote.per.year=num_votes/duration
  • own.per.year=owned/duration
  • player.range=max_players-min_players
  • log_vote=log(num_votes+1)
  • log_own=log(owned+1)
  • diff_rating=avg_rating-geek_rating

head(bgg3)
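
As a rough illustration of how mutate() builds derived columns like those listed above, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are hypothetical and are not taken from bgg.csv; only three of the seven variables are shown, and dplyr is assumed to be installed.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from bgg.csv)
toy <- data.frame(
  year      = c(2018, 2015, 2008),
  num_votes = c(0, 400, 1000),
  owned     = c(10, 800, 3000)
)

toy2 <- toy %>%
  mutate(
    duration      = 2018 - year + 1,       # 1, 4, 11
    vote.per.year = num_votes / duration,  # divides by a value that is at least 1
    log_vote      = log(num_votes + 1)     # first row: log(0 + 1) = 0
  )

toy2
log(0)  # -Inf, for comparison with the first row of log_vote
```

Columns created earlier inside the same mutate() call, such as duration here, can be used by later expressions in that call.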

Question: In complete sentences, what is the purpose of adding 1 for the log transformed variables?

YOUR ANSWER IN COMPLETE SENTENCES

Question: In complete sentences, what is the purpose of adding 1 in the creation of the duration variable?

YOUR ANSWER IN COMPLETE SENTENCES

Q2 (2 Points)

We hypothesize that the geek rating increases when the number of votes and/or the ownership increases. Create four scatter plots showing the association between geek_rating and each of the following variables:

  • num_votes
  • owned
  • log_vote
  • log_own
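
One possible way to lay out a raw-versus-log comparison is sketched below with base graphics on simulated data. The vectors `x` and `y` are hypothetical stand-ins for a skewed count and a rating; they are not the lab data, and ggplot2 would work equally well.

```r
# Simulated right-skewed data (hypothetical, for illustration only)
set.seed(1)
x <- rexp(200, rate = 1 / 1000)                   # skewed predictor, like a vote count
y <- 5 + 0.5 * log(x + 1) + rnorm(200, sd = 0.3)  # response that is linear in log(x + 1)

pdf(NULL)                    # null device: nothing is written; remove to view the plots
par(mfrow = c(1, 2))         # two panels side by side
plot(x, y, main = "Raw predictor")
plot(log(x + 1), y, main = "Log predictor")
invisible(dev.off())         # close the device
```

Placing the two panels side by side makes it easier to compare how the points spread along the horizontal axis before and after the transformation.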

Question: In complete sentences, describe how the relationship changes when you take the log of the independent variable.

YOUR ANSWER IN COMPLETE SENTENCES

Q3 (0.5 Points)

Randomly sample approximately 80% of the data in bgg3 for a training dataset, and use the remaining data as a test set. Call the training dataset train.bgg and the testing dataset test.bgg.

set.seed(COMPLETE)

bgg4= bgg3 %>%
        mutate(Set=sample(COMPLETE))

train.bgg<-filter(bgg4,Set=="Train")
test.bgg<-filter(bgg4,Set=="Test")
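
For reference, here is a sketch of one common way to label rows at random with an approximate 80/20 split. It runs on a toy data frame; the seed, the names `toy`, `toy_split`, `toy_train`, and `toy_test`, and the choice of sampling with replacement are assumptions for illustration, not the required solution.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, chosen only for this sketch

toy <- data.frame(id = 1:1000)  # hypothetical data

# Draw one label per row; each row is "Train" with probability 0.8
toy_split <- toy %>%
  mutate(Set = sample(c("Train", "Test"), size = n(),
                      replace = TRUE, prob = c(0.8, 0.2)))

toy_train <- filter(toy_split, Set == "Train")
toy_test  <- filter(toy_split, Set == "Test")

table(toy_split$Set)  # close to 800 Train and 200 Test, but not exact
```

Because each label is drawn independently, the split is only approximately 80/20, which matches the wording of the question.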

Q4 (0.5 Points)

Now, we want to fit models to the training dataset. Use the lm() function to create 3 model objects in R called lm1, lm2, lm3 based on the following linear models, respectively:

  • \(\textrm{geek_rating}=\beta_0+\beta_1 \log(\textrm{num_votes})+\epsilon\)
  • \(\textrm{geek_rating}=\beta_0+\beta_1 \log(\textrm{owned})+\epsilon\)
  • \(\textrm{geek_rating}=\beta_0+\beta_1 \log(\textrm{owned})+ \beta_2 \textrm{vote.per.year}+ \beta_3 \textrm{weight} + \epsilon\)

lm1 = lm(COMPLETE,data=train.bgg)
lm2 = lm(COMPLETE,data=train.bgg)
lm3 = lm(COMPLETE,data=train.bgg)
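
To show how a model equation of this shape maps onto an R formula, here is a small sketch using the built-in mtcars data in place of train.bgg. The variables mpg, hp, and wt are unrelated to the lab; they only demonstrate the syntax.

```r
# mtcars ships with R; it stands in for train.bgg here.
# Model: mpg = b0 + b1 * log(hp) + b2 * wt + error
fit <- lm(mpg ~ log(hp) + wt, data = mtcars)

coef(fit)  # intercept first, then one slope per term on the right-hand side
```

The intercept and error term are implicit in the formula, and transformations such as log() can be written directly inside it.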

Q5 (1 Point)

Add predictions and residuals for all 3 models to the test set. Create a new data frame called test.bgg2 and give each prediction and residual column a distinct name. Use the str() function to show that these variables were created.

str(test.bgg2)
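
A base R sketch of adding out-of-sample predictions and residuals as new columns is shown below, again using mtcars as a stand-in. The split by row position and the column names `pred1` and `resid1` are hypothetical. Helper functions such as add_predictions() and add_residuals() from the modelr package are an alternative if that package is used in your course.

```r
# Hypothetical split of mtcars: first 24 rows to fit, last 8 rows to evaluate
toy_train <- mtcars[1:24, ]
toy_test  <- mtcars[25:32, ]

fit <- lm(mpg ~ log(hp), data = toy_train)

toy_test2 <- toy_test
toy_test2$pred1  <- predict(fit, newdata = toy_test)   # predictions on unseen rows
toy_test2$resid1 <- toy_test2$mpg - toy_test2$pred1    # residual = observed - predicted

str(toy_test2[, c("mpg", "pred1", "resid1")])
```

Giving each model its own suffix (for example pred1, pred2, pred3) keeps the columns from overwriting one another.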

Q6 (0.5 Points)

Create a function called MAE.func() that returns the mean absolute error based on a vector of the residuals and test your function on the vector called test.

Solution 1:

test=c(-5,-2,0,3,5)



MAE.func(test)
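
For reference, the mean absolute error of residuals \(e_1, \dots, e_n\) is \(\frac{1}{n}\sum_{i=1}^{n} |e_i|\). A minimal sketch of that formula follows; the function name `mae_sketch` is hypothetical and is used here only to avoid clashing with your own MAE.func().

```r
# Mean absolute error: average of the absolute values of the residuals
mae_sketch <- function(resid) {
  mean(abs(resid))
}

mae_sketch(c(-5, -2, 0, 3, 5))  # (5 + 2 + 0 + 3 + 5) / 5 = 3
```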

Q7 (1 Point)

Use your function on test.bgg2 to calculate the out-of-sample MAE of all three models based on the associated residuals. Make sure you display the mean absolute error from each model in your output.

Question: Which model does the best job at predicting the geek rating of these board games?

YOUR ANSWER IN COMPLETE SENTENCES

Q8 (3 Points)

For the third model only, use 10-fold cross-validation and measure the out-of-sample mean absolute error. Print out the final cross-validated mean absolute error.
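
The general shape of k-fold cross-validation can be sketched in base R as follows. This uses mtcars and an unrelated model as a stand-in; the seed, the fold assignment method, and the choice to average the per-fold errors are assumptions, and your course may use a different workflow.

```r
set.seed(42)  # arbitrary seed for this sketch

k   <- 10
dat <- mtcars

# Assign each row to one of k folds at random, keeping fold sizes nearly equal
fold <- sample(rep(1:k, length.out = nrow(dat)))

fold_mae <- numeric(k)
for (i in 1:k) {
  fit  <- lm(mpg ~ log(hp) + wt, data = dat[fold != i, ])  # fit on the other k-1 folds
  pred <- predict(fit, newdata = dat[fold == i, ])         # predict the held-out fold
  fold_mae[i] <- mean(abs(dat$mpg[fold == i] - pred))      # MAE on the held-out fold
}

cv_mae <- mean(fold_mae)  # average of the k per-fold errors
cv_mae
```

Averaging the per-fold errors, as done here, can differ slightly from pooling all held-out residuals and taking a single mean when the folds are not the same size.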

Question: What is the absolute difference between the out-of-sample mean absolute error measured using a test set and the mean absolute error measured using cross-validation? When you type your answer in complete sentences, use inline R code to calculate the absolute difference and insert it directly into your sentence.

YOUR ANSWER IN COMPLETE SENTENCES