Lab 4a: Control Structures

Introduction

The main purpose of this lab is to practice control structures in R:

  • if and else: testing a condition and acting on it
  • for: execute a loop a fixed number of times
  • while: execute a loop while a condition is true
  • repeat: execute an infinite loop (must break out of it to stop)
  • break: break the execution of a loop
  • next: skip an iteration of a loop
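As a quick orientation, the six constructs listed above can be sketched on a toy vector. Everything in this block (the vector values and all variable names) is made up for illustration and is not part of the graded exercises.

```r
# Toy walkthrough of the six constructs; all names and data are illustrative.
values <- c(3, -1, 4, -1, 5)

# if / else: test a condition and act on it
if (values[1] > 0) {
  sign_label <- "positive"
} else {
  sign_label <- "not positive"
}

# for: loop a fixed number of times; next skips the rest of one iteration
positive_total <- 0
for (v in values) {
  if (v < 0) next                      # skip negative entries
  positive_total <- positive_total + v
}

# while: keep looping for as long as the condition stays TRUE
i <- 1
while (i <= length(values) && values[i] > 0) {
  i <- i + 1
}
first_negative_index <- i

# repeat: loops forever unless break is called
n <- 1
doublings <- 0
repeat {
  n <- n * 2
  doublings <- doublings + 1
  if (n >= 100) break                  # leave the loop
}

cat("First value is", sign_label, "\n")
cat("Sum of the positive values:", positive_total, "\n")
cat("Index of the first negative value:", first_negative_index, "\n")
cat("Doublings needed to reach 100:", doublings, "\n")
```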

You will need to modify the code chunks so that the code works within each chunk (usually this means modifying anything in ALL CAPS). You will also need to modify the code outside the code chunk. When you get the desired result for each step, change Eval=F to Eval=T and knit the document to HTML to make sure it works. After you complete the lab, submit the HTML file of your completed work to Sakai before the deadline.

Part 1: Vector and Control Structures

1.1 (2 points)

Write code that creates a vector x that contains 100 random observations from the standard normal distribution (this is the normal distribution with the mean equal to 0 and the variance equal to 1). Print out only the first five random observations in this vector.
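One detail worth seeing in code is that rnorm() is parameterized by the standard deviation, not the variance (for the standard normal both equal 1). The sketch below uses made-up parameters, a made-up seed, and the hypothetical name draws; it is not the vector the exercise asks for.

```r
# Illustrative only: parameters, seed, and names are assumptions.
set.seed(123)                          # fix the seed so the draws are reproducible
draws <- rnorm(20, mean = 10, sd = 2)  # sd is the standard deviation, so the variance is 4
first_three <- draws[1:3]              # index with 1:3 to keep only the first three values
print(first_three)
```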

#

1.2 (2 points)

Write code that replaces the observations in the vector x that are greater than or equal to 0 with the character string "non-negative" and the observations that are smaller than 0 with the character string "negative". Hint: try the ifelse() function. Print out the first five values in this new version of x.
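For readers who have not used ifelse() before, here is a minimal sketch on unrelated toy data (the scores vector and the pass/fail labels are invented). It shows that ifelse() tests every element at once and that the result becomes a character vector.

```r
# Toy data; the cutoff of 60 and the labels are assumptions for illustration.
scores <- c(48, 75, 60, 59)

# ifelse(test, yes, no) is vectorized: one label per element of scores
result <- ifelse(scores >= 60, "pass", "fail")
print(result)

# The labels are strings, so the result is no longer numeric
print(class(result))
```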

#

1.3 (2 points)

Write a for loop to count how many observations in the vector x are non-negative and how many are negative. (There are many easier ways to solve this problem, but you must use a for loop or you will receive 0 points.) Use the cat() function to print out a sentence that states how many non-negative and negative observations there are. For example, “The number of non-negative observations is 32”.
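The counting pattern (initialize counters, update them inside the loop, report with cat()) can be sketched on a different problem. The vector nums and the even/odd split below are invented for illustration.

```r
# Toy data; counts even and odd values rather than the lab's categories.
nums <- c(2, 7, 4, 9, 10)

n_even <- 0   # counters start at zero before the loop
n_odd <- 0
for (value in nums) {
  if (value %% 2 == 0) {
    n_even <- n_even + 1
  } else {
    n_odd <- n_odd + 1
  }
}

# cat() pastes its arguments together with spaces; "\n" ends the line
cat("The number of even values is", n_even, "\n")
cat("The number of odd values is", n_odd, "\n")
```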

#

Part 2: Matrix and Control Structures

2.1 (4 points)

Create a \(100000\) by \(10\) matrix A with the numbers \(1:1000000\). The first row of this matrix should be the numbers 1 to 10. The second row of this matrix should be the numbers 11 to 20. Create a for loop that calculates the sum of each row of the matrix and saves the results to a vector sum_row. Print out the first five values of sum_row.

A = matrix(1:1000000, COMPLETE) # DO NOT CHANGE

Verify that your results are consistent with what you obtain with the built-in rowSums function.

sum_row_rowSums = as.integer(rowSums(A))
sum_row_rowSums[1:5]
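The same pattern can be tried on a matrix small enough to check by hand. This is a sketch with an invented 3 by 4 matrix M and the hypothetical name row_totals; it preallocates the result vector before the loop, which avoids growing it one element at a time.

```r
# Toy 3 x 4 matrix filled row by row, so row 1 is 1:4, row 2 is 5:8, and so on.
M <- matrix(1:12, nrow = 3, byrow = TRUE)

row_totals <- integer(nrow(M))        # preallocate one slot per row
for (i in seq_len(nrow(M))) {
  row_totals[i] <- sum(M[i, ])        # M[i, ] extracts row i
}
print(row_totals)

# rowSums() returns doubles, so convert before a strict comparison
print(identical(row_totals, as.integer(rowSums(M))))
```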

2.2 (4 points)

Another common loop structure is the while loop, which functions much like a for loop but will only run as long as a test condition is TRUE. Modify your for loop from the previous exercise and make it into a while loop. Use the identical() function to check if the results from the for loop are the same as the results from the while loop.
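Converting a for loop into a while loop mostly means managing the counter yourself. The sketch below does this for a toy task (squaring the numbers 1 to 5); all names are illustrative. Note that identical() is strict: it compares types and attributes as well as values.

```r
# for version: the loop variable k is advanced automatically
squares_for <- numeric(5)
for (k in 1:5) {
  squares_for[k] <- k^2
}

# while version: initialize the counter, test it, and advance it by hand
squares_while <- numeric(5)
k <- 1
while (k <= 5) {
  squares_while[k] <- k^2
  k <- k + 1                          # forgetting this line gives an infinite loop
}

print(identical(squares_for, squares_while))
```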

#

Part 3: Data Frame and Control Structures

3.1 (4 points)

Write a for loop to compute the mean of every column in mtcars and save the results to a vector col_mean. Ignore missing values when taking the mean.
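Looping over the columns of a data frame can be sketched on a small invented data frame that contains a missing value. The names toy_df and toy_means are hypothetical; the point is the na.rm = TRUE argument and the [[ ]] column extraction.

```r
# Toy data frame with one NA; names are assumptions for illustration.
toy_df <- data.frame(a = c(1, 2, NA), b = c(10, 20, 30))

toy_means <- numeric(ncol(toy_df))    # one slot per column
names(toy_means) <- names(toy_df)
for (j in seq_along(toy_df)) {
  # toy_df[[j]] is column j as a plain vector; na.rm = TRUE drops the NA
  toy_means[j] <- mean(toy_df[[j]], na.rm = TRUE)
}
print(toy_means)
```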

#

3.2 (2 points)

Compute the number of unique values in each column of iris and print the results during a loop. Use the cat() function to print out the values in a sentence with the corresponding name of the variable. For example, “The number of unique values for Sepal.Length is 35”.

names(iris) #DO NOT CHANGE
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
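Printing inside a loop while using the column name in the sentence might look like the following sketch. It runs on a small invented data frame (toy_items) rather than iris, and also stores the counts in the hypothetical vector n_unique so they can be inspected afterwards.

```r
# Toy data frame; names and values are assumptions for illustration.
toy_items <- data.frame(color = c("red", "blue", "red"), size = c(1, 2, 3))

n_unique <- integer(ncol(toy_items))
names(n_unique) <- names(toy_items)
for (nm in names(toy_items)) {
  n_unique[nm] <- length(unique(toy_items[[nm]]))   # distinct values in column nm
  cat("The number of unique values for", nm, "is", n_unique[nm], "\n")
}
```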

Lab 4b: Modeling Basics

Introduction

In this lab, you will build predictive models for board game ratings. The dataset below was scraped from boardgamegeek.com and contains information on the top 4,999 board games. Below, you will see a preview of the data.

bgg<-read.csv("bgg.csv")
bgg2=bgg[,c(4:13,15:20)]
head(bgg2)
##                                           names min_players max_players
## 1                                    Gloomhaven           1           4
## 2                     Pandemic Legacy: Season 1           2           4
## 3 Through the Ages: A New Story of Civilization           2           4
## 4                             Terraforming Mars           1           5
## 5                             Twilight Struggle           2           2
## 6                          Star Wars: Rebellion           2           4
##   avg_time min_time max_time year avg_rating geek_rating num_votes age
## 1      120       60      120 2017    8.98893     8.61858     15376  12
## 2       60       60       60 2015    8.66140     8.50163     26063  13
## 3      240      180      240 2015    8.60673     8.30183     12352  14
## 4      120      120      120 2016    8.38461     8.19914     26004  12
## 5      180      120      180 2005    8.33954     8.19787     31301  13
## 6      240      180      240 2016    8.47439     8.16545     13336  14
##                                                                                                                                                                             mechanic
## 1 Action / Movement Programming, Co-operative Play, Grid Movement, Hand Management, Modular Board, Role Playing, Simultaneous Action Selection, Storytelling, Variable Player Powers
## 2                                        Action Point Allowance System, Co-operative Play, Hand Management, Point to Point Movement, Set Collection, Trading, Variable Player Powers
## 3                                                                                                                      Action Point Allowance System, Auction/Bidding, Card Drafting
## 4                                                                                             Card Drafting, Hand Management, Set Collection, Tile Placement, Variable Player Powers
## 5                                                         Area Control / Area Influence, Campaign / Battle Card Driven, Dice Rolling, Hand Management, Simultaneous Action Selection
## 6                                                                  Area Control / Area Influence, Area Movement, Dice Rolling, Hand Management, Partnerships, Variable Player Powers
##   owned
## 1 25928
## 2 41605
## 3 15848
## 4 33340
## 5 42952
## 6 20682
##                                                                                 category
## 1                                  Adventure, Exploration, Fantasy, Fighting, Miniatures
## 2                                                                 Environmental, Medical
## 3                                                      Card Game, Civilization, Economic
## 4 Economic, Environmental, Industry / Manufacturing, Science Fiction, Territory Building
## 5                                                     Modern Warfare, Political, Wargame
## 6              Fighting, Miniatures, Movies / TV / Radio theme, Science Fiction, Wargame
##                       designer weight
## 1               Isaac Childres 3.7543
## 2     Rob Daviau, Matt Leacock 2.8210
## 3               Vlaada Chvátil 4.3678
## 4              Jacob Fryxelius 3.2456
## 5 Ananda Gupta, Jason Matthews 3.5518
## 6              Corey Konieczka 3.6311

Board Game Analysis

Q1 (1.5 Points)

There are 16 variables and we want to create some more. Create a new dataframe called \(bgg3\) where you use the mutate function to create the following variables:

  • duration=2018-year+1
  • vote.per.year=num_votes/duration
  • own.per.year=owned/duration
  • player.range=max_players-min_players
  • log_vote=log(num_votes+1)
  • log_own=log(owned+1)
  • diff_rating=avg_rating-geek_rating
head(bgg3)

Question: In complete sentences, what is the purpose of adding 1 for the log transformed variables?

YOUR ANSWER IN COMPLETE SENTENCES

Question: In complete sentences, what is the purpose of adding 1 in the creation of the duration variable?

YOUR ANSWER IN COMPLETE SENTENCES

Q2 (2 Points)

We hypothesize that the geek rating increases when the number of votes increases and/or the ownership increases. Create four scatter plots showing the association between geek_rating and each of the following variables:

  • num_votes
  • owned
  • log_vote
  • log_own

Question: In complete sentences, describe how the relationship changes when you take the log of the independent variable.

YOUR ANSWER IN COMPLETE SENTENCES

Q3 (0.5 Points)

Randomly sample approximately 80% of the data in bgg3 for a training dataset; the remaining data will act as a test set. Call the training dataset train.bgg and the testing dataset test.bgg.

set.seed(COMPLETE)

bgg4= bgg3 %>%
        mutate(Set=sample(COMPLETE))

train.bgg<-filter(bgg4,Set=="Train")
test.bgg<-filter(bgg4,Set=="Test")
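One way to get an approximate (rather than exact) 80/20 split is to draw a label for every row independently. The sketch below does this in base R on an invented data frame; the seed, the data, and the names toy, train_toy, and test_toy are assumptions, and the lab's own chunk uses dplyr verbs instead.

```r
# Illustrative only: seed, data, and names are assumptions.
set.seed(2018)
toy <- data.frame(id = 1:1000)

# Each row is labeled "Train" with probability 0.8, so the split is only
# approximately 80/20 and changes with the seed.
toy$Set <- sample(c("Train", "Test"), size = nrow(toy),
                  replace = TRUE, prob = c(0.8, 0.2))

train_toy <- toy[toy$Set == "Train", , drop = FALSE]
test_toy <- toy[toy$Set == "Test", , drop = FALSE]
print(table(toy$Set))
```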

Q4 (0.5 Points)

Now, we want to fit models to the training dataset. Use the lm() function to create 3 model objects in R called lm1, lm2, lm3 based on the following linear models, respectively:

  • \(\textrm{geek_rating}=\beta_0+\beta_1 \log(\textrm{num_votes})+\epsilon\)
  • \(\textrm{geek_rating}=\beta_0+\beta_1 \log(\textrm{owned})+\epsilon\)
  • \(\textrm{geek_rating}=\beta_0+\beta_1 \log(\textrm{owned})+ \beta_2 \textrm{vote.per.year}+ \beta_3 \textrm{weight} + \epsilon\)
lm1 = lm(COMPLETE,data=train.bgg)
lm2 = lm(COMPLETE,data=train.bgg)
lm3 = lm(COMPLETE,data=train.bgg)
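To see how a model equation maps onto R's formula syntax, here is a sketch using the built-in mtcars data rather than the board game data. The model itself is invented for illustration; the point is that the intercept and error term are implicit and that a transformation such as log() can be written directly inside the formula.

```r
# Illustrative model on built-in data: mpg = b0 + b1*log(hp) + b2*wt + error
fit <- lm(mpg ~ log(hp) + wt, data = mtcars)

# One coefficient per term, plus the implicit intercept
print(coef(fit))
```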

Q5 (1 Point)

Add predictions and residuals for all 3 models to the test set. Create a new data frame called test.bgg2 and give all your predictions and residuals different names. Use the str() function to show that these variables were created.

str(test.bgg2)
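A base R sketch of out-of-sample predictions and residuals follows, again on mtcars with a fixed (non-random) split chosen only to keep the example short. The column names pred1 and resid1 are hypothetical, and the course may intend a helper package for this step instead. Note that residuals(fit) would return the training residuals, so the test residuals are computed by hand.

```r
# Illustrative split and model; names and the split itself are assumptions.
train_cars <- mtcars[1:24, ]
test_cars <- mtcars[25:32, ]
fit <- lm(mpg ~ wt, data = train_cars)

# predict() with newdata scores rows the model has not seen
test_cars$pred1 <- predict(fit, newdata = test_cars)
test_cars$resid1 <- test_cars$mpg - test_cars$pred1   # observed minus predicted

str(test_cars[, c("mpg", "pred1", "resid1")])
```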

Q6 (0.5 Points)

Create a function called MAE.func() that returns the mean absolute error based on a vector of the residuals and test your function on the vector called test.
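The mean absolute error of residuals \(e_1, \dots, e_n\) is \(\frac{1}{n}\sum_{i=1}^{n} |e_i|\). A minimal sketch of that formula as an R function is below; the function name mean_abs_err and the example vector are hypothetical and differ from the names the exercise requires.

```r
# Hypothetical helper illustrating the formula; not the lab's required name.
mean_abs_err <- function(resid) {
  mean(abs(resid))                    # average of the absolute residuals
}

example_resid <- c(-2, 1, 3)          # |-2| + |1| + |3| = 6, and 6 / 3 = 2
print(mean_abs_err(example_resid))
```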

Solution 1:

test=c(-5,-2,0,3,5)



MAE.func(test)

Q7 (1 Point)

Use your function on test.bgg2 to calculate the out-of-sample MAE of all three models based on the associated residuals. Make sure you display the mean absolute error from these different models in your output.

Question: Which model does the best job at predicting the geek rating of these board games?

YOUR ANSWER IN COMPLETE SENTENCES

Q8 (3 Points)

For the third model only, use 10-fold cross-validation and measure the out-of-sample mean absolute error. Print out the final cross-validated mean absolute error.
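The mechanics of k-fold cross-validation can be sketched in base R with a for loop: assign each row to one of k folds, then repeatedly fit on k - 1 folds and score on the held-out fold. The example below uses mtcars and an invented model; the seed and all names are assumptions, and the course may expect a package-based workflow instead. It averages the per-fold MAEs, which can differ slightly from pooling all held-out errors when fold sizes are unequal.

```r
# Illustrative 10-fold cross-validation; data, model, and names are assumptions.
set.seed(42)
k <- 10
n <- nrow(mtcars)

# Shuffle fold labels 1..k so every row belongs to exactly one fold
fold_id <- sample(rep(1:k, length.out = n))

fold_mae <- numeric(k)
for (f in 1:k) {
  train_fold <- mtcars[fold_id != f, ]           # fit on the other k - 1 folds
  holdout <- mtcars[fold_id == f, ]              # score on the held-out fold
  fit <- lm(mpg ~ wt + hp, data = train_fold)
  err <- holdout$mpg - predict(fit, newdata = holdout)
  fold_mae[f] <- mean(abs(err))
}

cv_mae <- mean(fold_mae)                         # average over the k folds
print(cv_mae)
```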

Question: What is the absolute difference between the out-of-sample mean absolute error measured using a test set and the mean absolute error measured using cross-validation? When you type your answer in complete sentences, use inline R code to calculate the absolute difference and insert it directly into your sentence.

YOUR ANSWER IN COMPLETE SENTENCES