Instructions:

The purpose of this mini project is for you to practice analyzing complicated categorical variables where there is no finite list of possible responses. When individuals can respond in sentences, we often get variables that are hard to analyze because nearly every response is unique. The data we are going to look at is AirBnB review data from the city of Boston. The data was downloaded from Kaggle.com, but the original source of the data is http://insideairbnb.com/get-the-data/. Because of this, you are not allowed to use any AirBnB data in your final project. The data is loaded in R and previewed in the code chunk below. Even the variables that are classified as “integer” or “int” are really categorical variables.

If you see “#DO NOT CHANGE”, these lines of code are designed for you to run and examine, but do not change them in any way.

reviews=read.csv(file="reviews.csv",header=T)

str(reviews)
## 'data.frame':    68275 obs. of  6 variables:
##  $ listing_id   : int  1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 ...
##  $ id           : int  4724140 4869189 5003196 5150351 5171140 5198929 6702817 6873023 7646702 8094418 ...
##  $ date         : chr  "2013-05-21" "2013-05-29" "2013-06-06" "2013-06-15" ...
##  $ reviewer_id  : int  4298113 6452964 6449554 2215611 6848427 6663826 8099222 7671888 8197342 9040491 ...
##  $ reviewer_name: chr  "Olivier" "Charlotte" "Sebastian" "Marine" ...
##  $ comments     : chr  "My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The r"| __truncated__ "Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful eve"| __truncated__ "We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside w"| __truncated__ "The room was nice and clean and so were the commodities. Very close to the airport metro station and located in"| __truncated__ ...

In each section, I will ask you to do something with this data. Below you will find links to reading material that I used in the creation of this assignment. You may still need to read the documentation of packages or R functions or search for help online. I expect you to do this assignment on your own without the help of another human or the use of AI tools like ChatGPT. If you get help from another student or use something like ChatGPT on this assignment, you will receive a 0 and be reported.

Helpful Reading Material:

Reviewers

If every review was written by the same person, we would have a serious problem. If all the reviews were written by 10 different people, we would have a serious problem. It is important that we investigate the number of reviews by each reviewer.

First, I want you to calculate/count the number of reviews from each reviewer based off the reviewer_id. Instead of showing these calculations/counts, I want you to use a histogram to show the distribution of the number of reviews from individual reviewers. Use geom_histogram() to create this histogram.

Also, on the visual, I want you to add text using geom_text() to show your audience the average number of reviews in your sample of reviewers, the median number of reviews in your sample of reviewers, and the proportion of all unique reviewers who wrote exactly 1 review. I want to see these statistics written inside of the histogram. Round your values to 2 decimal places.

See the example image named hist_ex.png in the folder to see exactly what I am looking for. I want your x-label and your text to match mine. Your numbers and/or histogram may be different. Use a font size of 8 in geom_text() and try to choose coordinates that put the text in the top right near mine with the mean over the median over the proportion. Your coordinates will not be identical to mine. My advice is to consider code that looks like geom_text(x=?, y=?, label="?",size=8). You need to figure out how the function works to change the question marks appropriately.

Also, when I knit, I get a message that prints about the “bins”. I don’t want to see this in your html file. Use the code chunk option message=FALSE to suppress this message.
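The counting step and the three statistics described above can be sketched on made-up data. This is only a minimal illustration in base R: the vector toy_ids and the names per_reviewer, avg_reviews, med_reviews, and prop_one are hypothetical and do not come from reviews.csv.

```r
# Hypothetical toy reviewer IDs (NOT taken from reviews.csv)
toy_ids <- c(11, 11, 11, 22, 33, 33, 44, 55)

# table() counts how often each ID appears, i.e. reviews per reviewer
per_reviewer <- as.vector(table(toy_ids))

# Summary statistics, rounded to 2 decimal places
avg_reviews <- round(mean(per_reviewer), 2)       # mean reviews per reviewer
med_reviews <- round(median(per_reviewer), 2)     # median reviews per reviewer
prop_one    <- round(mean(per_reviewer == 1), 2)  # share with exactly 1 review

# paste() builds a string that could serve as a text label on a plot
paste("Mean =", avg_reviews)
```

Taking the mean of a logical vector such as per_reviewer == 1 gives the proportion of TRUE values, which is a convenient way to get a proportion without counting by hand.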

#

String Lesson 1

Below I give you a vector called example which contains four strings. Suppose I wanted to create a binary vector called hello that shows 0 when the string doesn’t contain the word “hello” and 1 when the string does contain the word “hello”. I can use the str_detect() function to return TRUE or FALSE depending on whether or not the pattern “hello” is detected. The function as.numeric() converts TRUE to 1 and FALSE to 0. However, in the code below you will notice that the only 1 is for the second string, where the “h” is lower case.

I want you to create your vector hello by using functions applied to the vector example. You can modify the code I have given you below. Your vector should result in c(1,1,1,0). Print your vector hello so it shows in your output. If you just write hello = c(1,1,1,0) and then you print this out, I want the grader to take off 1 more point than the question is worth since this is nonsense.

I just want to say, there are many ways you can do this. I would recommend converting example to lower case. This is a strategy often seen in text analysis, and now that you know the strategy, I trust that a simple Google search can fill in the gaps. If you completely erase my code, and use different functions to do the same thing, you will not be penalized.

example = c("Hello, I am here","I like to hello", "I HELLO when I HELLO","I like jello") #DO NOT CHANGE

hello = as.numeric(str_detect(example,"hello")) 

hello
## [1] 0 1 0 0
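
To see why case matters, here is a small sketch on a different, made-up vector. The vector toy and the names strict and relaxed are hypothetical, and base R's grepl() stands in for str_detect() only so that the sketch runs without loading any package.

```r
# Hypothetical toy strings (NOT the assignment's `example` vector)
toy <- c("My CAT sleeps", "a cat nap", "Dogs only")

# A case-sensitive search misses "CAT" in the first string
strict <- as.numeric(grepl("cat", toy))

# Lower-casing the text first makes the same search case-insensitive
relaxed <- as.numeric(grepl("cat", tolower(toy)))

strict
relaxed
```

The pattern itself never changes here. Only the text being searched is normalized, which is the lower-case strategy mentioned above.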

String Lesson 2

Sometimes it is helpful to count the number of letters or the number of letters that are capitalized.

IF I WRITE LIKE THIS IT IS PROBABLY BECAUSE I HAVE ANGER ISSUES AND NEED HELP.

I am going to give you examples using strings with numbers. I am showing you how to count the total number of alphanumeric characters and the number of digits. See the links for information on regular expressions to get ideas on how my code can be manipulated for example2.

I want you to create a vector named num_letter which returns a numeric vector of the number of letters (lower case or upper case) for each string in example2. Letters are the characters from a to z and from A to Z. I don’t want you to count punctuation.

I also want you to create a vector named num_allcap that returns a numeric vector of the number of capital letters in the string which would just be the characters A to Z.

Then, print out the vectors num_letter and num_allcap to confirm that your code worked. Again, you can count the letters or capital letters and create vectors using the c() function, but if you do this, the grader is going to take off 1 more point than the question is worth in the rubric, because THAT NONSENSE WILL BE CONDEMNED. I would hand count to know what the correct answer is, but please use functions in R applied to example2 to do this. I give you examples of what I am looking for. If you completely erase my code, and use different functions to do the same thing, you will not be penalized.

example2 = c("I love teaching.", 
             "I LOVE teaching and I am weird.", 
             "I LOVE TEACHING ... NOT." ) #DO NOT CHANGE

num_letter = str_count(c("swag345","42money42"),"[[:alnum:]]") 
num_allcap = str_count(c("swag345","42money42"),"[[:digit:]]") 

num_letter
## [1] 7 9
num_allcap
## [1] 3 4
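
Swapping the character class changes what gets counted. The sketch below counts punctuation and whitespace on made-up strings, which are deliberately not the classes this question asks for. The vector toy2 and the helper count_matches() are hypothetical, and the helper uses base R so it runs without stringr.

```r
# Hypothetical toy strings (NOT the assignment's `example2` vector)
toy2 <- c("Wow!!! ok.", "no marks here")

# Hypothetical helper: number of matches of `pattern` in each string
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

num_punct <- count_matches(toy2, "[[:punct:]]")  # punctuation characters
num_space <- count_matches(toy2, "[[:space:]]")  # whitespace characters

num_punct
num_space
```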

Applying Lesson 1

I want you to think of 4 words that have positive connotations when seen in a review, and 4 words that have negative connotations when seen in a review. For example, I might pick the word “Lovely” for a positive word, and “Disgusting” for a negative word. You should look through the comments variable in the reviews dataset to get an idea of the type of language commonly seen in reviews. Do not use the words that I give as examples, and write the words in the list below by replacing WORD1, WORD2, …, WORD4 with the words you chose written in lowercase. For my example, I would write “lovely” and “disgusting”. (Don’t write your words in quotations and don’t choose the words I gave you.)

Now, I want you to create a dataframe called reviews2 based off the reviews dataset which adds variables that have the same names of the words you chose and wrote above. Each positive word variable should be binary (0 or 1) indicating if the positive word is in the comment. For example, I would create a variable called “lovely” which is 0 if the word “lovely” is not in the comment and 1 if the word “lovely” is in the comment. For the negative word variables, I want it to be (0 or -1) indicating if the negative word is in the comment. The value of -1 for the word “disgusting” would indicate that, yes, the word “disgusting” is in the comments.

The code colSums(reviews2[,7:14]) sums up the columns for the 8 new variables. I want you to run this code at the end, and this should be the ONLY output. If you pick a word that DOESN’T EXIST in any of the 68,275 reviews, you need to pick a different word.

#

Applying Lesson 2

Now, I want you to create a new dataset called reviews3 based off reviews2. Add variables to your dataset called num_letter and num_allcap that measure the number of letters and the number of capitalized letters in the comments. However, I also want you to create a variable called prop_allcap which measures the proportion of the letters in each comment that are capitalized. Think about why the proportion would be more useful than just the count.

The code colMeans(reviews3[,15:17],na.rm=T) takes the average of the three new variables. I want you to run this code at the end, and this should be the ONLY output.
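
One plausible reason for the na.rm=T argument (this is an assumption, not something stated in the assignment) is that a comment with zero letters produces 0/0, which R reports as NaN. The sketch below shows this on made-up counts. The names toy_caps, toy_letters, and toy_prop are hypothetical.

```r
# Hypothetical counts for three made-up comments
toy_caps    <- c(2, 0, 0)   # capital letters in each comment
toy_letters <- c(8, 5, 0)   # total letters; the third comment has none

# Proportion of letters that are capitalized; 0/0 gives NaN
toy_prop <- toy_caps / toy_letters

mean_with_nan    <- mean(toy_prop)                # NaN spoils the average
mean_without_nan <- mean(toy_prop, na.rm = TRUE)  # NaN is dropped first

mean_with_nan
mean_without_nan
```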

#

Questions

Answer the following questions in the space provided in complete sentences. For these questions, I think there is an ideal answer, but there may be other good answers if defended well or explained. Each question is worth 2 points. You will lose 1 point if there is bad grammar or a lack of complete sentences anywhere.

  1. Suppose we create a new variable called Pos_Interact that is basically the multiplication or product of the 4 “positive” variables we created. How would you explain what this new variable measures to a person with little math background? Also, I want you to defend why you believe this new variable would be useful or not be useful in our dataset.

PLACE ANSWER HERE IN COMPLETE SENTENCES.

  2. Suppose we create a new variable called Neg_Interact in the exact same way for the negative word variables. What would be the possible values of this new variable, and what does each value represent in words? Your description of each of the possible values in words should be written in a way that a person with little math background would understand.

PLACE ANSWER HERE IN COMPLETE SENTENCES.

  3. The reason we counted the number of capital letters is that the amount of capital letters could indicate a very angry comment and therefore a very negative comment. Writing in all caps is sometimes used to convey excitement or to emphasize points. However, I think that in this context, the use of all caps is probably more often negative than positive. Based on my defense of this variable, why is the prop_allcap variable more valuable or better than the num_allcap variable? I want you to explain this as if you are talking with someone with very little math background. I want you to make up an example or situation that helps your audience understand the superiority of prop_allcap.

PLACE ANSWER HERE IN COMPLETE SENTENCES.

Word Cloud

In this section, I want you to read the recommended reading material about creating word clouds. We have 68,275 different comments from different people. When I followed the step-by-step instructions on the website to create a word cloud, I ran into an error because a vector of size 28 gigabytes could not be created. Because of this, I want you to randomly sample 10,000 of the 68,275 comments and make a word cloud. I want the word cloud to only contain words that occurred at least 500 times. I also want to see a word cloud with a maximum of 200 words. You will need to convert your vector of comments to a corpus (list of texts) so that you can clean the data using the tm library in R. For example, we need to remove numbers, punctuation, and useless words like “a”, “and”, “the”, etc.

The sample() function in R can be used to sample 10,000 comments. For example, sample(1:10,5, replace=FALSE) can be used to randomly choose, without replacement, 5 numbers out of a bag that contains the numbers 1 through 10. The code I give you below samples 50 out of the first 1000 comments in the original dataset named reviews. Since we are randomly sampling, I recommend using the set.seed() function in R to ensure that every time you knit the document, you get the same random sample of 10,000 comments.

After you figure out how to make the word cloud using the vector of 50 comments in sample_reviews, you should modify the sample() function so that you are sampling 10,000 random comments. The only output from this code should be your word cloud. Remember, I only want words that occur at least 500 times in the sample of 10,000 comments, and I only want a word cloud with 200 words or fewer. You should use the code chunk option warning = FALSE to suppress warnings about words that could not be included in the word cloud due to limitations. In my solutions, I did the basic round word cloud from library(wordcloud), but you may do more advanced things with other word cloud packages as long as you follow my instructions.

#library(wordcloud)
#library(RColorBrewer)
#library(wordcloud2)
#library(tm)

set.seed(3456)
sample_reviews = reviews[sample(1:1000,50,replace=FALSE),]$comments
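
The effect of set.seed() can be checked with the small bag-of-numbers example from above. This is only a sketch: the seed value 1 and the names first_draw and second_draw are arbitrary, hypothetical choices.

```r
# Resetting the same seed before each draw reproduces the same sample
set.seed(1)
first_draw <- sample(1:10, 5, replace = FALSE)

set.seed(1)
second_draw <- sample(1:10, 5, replace = FALSE)

same_sample <- identical(first_draw, second_draw)  # TRUE: draws match
no_repeats  <- anyDuplicated(first_draw) == 0      # TRUE: no number drawn twice

same_sample
no_repeats
```

Without the second set.seed() call, the two draws would generally differ, which is why the seed is set before sampling in a knitted document.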

Creative Visual 1

Make a visual that you think would be useful that involves summing up your binary positive variables and summing up your binary negative variables. I am expecting you to create two new variables only based off the 8 variables you created.

#

Creative Visual 2

Make a visual to summarize the change in something over time. I am expecting the x-axis to represent time (year, month, etc.), and the y-axis could be anything. I want you to convert the date variable, which is currently a chr variable, to an actual date variable. You can create other new variables that you think are interesting here.
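
As a minimal sketch of the conversion, base R's as.Date() can parse strings in the year-month-day layout seen in the data preview. The vector toy_dates and the names parsed, yr, and mon are hypothetical. The first toy value matches the layout of the previewed dates and the second is made up.

```r
# Hypothetical character dates in the same "YYYY-MM-DD" layout
toy_dates <- c("2013-05-21", "2014-11-02")

# Convert chr to Date by describing the layout with a format string
parsed <- as.Date(toy_dates, format = "%Y-%m-%d")
class(parsed)

# Once parsed, pieces such as year or month can be pulled out as text
yr  <- format(parsed, "%Y")
mon <- format(parsed, "%m")

yr
mon
```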

#

Creative Visual 3

Make a visual that compares listings that are exceptionally bad to listings that are exceptionally good. How you define exceptionally good and bad is up to you. I would definitely think about this: Would it be fair to say a restaurant is exceptionally bad because 5/10 reviews are bad? Think critically about the visual you create, but the audience should see the worst listings and the best listings so they know the disparity between what is good and what is bad. This is another situation where creating other variables would be very useful and show creativity.

#

Rubric

| Task | Points |
|---|---|
| Reviewers: Histogram | 2 Points |
| Reviewers: x-Label | 1 Point |
| Reviewers: Text in Top Right | 1 Point |
| Reviewers: Correct Font Size | 1 Point |
| Reviewers: Correct Numbers | 3 Points |
| Reviewers: No Message | 1 Point |
| String Lesson 1 | 2 Points |
| String Lesson 2 | 2 Points |
| Apply Lesson 1: Modified Bullet Points | 2 Points |
| Apply Lesson 1: Four Positive Vars Created Correctly | 2 Points |
| Apply Lesson 1: Four Negative Vars Created Correctly | 2 Points |
| Apply Lesson 1: Output Doesn’t Contain 0’s | 1 Point |
| Apply Lesson 1: Showed Output | 1 Point |
| Apply Lesson 2: Matching Output for Each Variable | 3 Points |
| Questions: Correct/Acceptable Answers | 6 Points |
| Questions: Complete Sentences and Proper Grammar | 1 Point |
| Word Cloud: Correct Code for Sampling | 2 Points |
| Word Cloud: Relevant Code for Word Cloud | 2 Points |
| Word Cloud: Word Cloud is in Output | 3 Points |
| Word Cloud: No Warnings in HTML File | 2 Points |
| Creative Visual 1 | 2 Points |
| Creative Visual 2 | 2 Points |
| Creative Visual 3 | 2 Points |