Final Paper

INTRODUCTION

In a world where content on social media, such as YouTube, has become a big business, knowing how to increase the chance of popularity and virality is highly valuable to those wanting to reach out to as many viewers as possible. YouTube allows companies to be in front of hundreds of millions of people. Last year, it was estimated that the top YouTubers made over 20 million dollars. Influencers, companies and the like are constantly testing new methods for getting their message to more potential users and customers.

The theme of our analysis pertains to virality on YouTube and more specifically, “how to go viral”. Virality is a combination of views, likes, dislikes, comments, and interactions in general. What is interesting about our data set is that it shows that there may in fact be various factors, aside from the actual video itself, that go into a YouTube video’s chances of virality. In our initial data exploration, we were able to get a better idea of how the variables relate to one another. This helped us narrow our focus down to two questions related to views and how external events in our society can affect the YouTube videos.

Question 1: Which variables most influence the chance of obtaining a higher number of views?
Question 2: Can you identify when events occur by the words used in popular videos’ titles and tags?

For the first question, we looked into which variables might most heavily influence the chances of obtaining a higher number of views. We feel as though the answer to this question could help outline general features of successful videos. By summarizing this information, we can help potential content creators reach people more quickly and thus increase their potential for earning advertising revenue. Next, we looked into our second question by analyzing how current events influence the popularity of videos with key search words pertaining to the topic. This could be helpful in determining whether specific buzz words are common due to regular viewer habits or are instead determined by societal changes/events. This is also important in looking at how content creators can take advantage of large events in order to gain popularity on YouTube.

Through the exploration and analysis of the two aforementioned questions, perhaps we can discover what assists a video in going viral. If there is a formula, then YouTubers would undoubtedly want to use that information to help increase the overall outreach of their videos. If you want to know what it takes to go viral, then keep reading. The information is out there; now, it’s all about interpreting it and using it to our advantage. Here is your ultimate “how-to” guide to going viral… Virality = CASH MONEY.

DATA

Our dataset, collected by computer scientist Mitchell Jolly out of Scotland, includes over 120,000 observations of trending videos on YouTube. Each observation represents a trending video collected between November 14, 2017 and June 14, 2018. The columns of the data record a range of information for each trending video, such as trending date and the number of likes, dislikes, and views. The following table shows the most important variables in our data for three sample observations:

ID	Trending_Date	Title	Channel	Category_ID	Views	Likes	Dislikes	Comments	Country
2kyS6SvSYSE	11/14/17	WE WANT TO TALK ABOUT OUR MARRIAGE	CaseyNeistat	22	748374	57527	2966	15954	US
1ZAPwfrtAFY	11/14/17	The Trump Presidency: Last Week Tonight with John Oliver (HBO)	LastWeekTonight	24	2418783	97185	6146	12703	US
5qpjK5DgCt4	11/14/17	Racist Superman | Rudy Mancuso, King Bach & Lele Pons	Rudy Mancuso	23	3191434	146033	5339	8181	US

Our dataset includes 15 different variables, and we determined that some are more applicable to our analysis than others. One important variable is the “trending date”, which is the date when the video was featured on YouTube’s trending page. According to YouTube’s website, the trending page is driven by a proprietary algorithm that is updated every 15 minutes and weighs factors such as appeal, genuineness, diversity, views, and how fast views are generated, along with many others. A unique “Category ID” represents the genre of the video from a selection of 32 different categories ranging from sports to politics to personal blogging. “Tags” are words or phrases that the uploader attached to the video to aid in search relevancy. Other important variables, some of which we used to determine virality, were “Views”, “Likes”, “Dislikes”, the number of “Comments” and which “Country” the video was uploaded from.

“Category ID” is of particular interest to our analysis of the two key questions. Once we determine which categories represent the majority of the data, we are better able to address each question with regards to the most important categories. Ultimately, we created our model for question 1 based on the most popular categories and then explored how certain categories may be influenced based on current events. Since the video category is such a key variable for our overall analysis, we created the following figure to depict how major variables relate to each of the most prominent “Category ID”s:

With regard to our project theme, we put quite a lot of focus on variables that indicate virality. We determined the following indicators to be associated with a viral video: a high number of views, likes and comments, all relative to other trending videos. Based on our initial data analysis, we found these three measures to be positively correlated with one another. In other words, more views typically means more likes and also more comments. Therefore, we can focus our attention on how the number of views relates to other variables in our data. If we understand these relationships in greater detail, we can understand more about what it takes to go viral.

RESULTS

To answer the first question, “Which variables most influence the chance of obtaining a higher number of views?”, we had to modify the original data set to a workable size and scale. We first attempted to model views based on “Publish Hour”, “Likes”, “Dislikes” and “Comments”. We defined an outlier as any value greater than the third quartile plus 1.5 times the interquartile range. Even after removing outliers, the scale of the variables, particularly views, made any visualizations misleading and unclear. We then decided to model the logarithm of “Views”, “Likes”, “Dislikes” and “Comments” in order to make the scale manageable.
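The outlier rule and log adjustment described above can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis: the view counts are made up, the quartile method is whatever `statistics.quantiles` uses by default, and `log1p` is an assumption chosen so that a count of zero stays finite.

```python
import math
import statistics


def upper_fence(values):
    """Return Q3 + 1.5 * IQR, the upper outlier cutoff."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)


def log_adjust(value):
    """Log-transform a count; log1p (an assumption) keeps zero counts finite."""
    return math.log1p(value)


# Hypothetical view counts, not taken from the dataset.
views = [1_200, 3_400, 5_600, 8_900, 12_000, 15_500, 21_000, 48_000, 9_500_000]

fence = upper_fence(views)
kept = [v for v in views if v <= fence]          # drop values above the fence
adjusted_views = [log_adjust(v) for v in kept]   # compress the remaining scale
```

Even after the extreme value is dropped, the raw counts still span more than an order of magnitude, which is the motivation for working on the log scale.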

Additionally, we decided to only include videos with “Category ID”s of Music, People & Blogs, Comedy, Entertainment and Howto & Style. These were the top five “Category ID”s, together representing over 68% of the total data. Therefore, we believe that these five categories represent the best route for someone publishing videos to go viral. Other top categories such as News & Politics and Sports are dominated by established businesses, websites or news outlets such as CNN, ESPN, or Fox News. In the five categories that we selected to keep, we assumed that there is no barrier to entry and that anyone publishing videos in these categories could go viral.

To create the actual models, we split the remaining data into two data sets: “Test” and “Train”. For each variable, we created a linear model by testing values for the slope and intercept. We picked the top potential models by finding the combinations of values that minimized the mean squared error and the mean absolute error. After obtaining these optimized values, we graphed the line for each variable on plots using the “Test” and “Train” data. For these plots, x was the variable of interest (for example, “Adjusted Likes”), while y was “Adjusted Views”.
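One way to read the procedure above is as a grid search over candidate slopes and intercepts. The sketch below is an illustration in Python under stated assumptions: the data are synthetic, the 80/20 split and the grid ranges are arbitrary choices, and the function names are hypothetical rather than taken from the original analysis.

```python
import random


def mse(xs, ys, slope, intercept):
    """Mean squared error of the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mae(xs, ys, slope, intercept):
    """Mean absolute error of the line y = slope * x + intercept."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


def grid_search(xs, ys, slopes, intercepts, loss):
    """Return the (slope, intercept) pair from the grid with the smallest loss."""
    return min(
        ((s, b) for s in slopes for b in intercepts),
        key=lambda pair: loss(xs, ys, pair[0], pair[1]),
    )


# Synthetic stand-ins for "Adjusted Likes" and "Adjusted Views".
rng = random.Random(320)
adjusted_likes = [rng.uniform(2.0, 14.0) for _ in range(200)]
adjusted_views = [0.8 * x + 5.0 + rng.gauss(0.0, 0.5) for x in adjusted_likes]

# Assumed 80/20 split into "Train" and "Test".
split = int(0.8 * len(adjusted_likes))
train_x, test_x = adjusted_likes[:split], adjusted_likes[split:]
train_y, test_y = adjusted_views[:split], adjusted_views[split:]

slopes = [i / 100 for i in range(0, 151)]     # candidate slopes 0.00 to 1.50
intercepts = [i / 10 for i in range(0, 101)]  # candidate intercepts 0.0 to 10.0

# Fit on the training data, then score the chosen line on the held-out data.
best_slope, best_intercept = grid_search(train_x, train_y, slopes, intercepts, mse)
test_mse = mse(test_x, test_y, best_slope, best_intercept)
test_mae = mae(test_x, test_y, best_slope, best_intercept)
```

Passing `mae` instead of `mse` as the `loss` argument selects the line that minimizes mean absolute error, so both criteria can be compared on the same grid.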

Using the graphs, we concluded that “Publish Hour” seemed to have no meaningful impact on the number of adjusted views. The graphs for both the “Test” and “Train” data showed very little change in adjusted views as “Publish Hour” increased. The slope of the modeled line was -0.0235. The other variables, “Likes”, “Dislikes” and “Comments”, all had clearly non-zero slopes: 0.778, 0.83, and 0.778, respectively. Each therefore showed a visibly positive relationship with adjusted views. This finding was supported by the Mean Squared Error and Mean Absolute Error of each model. “Publish Hour” had the largest values for both measures, while “Dislikes” had the lowest values among the single-variable models.

Finally, we created another model using the variables that we deemed to be significant: “Likes”, “Dislikes” and “Comments”. This model had the lowest MSE and MAE values, and all three variables had p-values below 2e-16 in the regression. The relationship between predicted and actual adjusted views closely followed the 45-degree line, the line on which predictions exactly equal the actual values. This model, fit on the training data, performed the best of any model on the testing data, and the “Dislikes” model performed the best of the single-variable models on the testing data.

Models	MSE	MAE	Rank
Model Publish Hour	3.0221912	1.3707563	5
Model Likes	0.6771171	0.6448917	3
Model Dislikes	0.5980520	0.5914612	2
Model Comments	0.9420293	0.7571520	4
Model All Excluding Publish Hour	0.4043695	0.4913663	1
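For readers who want to see what a combined model of this kind involves, the sketch below fits a three-predictor least-squares regression by solving the normal equations. It is a Python illustration under assumptions, not the regression reported above: the data are synthetic, the predictors are generated independently (real likes, dislikes and comments are correlated), and all function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_least_squares(rows, ys):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Predicted value for one row of predictors."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Synthetic log-scale (likes, dislikes, comments) and views; not real data.
rng = random.Random(2019)
rows = [(rng.uniform(2, 12), rng.uniform(0, 8), rng.uniform(0, 9)) for _ in range(300)]
ys = [4.0 + 0.5 * l + 0.3 * d + 0.1 * c + rng.gauss(0.0, 0.3) for l, d, c in rows]

train_rows, test_rows = rows[:240], rows[240:]
train_ys, test_ys = ys[:240], ys[240:]

coefs = fit_least_squares(train_rows, train_ys)
predictions = [predict(coefs, row) for row in test_rows]
test_mse = sum((y - p) ** 2 for y, p in zip(test_ys, predictions)) / len(test_ys)
```

Plotting `predictions` against `test_ys` gives the predicted-versus-actual view; points lying near the 45-degree line correspond to a low test MSE.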

Our second question was, “Can you identify when events occur by the words used in popular videos’ titles and tags?” We first approached this question by deciding how we wanted to visualize our findings and how we could clean and manipulate the data into the correct format for that visualization. Then, we decided to look for sporting events in the data, to see whether we could tell from the keywords when events occurred in the timeframe. In order to do this, we knew that we needed to distinguish the data by month, views, category, and trending date.

After selecting these columns from the main data, we determined that our final visualization could pull the top 100 most-used buzzwords from the top 1500 most viewed videos of each month. Then, our output could cycle through months and show the change of usage of these top buzzwords.

We started from the code chunk in our exploratory data analysis that separated the “Tags” and “Title” columns into lists of individual words. Since we were no longer comparing titles against tags, we eventually changed much of this code to output a master list of unique words from “Title” and “Tags” for each video, discarding duplicates, excess spacing and punctuation. Then, we removed insignificant words such as “how”, “the”, “new” and any two-letter words. Once we had this data, we created eight DataFrames, one for each trending month.
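The cleaning step described above might look like the following minimal Python sketch. The stopword set contains only the three examples named in the text (the full list is not given), the function name and sample strings are hypothetical, and dropping one-letter tokens along with two-letter words is an assumption.

```python
import re

# Only the three examples from the text; the full list used is not known.
STOPWORDS = {"how", "the", "new"}


def video_words(title, tags):
    """Return the unique, cleaned words from one video's title and tags."""
    text = f"{title} {tags}".lower()
    # Split on anything that is not a letter or digit, which also discards
    # punctuation, tag separators and excess spacing.
    tokens = re.findall(r"[^\W_]+", text)
    # Assumption: tokens of one or two characters are treated as insignificant.
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}


# Hypothetical title and tag string, not a row from the dataset.
words = video_words(
    "How the New England Patriots Won | NFL",
    'patriots|"super bowl"|nfl',
)
```

Returning a set means each word counts at most once per video, which matches the goal of a master list of unique words for each video.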

After we had each of the unique month tables, we put them all together by month and count of the buzzwords. Below is a snippet of our first table from this section. As you can see from the head of the DataFrame, the top words for each month were highly similar. We attributed this to the high popularity of music videos on YouTube that we had found in the exploratory data analysis. Since we had decided to look at sports, this was not helpful for our question, and we needed to narrow down the words by category.

November	November_Counts	December	December_Counts	January	January_Counts	February	February_Counts
video	308	video	238	video	273	video	282
2018	234	2018	201	official	207	official	210
official	222	official	190	2018	204	2018	201
music	216	music	181	music	173	music	199
funny	159	funny	170	funny	157	funny	188
comedy	129	show	149	show	130	show	149

Due to our focus on sports, we narrowed down the data to two categories: 17 (Sports) and 19 (Travel & Events). From here, we could clearly see more focus on the topic at hand in the buzzwords, but there was still much overlap in the buzzwords from month to month. To combat this, we focused on only the top 20 words from each month. We identified that the 2018 Super Bowl was in February, and that NBA Saturday prime time television ran between January and April. For these reasons, in addition to needing consistent buzzwords to compare across months, we decided to take the top 20 buzzwords from February as the x-axis for our visualization and compare those words across the months. From here, we manipulated the data to be formatted for ggplot2 and gganimate.
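The idea of fixing one month's top buzzwords and tracking them across every month can be sketched as below. This is a Python illustration with made-up word sets; the function names are hypothetical, and `top_n` is set to 3 here only to keep the example small (the analysis above uses the top 20 from February).

```python
from collections import Counter


def monthly_counts(videos):
    """Map month -> Counter of words; each video contributes a word at most once."""
    counts = {}
    for month, words in videos:
        counts.setdefault(month, Counter()).update(set(words))
    return counts


def track_reference_words(counts, reference_month, top_n):
    """Count the reference month's top_n words in every month."""
    reference = [w for w, _ in counts[reference_month].most_common(top_n)]
    return {month: {w: c[w] for w in reference} for month, c in counts.items()}


# Hypothetical (month, words) pairs, not rows from the dataset.
videos = [
    ("January", {"nba", "basketball", "highlights"}),
    ("January", {"nba", "highlights"}),
    ("February", {"super", "bowl", "patriots", "highlights"}),
    ("February", {"super", "bowl", "eagles", "highlights"}),
    ("February", {"nba", "highlights"}),
]

counts = monthly_counts(videos)
tracked = track_reference_words(counts, "February", top_n=3)

# Long format (month, word, count), a convenient shape for an animated bar chart.
long_rows = [(m, w, n) for m, by_word in tracked.items() for w, n in by_word.items()]
```

Because the word list comes from a single reference month, every month shares the same x-axis, and words tied to a one-time event show up as a spike in that month only.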

Above is our final product for question 2. As you can see, the months rotate from November 2017 to June 2018, showing the number of uses of the key buzzwords we gathered from February. The words “super”, “bowl”, “england” and “patriots” all spike in February, which is what we expected to find given that Super Bowl videos gain popularity in February. We can also see that “basketball” increases in appearance between February and April, in concurrence with NBA prime time televising; it then tapers off towards May and June. Another finding is the use of the year as a tag or title word. “2017” is highly used during November and December and is then slowly replaced by “2018” as 2018 begins.

CONCLUSION

In answering our first question about which variables most heavily influence views, we found that the number of dislikes was the best single predictor of views, although we do not believe this means that YouTube publishers should publish videos that they expect to be controversial. The “Likes” model came in second among the single-variable models, and this difference could be due to randomness in the data splits or to YouTube factors such as internet trolls not allowing a video, no matter the content, to have zero dislikes. The combination model predicted views best, suggesting that interactions of all forms, taken together, may be the best indicator of higher views. For our second question about identifying events in time by buzzword usage, we found that the top sports buzzwords are relatively constant throughout the months. Although sports videos remain regularly popular and discussed on YouTube, certain buzzwords pertaining to massive events such as the Super Bowl can break through to the top as well. In terms of how to go viral, we can ultimately recommend using buzzwords that pertain to a specific event during the month of that event, when they are best received.

These results are especially relevant in the real world considering how influential social media content has become to our society. People are constantly consuming visual content, and this actively shapes their lifestyle decisions both consciously and subconsciously. YouTube is one such media platform that many bloggers and influencers utilize in order to broadcast various products, companies and ideas. With the potential for increases in popularity and, in turn, increases in revenue, it makes sense that companies and video creators would want to know what it takes to go viral. There is a major opportunity to become influential, make money, and truly have an impact on society. Since there is so much at stake, content creators should not just “wing it” and hope their videos spread rapidly by chance. Producing a video should be a calculated and well-thought-out effort, both to be cost and time efficient and to ensure that the effort is not wasted. While it is important to know how to make the most of people’s online viewing habits now, this knowledge will only continue to rise in importance as technology continues to advance rapidly.

While the dataset was fairly cohesive overall, we think certain additions or changes would have made for a more robust analysis. Data on videos that did not end up trending would have been helpful, since we could have used those videos as a control group. Comparing trending and non-trending videos could give more insight into which variables lead to virality and into whether or not going viral is a random process. As our model stands, we can predict views based on likes, dislikes, and comments. However, our data covers only trending videos, and the effect these interactions have on view counts may be different for videos that never trended. A virality score could also greatly improve our model. Some of the numbers are extremely large, especially view counts, and this could skew some of the results. We handled this issue in our model by taking the log of the interaction values. However, if we could create a single virality measure that includes all of the interaction values, we might be able to increase predictive accuracy.

If we had access to a greater range of dates, and if those dates were more recent, we could have provided a more up-to-date set of recommendations regarding the themes and similarities that exist between viral videos in 2019. We also felt it would have been beneficial if the dataset included information on where the viewers and video creators were located, not just where the video was uploaded from. With this information, we could have determined which countries, on both the producing and the receiving end, yield higher chances of virality. Furthermore, we could have explored which countries make up the highest number of viewers, which would enable content creators to target certain populations in order to acquire more views. This would have made our results even more specific, and we could have made recommendations regarding specific locations.

We found it surprising that publishing hour had little effect on a video’s views. Users published viral videos at every hour of the day, and views showed no clear pattern across publish hours. This seems to contradict popular theories for other social media sites that there exists a sweet spot for publishing to gain higher amounts of interaction. While we had plenty of questions to explore with the information the dataset does provide, having the aforementioned information would have opened the door to additional interesting analysis.


STOR 320.01 Group 3

December 05, 2019
