INTRODUCTION

When looking at data containing information on pokemon type and attributes, one immediately thinks of the uber-popular video game and TV show. As we looked at this data, we wanted to use it to gain insight into some of the underlying ways the game itself works. To achieve this, we decided to look at some of the things that are most important to pokemon players. At first look, one of the things that stood out to us was the legendary or non-legendary status of pokemon. Legendary pokemon are widely considered to be the most powerful and effective in the game, and most players spend a lot of time trying to acquire these pokemon. One of the things that we found most interesting was the relationship between a pokemon’s attributes and its legendary status. After considering how this would affect the game, we wanted to analyze this question: given data or even a limited amount of data on an arbitrary pokemon, can we predict if the pokemon in a particular game is legendary?

Consider a scenario where a new generation of pokemon has been released but the legendary pokemon have not been revealed. Knowing which pokemon are most likely to be legendary can tell players which pokemon are worth capturing or trading for. Because the hunt for legendary pokemon is such a big part of the game, having the upper hand when it comes to legendary pokemon is vital. While we have all of the data needed to analyze this question, we also wanted to consider how we could figure out a pokemon’s legendary status if we only have a limited amount of data. In other words, what are the most crucial Pokemon attributes when it comes to discovering a Pokemon’s legendary status? This question deepens our understanding of which attributes matter most.

As we dove further into our dataset, we wondered more about the things we could find out about pokemon. Our focus was not only on the pokemon in the data set, but also on the pokemon still to come. The only variable in the data set related to time is the generation. As a result, exploring how generation is related to pokemon could tell us how the game has evolved or will continue to grow in the future. In our data, each pokemon belongs to one of six distinct generations. As the game has evolved and more games have been produced, the creators of pokemon have released new types of pokemon in “generations,” which are essentially large sets of pokemon.

While attempting to see how the game has changed, we arrived at this important question: are there significant and explainable differences between pokemon generations, or are the groupings largely arbitrary? This is a crucial question because it can help us both understand the game more and predict how future pokemon may turn out. If a player understands in detail the different pokemon and how their generation affects their power in the game, it could make a huge difference in the success that they have. In addition, many people like playing with the newer pokemon, but is there any real advantage to using the later generations? The data and our analysis of it will attempt to answer this and provide us with important knowledge of the game and how to play it.

DATA

While our data is probably very recognizable, our source and the information are a lot more complex than they seem. The data we used was compiled by Alberto Barradas, a Computer Systems Engineer, graduate of Guanajuato University, and obvious pokemon enthusiast. The data itself was acquired from multiple sources including pokemon.com, pokemondb, and bulbapedia. The dataset contains 800 pokemon and their respective types and stats from the classic pokemon games, not pokemon cards or Pokemon Go. These 800 pokemon represent a subset of all the pokemon up until the latest pokemon generation.

In order to answer the questions we posed, we focused on almost all of the variables in our data set. Below is a simple table that provides a view of the pertinent data. This table lists the relevant variables for the first 10 pokemon in the data. Each row represents a single pokemon and its stats. As you can see, they are all of the earliest generation, Generation 1. This table will be useful as the variables are explained further in this section. First we will look at the qualitative variables in the data. The simplest variable is Legendary, which is true if a pokemon is legendary and false if not. Another relatively basic variable is Generation, the generation number ranging from 1 to 6. Generation indicates when a specific pokemon was released into the game, with the first generation being the earliest and the sixth generation the most recent. Lastly, we have Type1, which gives the main type of the pokemon. A pokemon’s type is often its most distinctive identifier and is the main factor in determining a pokemon’s attributes. There are 18 unique types of pokemon in this data set, including Ground, Fire, Electric, Poison, and many more.

Type1  HP  Attack  Defense  Sp.Atk  Sp.Def  Speed  Generation  Legendary
Grass  45  49      49       65      65      45     1           False
Grass  60  62      63       80      80      60     1           False
Grass  80  82      83       100     100     80     1           False
Grass  80  100     123      122     120     80     1           False
Fire   39  52      43       60      50      65     1           False
Fire   58  64      58       80      65      80     1           False
Fire   78  84      78       109     85      100    1           False
Fire   78  130     111      130     85      100    1           False
Fire   78  104     78       159     115     100    1           False
Water  44  48      65       50      64      43     1           False

Next are the pokemon attributes, which determine how effective a pokemon is when fighting other pokemon. HP, which stands for hit points, defines how much damage a pokemon can withstand before fainting. Attack is the basic variable which measures normal pokemon attacks such as Scratch or Punch, while Defense represents the base damage resistance against these normal attacks. In terms of special attacks, Sp.Atk stands for a pokemon’s special attack, while Sp.Def represents the base damage resistance against special attacks. Lastly, Speed determines which pokemon attacks first during each round. All of these attribute values range from 5, being the worst, to 250, being the best possible. Below you can see a radar graph that shows the attributes of a Bulbasaur. The radar chart illustrates the relative magnitude of each attribute for this pokemon.

For our initial question, we sought to use all of the above variables to predict the modeling variable, Legendary. This way, we utilize all combinations of attributes that actually dictate a pokemon’s legendary status. We also required generation for our predictive modeling and cross-validation methodology. For our generation question, we used Type1 along with all the other aforementioned pokemon attributes. This comprehensive variable usage allows us to best explore the data and determine how all the variables interact as the pokemon generation changes.

RESULTS

Question 1

To answer our first question and predict the legendary status of an arbitrary pokemon, we used predictive modeling techniques. As a first step, we split our data through random sampling into train and test sets, the former comprising 80% of the data and the latter 20%. Next, we built logistic models to predict the probability that a pokemon is legendary using different combinations of predictors. Initially, we attempted to use all the variables at our disposal in our prediction. With all the information available, we used the bestglm function from the bestglm package, which searches all possible combinations of predictors for the logistic model with the lowest BIC value. BIC is an estimate of a function of the posterior probability of a model being true under a certain Bayesian setup, so that a lower BIC means that a model is considered more likely to be the best model.
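The split and the search space described above can be sketched in a few lines. This is a minimal illustration in Python, not the R workflow used for the analysis: the function names, the fixed seed, and the assumption that the seven candidate predictors are HP, Attack, Defense, Sp.Atk, Sp.Def, Speed, and Generation are ours, and the model fitting that bestglm performs for each subset is left out.

```python
import itertools
import math
import random

# Assumed candidate predictors (the seven non-Type1 variables in the data).
PREDICTORS = ["HP", "Attack", "Defense", "Sp.Atk", "Sp.Def", "Speed", "Generation"]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, test) without replacement."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def candidate_subsets(predictors, max_size=None):
    """Yield every non-empty subset of predictors with at most max_size members."""
    max_size = max_size or len(predictors)
    for size in range(1, max_size + 1):
        yield from itertools.combinations(predictors, size)


# Row indices stand in for the 800 pokemon in the data set.
train, test = train_test_split(range(800))
all_subsets = list(candidate_subsets(PREDICTORS))
three_predictor_subsets = [s for s in all_subsets if len(s) == 3]
```

With seven candidates there are 127 non-empty subsets, 35 of which use exactly three predictors, so an exhaustive search of this kind is cheap at this scale.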

When choosing from among all variables, we discovered that the formula for the best model was Legendary = Attack + Defense + Sp.Atk + Sp.Def + Speed + Generation. This model had the lowest BIC, 165.25. As you will notice, this model uses six of the seven available predictors, all except HP, and each of the six has a statistically significant p-value. The next best model in this case was given by Legendary = Attack + Sp.Atk + Sp.Def + Speed + Generation. This model only uses five predictors, also all with statistically significant p-values. When further investigating this question, it is important to consider the situation in which not all of the pokemon data is available to us. To replicate this, we built models based upon only 3 predictors. Again, using the bestglm function, we found the two models with the lowest BIC. The first model is based on the formula Legendary = Attack + Sp.Def + Speed; we found that it has significantly small p-values and a BIC of 185. The second model is Legendary = Attack + Sp.Atk + Sp.Def. All of its predictors are significant, and its BIC is 186.9.
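Because these are logistic models, the formula for the best model can be read as a statement about the log-odds of being legendary. Written out, and assuming that Generation enters as a single numeric term (the coefficient values themselves are not reported here), the six-predictor model is:

```latex
\[
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\mathrm{Attack}
  + \beta_2\,\mathrm{Defense}
  + \beta_3\,\mathrm{Sp.Atk}
  + \beta_4\,\mathrm{Sp.Def}
  + \beta_5\,\mathrm{Speed}
  + \beta_6\,\mathrm{Generation},
\qquad p = P(\mathrm{Legendary} = 1).
\]
```

The five-predictor and three-predictor models have the same form with the corresponding terms removed.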

After finding our best models, we then set out to see how effective they were in predicting the legendary status of a pokemon. With the best two models among all predictors and the best two three-predictor models, we used the test data we set aside at the beginning of the process and compared the true data with the predicted result. This was done by validating our predictions against the actual results for this held-out group. For each model, we counted the cases in which the model predicted the legendary status correctly, as well as both the false positives and the false negatives. A false positive refers to when a prediction of legendary is made when the pokemon is normal, and a false negative is the opposite. When we applied the models to our test set, we were able to see that the predictions were very accurate on our test set. From the figure below, we see that the model with six predictors as well as the three predictor model with Sp.Atk as one of its predictors were the best models for our test set.
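The comparison of predicted and true labels amounts to tallying four counts. The sketch below shows one way to do that in Python; the 0.5 probability cutoff is an assumption (no threshold is stated in this report), and the probabilities and labels are made-up values used only to show the mechanics.

```python
def classify(probability, threshold=0.5):
    """Label a pokemon legendary when its predicted probability reaches the threshold."""
    return probability >= threshold


def confusion_counts(actual, predicted):
    """Tally true/false positives and negatives for two boolean label sequences."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1   # predicted legendary, is legendary
        elif guess and not truth:
            counts["FP"] += 1   # predicted legendary, is normal
        elif truth:
            counts["FN"] += 1   # predicted normal, is legendary
        else:
            counts["TN"] += 1   # predicted normal, is normal
    return counts


# Hypothetical predicted probabilities and true labels, for illustration only.
probabilities = [0.91, 0.12, 0.64, 0.03, 0.40]
actual = [True, False, False, False, True]
predicted = [classify(p) for p in probabilities]
counts = confusion_counts(actual, predicted)
```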

Furthermore, we can use calculated metrics to give us a numeric measure of the model performance. We used sensitivity and specificity as our criteria, and we also found the FPR (false positive rate) and FNR (false negative rate) of each model. Sensitivity refers to the ability to correctly predict that a pokemon is legendary when it is in fact legendary, while specificity refers to the ability to correctly determine that a pokemon is not legendary when it is a normal pokemon. FPR refers to the rate of false positives as described above, and the FNR is the same for false negatives. The table below displays all the values, with higher values for sensitivity and specificity indicating the better model. As you can see from the numbers, this table also indicates that the six predictor model and the three predictor model with Sp.Atk are the best in our analysis. From all of our methods, we have determined that the pokemon data we have can fairly accurately predict whether or not an arbitrary pokemon is legendary. Our findings suggest that there is not much of a discrepancy between the best models, but regardless the data has been found to be quite capable of good predictions when it comes to legendary status.

Model                        Sensitivity  Specificity  FPR        FNR
6 Predictor Model            0.5238095    0.9568345    0.0431655  0.4761905
5 Predictor Model            0.4285714    0.9568345    0.0431655  0.5714286
3 Predictor Model w/ Speed   0.2857143    0.9640288    0.0359712  0.7142857
3 Predictor Model w/ Sp.Atk  0.4285714    0.9712230    0.0287770  0.5714286
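Each of the four rates in the table follows directly from the confusion counts. The Python sketch below shows the arithmetic. The raw counts are not reported in this paper; the ones used here (11 true positives, 10 false negatives, 133 true negatives, 6 false positives) are an inferred set that reproduces the six predictor row and adds up to a 160-pokemon test set (20% of 800), so they should be read as an assumption.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute sensitivity, specificity, FPR, and FNR from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of legendary pokemon caught
        "specificity": tn / (tn + fp),  # share of normal pokemon correctly cleared
        "FPR": fp / (fp + tn),          # equals 1 - specificity
        "FNR": fn / (fn + tp),          # equals 1 - sensitivity
    }


# Assumed counts, inferred from the reported rates for the six predictor model.
rates = classification_rates(tp=11, fp=6, tn=133, fn=10)
```

Under these counts the model would recover 11 of 21 legendary pokemon in the test set, so sensitivity is the weaker of the two measures even though specificity is high.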

Question 2

Our next question attempts to find out if the generations in the data are simply arbitrary groupings of randomly generated pokemon, or if there is any real correlation between the six different sets of creatures. In our initial approach to the data, we attempted to use different modeling techniques to see if we could find any sort of relationship between the pokemon according to type and attributes. After multiple attempts, we were not able to find any results that showed a significant connection between the generations. For the most part, there seemed to not be much variation at all between generations. In order to better answer our question, we decided to take a creative approach. Our goal was to assess the distribution of both the types and the pokemon attributes across the generations, and hopefully discover something of significance.

One of the first things we noticed about the pokemon generations was the number of pokemon in each generation. Starting in the first generation, the number in each generation rises and falls significantly in an alternating pattern given by the following generation totals: 166, 106, 160, 121, 165 and 82. While this was interesting, we wanted to see if the pattern continued for each specific type of pokemon as well. In order to do this, we calculated the ratio of the number of each type of pokemon in each generation to the total number of pokemon in that generation. Using this new variable, we were able to produce the figure below titled “Pokemon Type/Total Ratios Across Generation” that shows how these ratios change across the generations. First, notice that some of the 18 pokemon types have been omitted from the plot. We opted not to include Psychic, Fighting, Ground, Ice, Dragon, Dark, and Steel pokemon because the variation among generations was small and random. In our analysis, we found that the change in ratio in the first five generations was mostly random. However, the sixth generation showed a pattern that we had not seen in the earlier generations. When looking at the plot, the Water, Normal, Grass, and Bug types all make up the largest ratios over the first five generations; in the sixth generation, however, these four types make up four of the lowest ratios. At the same time, the Fire, Fairy, Ghost, and Rock types all rise from their lower ratios to above 0.1. In addition to this change, we also see the addition of a new type starting in the fifth generation, the Flying type seen at the bottom of the plot. As a result of all this, we begin to see larger changes and a split from the norm starting to occur in the sixth generation.
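The type ratio described above is a count of each type within a generation divided by that generation's total. A minimal Python sketch follows; the function name is ours, and the sample rows are just the ten Generation 1 pokemon from the table in the DATA section (four Grass, five Fire, one Water), not the full data set.

```python
from collections import Counter


def type_ratios(rows):
    """Map (generation, type) to that type's share of its generation.

    rows is an iterable of (generation, type) pairs, one per pokemon.
    """
    rows = list(rows)
    per_generation = Counter(generation for generation, _ in rows)
    per_pair = Counter(rows)
    return {
        (generation, ptype): count / per_generation[generation]
        for (generation, ptype), count in per_pair.items()
    }


# The ten rows shown in the DATA table, reduced to (Generation, Type1).
sample_rows = [(1, "Grass")] * 4 + [(1, "Fire")] * 5 + [(1, "Water")]
ratios = type_ratios(sample_rows)

# Generation totals as reported in the text; they sum to the 800 pokemon in the data.
generation_totals = [166, 106, 160, 121, 165, 82]
```

Dividing by the generation total is what makes the generations comparable, since their sizes range from 82 to 166.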

To further analyze the data and see if this pattern continued, we wanted to look specifically at the Pokemon attributes. In order to best look at the distribution of those attributes across the generations, we again formulated ratios. This time we expressed each attribute as a fraction of the total attribute points for each pokemon. Thus, if a Pokemon had an HP of 60 and a total attribute score of 300, the new ratio would be 0.2. We then took the mean of these ratios within each generation and for each type within the generation. While we were hoping for a significant result, our analysis mostly led to the conclusion that the changes in attributes were stagnant and random. Attributes among all Pokemon did not vary much at all, staying mostly between 0.15 and 0.2. When we looked further at the specific types, we noticed slightly more variation, but there was no correlation found between any of the types, nor were there many major outliers. The figure below shows animated bar plots for two relatively major pokemon types: Water and Fire. This animated bar plot displays how the attribute ratios change for each of the types as generation increases. As you can see from these two plots, the variation among the generations has no distinct pattern both between types and for each type individually. We expected more from this analysis, as we figured there would be significant changes in pokemon attributes over generations in order to keep game players interested and entertained.
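The attribute ratio and its group mean can be sketched as follows. This is an illustrative Python version with function names of our choosing. The first example pokemon is hypothetical, built so that HP is 60 out of a total of 300 as in the text; the second uses the stats from the first row of the table in the DATA section.

```python
from statistics import mean

ATTRIBUTES = ["HP", "Attack", "Defense", "Sp.Atk", "Sp.Def", "Speed"]


def attribute_shares(stats):
    """Express each attribute as a fraction of the pokemon's total attribute points."""
    total = sum(stats[name] for name in ATTRIBUTES)
    return {name: stats[name] / total for name in ATTRIBUTES}


def mean_shares(pokemon_list):
    """Average the per-pokemon shares over a group (a generation, or a type within one)."""
    shares = [attribute_shares(stats) for stats in pokemon_list]
    return {name: mean(s[name] for s in shares) for name in ATTRIBUTES}


# Hypothetical pokemon with HP 60 and a total of 300 attribute points.
example = {"HP": 60, "Attack": 50, "Defense": 50, "Sp.Atk": 50, "Sp.Def": 50, "Speed": 40}
# Stats from the first row of the DATA table (total 318).
first_row = {"HP": 45, "Attack": 49, "Defense": 49, "Sp.Atk": 65, "Sp.Def": 65, "Speed": 45}

example_shares = attribute_shares(example)
group_means = mean_shares([example, first_row])
```

With six attributes, a perfectly balanced pokemon would have every share equal to 1/6, or about 0.167, which is consistent with the observed shares clustering between 0.15 and 0.2.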

CONCLUSION

In our analysis, we attempted to address two important questions about the data. First, we wanted to assess the likelihood that we could predict whether or not a pokemon was legendary based on our data. After extensive modeling procedures, we determined that we actually could do a pretty good job. With models using all the data or limited amounts of it, we were able to make successful predictions on a test data set and provide numeric measures that indicated our success. Throughout our time working with this data, we operated with the expectation that we likely would be able to predict legendary status from the other variables. Since the legendary variable is largely dependent on the other statistics of the Pokemon, it makes sense that our models were fairly comprehensive. To take this analysis even further, one could look at trying to predict the ratings or attributes of normal pokemon that have very similar stats to legendary pokemon. While legendary pokemon are considered to be the most powerful, many normal pokemon have attributes that match or exceed those of legendary pokemon. Legendary pokemon are so highly sought after that being able to predict the stats of high-powered normal pokemon could prove to be another valuable tool for the game. We were limited largely to a logistic regression given the binary nature of the legendary variable, but a deeper dive into the strongest normal pokemon would open up other opportunities for modeling and exploration that may work even better.

After considering legendary pokemon, our group wanted to look at how the game was evolving on a more expansive scale. To explore this, we posed this question: are there significant and explainable differences between pokemon generations, or are the groupings largely arbitrary? After conducting our analysis, we came up with a few different findings. For the most part, the generations were relatively consistent. With fairly arbitrary type and attribute distributions, there was not much correlation or separation between the groupings. The thing we found most interesting was our observation that many changes occurred in the sixth generation among pokemon types. Pokemon types that had been consistently more frequent in previous generations all of a sudden became more scarce, and uncommon pokemon types appeared more often in the sixth generation. Furthermore, a new type, Flying, was added to the mix in the fifth generation and its frequency rose significantly in the sixth. It is important to consider the ramifications of these findings in the long run for pokemon players. There are always rumors and predictions for new generations of pokemon to be released, and our findings indicate that pokemon may be changing in the future. If this trend continues, the newer generations of pokemon may start to look very different from the old ones. It will be difficult to predict exactly how they will look, but modeling techniques may become very useful if new generations are released or more data is made available.

At the end of the day, we need to look at how our questions and the answers we discovered are impactful. Although we are exploring data for a video game, it is a video game that is known and loved worldwide. People dedicate a lot of their lives to playing, exploring, and studying these games. Especially for pokemon enthusiasts, playing and succeeding in the game takes up a lot of time and effort. While it may seem insignificant to some, our findings in this data may make a big difference to others. In the end, the focus of our process was always on the future. We aimed to conduct our research in a way that would allow us to predict what a new set of pokemon may look like before they are even released. In the eyes of a game player, this information is extremely important as it would allow players to prepare for new games and gain advantages much more quickly than they would without access to the data. For example, say that when a new generation of pokemon is released, our predictive models for legendary status are run on this new generation. One very important piece to look at would be the false positives mentioned earlier in the paper. If our models predict that a pokemon is legendary when it actually is not, we then know that this pokemon is likely of similar power to a legendary pokemon. While most players are on the hunt strictly for the legendary creatures, it may be more beneficial in the course of the game to acquire the more common but equally potent normal pokemon. Instances like these are what make our analysis matter in the long run for those interested in the game. Whether it is pokemon or crime statistics, data can provide us with valuable information to make better decisions and improve our circumstances down the road.