INTRODUCTION

Migration has been part of America since the 1600s, when the first European settlers arrived in what is now the United States. As growing numbers of people from around the world arrived, the United States began to restrict the entry of newcomers. In 1882, the U.S. Congress passed an Immigration Act that levied a 50-cent tax on all aliens, which marked the beginning of legal barriers to future immigrants’ journeys. In recent years, the issue has stirred much debate. Because the conditions for legal entry are strict and competitive, many people seeking a better living environment choose to enter the U.S. by hazardous routes. While much of the general public tends to feel that illegal immigrants chose the easy path into the country in order to take advantage of its resources, more scholars are documenting the extremely high risks and hardships that immigrants endure before they arrive, and the fact that many lose their lives in their attempts to cross borders. According to The New York Times, in October 2013 around 368 migrants died in two shipwrecks near the Italian island of Lampedusa. This tragedy drew wide public attention and prompted the International Organization for Migration to start the Missing Migrants Project.

The Missing Migrants Project tracks the deaths of migrants, including refugees and asylum-seekers. The purpose of the dataset is to reveal the dangers of migration journeys in order to advocate for a smoother legal path for migration. Our group decided to examine primarily the relationships between different variables and the total dead/missing. Our first hypothesis is that the total dead/missing will be geographically concentrated, since people who intend to migrate are exposed to popular recommendations about migration routes. Our second hypothesis is that the total dead/missing should be predictable, because each variable should play a role in the result. The first question examines whether we can see a geographical concentration of total dead/missing, using heat maps to visualize where these deaths occurred. With this information, an owner of the data could anticipate where deaths are likely to occur and possibly use the information to improve safety around risky areas. Second, we used models to investigate whether we can predict the total dead/missing as a function of variables such as the percentage of females, cause of death, etc. This would illuminate specific factors that could contribute to the total dead/missing. Our group found it crucial to present the reality of migration in order to arouse sympathy and advocate for humanity among the public.

DATA

Our data came from the Missing Migrants Project, a joint venture between the International Organization for Migration’s Global Migration Data Analysis Centre and the Media and Communications Division. The project uses various sources, including medical reports, surveys, and field missions, to gather data on migrants around the world who died or went missing along their journeys between 2014 and 2019. The data is limited to incidents that occur during migration and does not include deaths that occur at detention or housing facilities. The dataset has 5,333 rows in total, and each row represents an incident that resulted in dead or missing migrants. An incident could be a vehicular accident, a shipwreck, a drowning, etc. Each incident/row has a unique Web ID, which can be considered the primary key of the dataset. The region of incident variable classifies the region where the incident occurred; these regions correspond approximately to major geographical regions of continents. Our first question specifically investigates the ‘US-Mexico Border’ region. Month refers to the month in which the incident occurred. The number dead variable counts the people confirmed dead, and the number missing variable counts the people presumed dead because they are missing, mostly recorded in shipwrecks. We mainly look at the total dead and missing variable, which is the sum of number dead and number missing. Number of females, number of males, and number of children refer to the number of people in each category found dead or missing, and do not necessarily add up to the total dead and missing. Cause of death identifies the reason for the death, whether dehydration, drowning, unknown, or any combination of these. The source quality variable ranks the quality of the sources from which information about the incident was obtained, on a scale from 1 to 5, with 1 being the worst and 5 the best. An incident with a source quality of 1 comes from only one source, while a source quality of 5 reflects a combination of reputable sources.

Region              Incidents   Total Dead/Missing   Mean Deaths/Incident   Std. Dev.   Max Deaths/Incident   Min Deaths/Incident
US-Mexico Border    1337        1964                 1.468960               5.846525    149                   1
North Africa        1239        4027                 3.250202               5.000681    57                    0
Mediterranean       984         18229                18.525407              56.323161   1022                  0
Sub-Saharan Africa  475         1549                 3.261053               12.329481   251                   1
Central America     309         619                  2.003236               5.280138    64                    0
Europe              249         442                  1.775100               5.159319    71                    1
Horn of Africa      235         1152                 4.902128               11.047053   70                    1
Middle East         164         406                  2.475610               4.066097    38                    1
South Asia          151         286                  1.894040               3.031286    23                    1
Southeast Asia      96          2203                 22.947917              81.578683   750                   1
Caribbean           59          499                  8.457627               12.171649   68                    1
South America       28          92                   3.285714               5.097982    24                    1
East Asia           5           31                   6.200000               5.761944    15                    1
Central Asia        1           52                   52.000000              NA          52                    52
North America       1           1                    1.000000               NA          1                     1

This table groups the incidents by region of occurrence and displays the number of incidents, the total number of dead/missing, the average number of dead/missing per incident, the standard deviation, and the highest and lowest number of deaths in a single incident. The US-Mexico Border had the highest number of incidents, which is partly why it is the focus of our first follow-up question. East Asia, Central Asia, and North America each had 5 or fewer incidents, making them less relevant for analysis. The Mediterranean had 18,229 total deaths, far more than any other region, indicating that some significant incidents may have occurred there. Though Southeast Asia had only the 10th highest number of incidents, it had a much higher total death count, average, standard deviation, and maximum than most other regions, suggesting that conditions in this region may be more precarious due to poor infrastructure or geography.
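The grouping behind a table like this can be sketched in a few lines. This is a minimal Python illustration rather than the code used for the analysis; the function name and the input format (pairs of region and total dead/missing) are assumptions, and the sample rows are made up.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_region(incidents):
    """Group (region, total_dead_missing) pairs and summarize each region."""
    groups = defaultdict(list)
    for region, total in incidents:
        groups[region].append(total)

    summary = {}
    for region, totals in groups.items():
        summary[region] = {
            "incidents": len(totals),
            "total": sum(totals),
            "mean": mean(totals),
            # The sample standard deviation is undefined for a single
            # incident, which is why such regions show NA in the table.
            "sd": stdev(totals) if len(totals) > 1 else None,
            "max": max(totals),
            "min": min(totals),
        }
    return summary

# Made-up rows for illustration only (not taken from the dataset).
rows = [("Region A", 1), ("Region A", 3), ("Region A", 2), ("Region B", 52)]
summary = summarize_by_region(rows)
```

A region with a single incident, like "Region B" here, gets no standard deviation, matching the NA entries for Central Asia and North America.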

This figure illustrates the relationship between the number of incidents and total deaths in each region. Most regions with fewer than 500 incidents have fewer than approximately 1,500 total deaths. The exceptions are Southeast Asia (96 incidents, 2,203 deaths) and Sub-Saharan Africa (475 incidents, 1,549 deaths). The US-Mexico Border has about the same number of deaths as Southeast Asia but more than ten times as many incidents, which shows that deaths in Southeast Asia are concentrated in far fewer incidents. North Africa, on the other hand, has about the same number of incidents as the US-Mexico Border but roughly twice as many deaths (4,027 versus 1,964), so its deaths are also more concentrated per incident, though much less so than in Southeast Asia. It is important to note that the Mediterranean was excluded from this figure because its roughly 18,000 deaths make it an outlier that would distort the scale for the rest of the bars.

Our second follow-up question required manipulation of some variables. We created new variables childrenper, femalesper, and malesper, which take the respective number of children, females, or males and divide it by the total dead and missing. We did this because the raw counts scale with the size of the incident, so representing them as proportions of the total is more meaningful. A binary variable, causebinary, is used as one of our model predictors: it is assigned a value of 1 if the cause of death is “unknown” or “unknown (skeletal remains)”, and 0 if the cause is anything else. The regions were also sorted into a new_region variable based on their approximate continental location; for example, “US-Mexico Border” and “North America” were grouped into new_region = “North America”.
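The derived variables can be illustrated with a short Python sketch. The names childrenper, femalesper, malesper, causebinary, and new_region come from the text; the input field names, the exact spelling of the cause labels, the handling of missing values, and the helper functions are assumptions made for illustration.

```python
# Assumed spellings of the two "unknown" cause labels (compared in lowercase).
UNKNOWN_CAUSES = {"unknown", "unknown (skeletal remains)"}

# Only the mapping stated in the text; other regions would need their own entries.
REGION_TO_NEW_REGION = {
    "US-Mexico Border": "North America",
    "North America": "North America",
}

def share(count, total):
    """Return count / total, or None when a value is missing or total is 0."""
    if count is None or not total:
        return None
    return count / total

def derive(row):
    """Return a copy of one incident record (a dict) with derived variables added."""
    total = row.get("totaldeadmissing")
    out = dict(row)
    out["childrenper"] = share(row.get("children"), total)
    out["femalesper"] = share(row.get("females"), total)
    out["malesper"] = share(row.get("males"), total)
    # Assumption: a missing cause is treated as "anything else" (0).
    cause = (row.get("cause") or "").strip().lower()
    out["causebinary"] = 1 if cause in UNKNOWN_CAUSES else 0
    out["new_region"] = REGION_TO_NEW_REGION.get(row.get("region"))
    return out

# Made-up record for illustration only.
example = derive({
    "region": "US-Mexico Border",
    "totaldeadmissing": 4,
    "children": 1,
    "females": 2,
    "males": None,
    "cause": "Unknown (skeletal remains)",
})
```

Keeping missing counts as None rather than 0 preserves the distinction between "no children among the dead" and "number of children not reported".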

With respect to the completeness of our dataset, we found that certain variables contained numerous missing values. Much of this incompleteness is attributable to a lack of information in the original source material for each observation. For example, the variables children, females, and males contained many missing values. These variables were used extensively in our second question, so we chose to preserve the validity of the data used in the regression by not imputing them. This choice did, however, have a cost: because observations with any missing value cannot be used in a multiple linear regression, we lost a large portion of the observations. Of the original 5,333 observations in the dataset, we allocated 3,519 to our training set for validation. Of these 3,519 training observations, only 218 could be used in our regression because of missing values in our variables of interest.
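The effect of dropping incomplete observations can be sketched as follows. This is an illustrative Python example under assumed field names, a 70% training fraction, and made-up records; it is not the procedure or the data used in the analysis.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def complete_cases(rows, columns):
    """Keep only rows where every listed column has a value (no imputation)."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]

# Made-up records; None marks a missing value.
records = [
    {"children": i if i % 3 == 0 else None, "females": 1, "males": 1}
    for i in range(10)
]
columns = ["children", "females", "males"]

train, test = train_test_split(records)
usable_train = complete_cases(train, columns)
```

Because the filter is applied after the split, the number of usable training rows depends on which records happen to land in the training set.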

RESULTS

Question 1: Incidences and Deaths along the US-Mexico Border

In our EDA, we first explored where migrant-related deaths occur along the US-Mexico Border. We plotted the observations on a map of the border pulled from Stamen Maps and overlaid a heatmap on the geographic map. This highlighted the areas of concentrated frequency and showed that most migrant-related incidents occurred in the eastern and western parts of the border.

After discovering hotspots of incidents in the eastern and western parts of the border, we were led to two further questions: first, whether hotspots of particularly high total dead and missing per incident showed a noticeable pattern relative to the frequency distribution of incidents, and second, why such concentrations of incidents existed in the first place. To answer the first question, we composed a heatmap illustrating the distribution of total dead and missing across the same geographic area.

Total Dead/Missing per Incident

To our surprise, this heatmap indicated that hotspots of particularly high total dead or missing per incident had no relationship to the frequency distribution of incidents. In fact, when we zoomed in on the plot, we found that most incidents in the high-frequency areas involved only one death. So although certain areas had a high frequency of incidents, this did not translate into a high number of total dead and missing per incident.
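The distinction between the two heatmaps can be made concrete with a simple grid-binning sketch in Python. This is a simplified stand-in for a map-based heatmap, not the method used for the figures; the cell size, the input format, and the coordinates below are assumptions made for illustration.

```python
import math
from collections import defaultdict

def bin_incidents(points, cell_size=0.5):
    """Bin (lat, lon, total_dead_missing) points into square grid cells.

    Returns, for each cell, the number of incidents and the average
    total dead/missing per incident.
    """
    cells = defaultdict(lambda: [0, 0])  # cell -> [incident count, death sum]
    for lat, lon, total in points:
        key = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        cells[key][0] += 1
        cells[key][1] += total
    return {
        key: {"incidents": n, "deaths_per_incident": deaths / n}
        for key, (n, deaths) in cells.items()
    }

# Made-up points: three single-death incidents close together,
# and one large incident elsewhere.
points = [
    (32.1, -110.9, 1),
    (32.2, -110.8, 1),
    (32.3, -110.7, 1),
    (29.5, -100.2, 20),
]
grid = bin_incidents(points)
busiest = max(grid.values(), key=lambda c: c["incidents"])
deadliest = max(grid.values(), key=lambda c: c["deaths_per_incident"])
```

In this toy example the busiest cell averages one death per incident, while the deadliest cell contains a single incident, which mirrors the pattern described above.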

In order to answer the second question we composed further plots illustrating the incidents in each particular region of the border.

Question 2: Predictors of Total Dead/Missing

In our second question, we wanted to fit a model to predict the total dead and missing in our dataset. We fit multiple linear regression models to capture this relationship. 70% of the data was randomly assigned to the training set and the rest to the test set for validation. The first model we constructed predicted totaldeadmissing as a function of childrenper and femalesper, but its R-squared value was only 0.18. In our next model, we predicted log(totaldeadmissing) using the same variables and also added causebinary. The R-squared increased to 0.70, so we decided to keep the logarithmic response (though R-squared values for the raw and log responses are not strictly comparable, since they measure variance explained on different scales). For model 3, we augmented the previous model with factor versions of source quality, new_region, and month, along with interactions between some of these variables. The resulting R-squared was 0.81, our highest yet. For our last model, “factor(sourcequality)” was replaced with sourcequality + childrenper*factor(sourcequality), still yielding an R-squared value of 0.81. The terms of this model then served as the composite predictor pool for our model selection process.

The model selection process we used was the best subsets method, which determines which subsets of variables from our final predictor pool produce the best model according to the Mallows’s Cp statistic. The function used to complete this process prints a table containing the best model from the predictor pool for each number of variables to be included. From this table we compared three candidate models: the two with the highest adjusted R-squared values and the one with the lowest Mallows’s Cp. The model with the lowest Mallows’s Cp had an adjusted R-squared value of 0.8087, while the other two had adjusted R-squared values of 0.8131 and 0.8161. We then compared the root mean squared error (RMSE) of each model on the training and test sets. The RMSEs of the lowest-Cp model were 46.87 on the training set and 29.29 on the test set. The model with the highest adjusted R-squared had RMSEs of 46.86 on the training set and 29.28 on the test set, and the model with the second-highest adjusted R-squared had the same values. Given these statistics, we chose the model with the lowest Mallows’s Cp, as it had fewer predictors and produced results very comparable to those of the other models.
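For reference, the three statistics used in this comparison can be written as small Python functions. These follow the standard textbook definitions, with p counting predictors excluding the intercept; the function names and the numbers in the example are illustrative assumptions, not values from our models.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

def mallows_cp(rss_subset, p_subset, rss_full, p_full, n):
    """Mallows's Cp of a subset model, relative to the full predictor pool.

    The error variance is estimated from the full model's residual
    sum of squares (rss_full).
    """
    sigma2 = rss_full / (n - p_full - 1)
    return rss_subset / sigma2 - n + 2 * (p_subset + 1)

# Illustrative numbers only.
example_rmse = rmse([1, 2, 3], [1, 2, 5])
example_adj = adjusted_r_squared(0.5, n=11, p=2)
example_cp_full = mallows_cp(40.0, 3, 40.0, 3, n=24)
```

By construction, the full model's Cp equals its number of parameters (p + 1), so a smaller model is attractive when its Cp is close to or below its own parameter count.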

Once we had determined our best predictive model, we set out to evaluate it further. We plotted the model’s predicted values of log(totaldeadmissing) for the test set against the actual values, obtained by taking the natural log of the totaldeadmissing variable in the test set, and found a moderately accurate fit. We did, however, notice some predictive issues near the extremes of the variable.

To further bolster our analysis, we constructed a normal quantile-quantile plot of the model’s residuals. The plot shows that the residuals follow a roughly normal distribution, apart from departures at the upper extreme quantiles. This is in line with our previous findings, as it reflects the right-skewed nature of the variable’s distribution.
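The points of a normal quantile-quantile plot can be computed directly, as in the Python sketch below. The plotting positions (i - 0.5) / n are one common convention and an assumption here; the residuals are made up to show how a long right tail appears.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot.

    Residuals are standardized, sorted, and paired with standard normal
    quantiles at the plotting positions (i - 0.5) / n.
    """
    n = len(residuals)
    mu, sd = mean(residuals), stdev(residuals)
    sample = sorted((r - mu) / sd for r in residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Made-up residuals with one large positive value (right skew).
qq = normal_qq_points([-1.0, 0.0, 1.0, 10.0])
```

If the residuals were normal, the pairs would fall near the line y = x; here the largest sample quantile sits above its theoretical counterpart, which is how right skew shows up at the upper end of the plot.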

CONCLUSION

In our first question, we looked at the distribution of total dead/missing using heat maps and investigated whether the concentration of incidents had a relationship with the number of dead/missing individuals. We found that no such relationship exists; in fact, the majority of incidents in highly concentrated areas involved only 1 dead/missing individual. We did not expect this result, as we thought that areas with a high concentration of incidents would have high numbers of total dead/missing per incident, but the data illustrate that this is not the case. We also sought to understand why certain towns and cities had a high number of reported incidents, and found that reporting bias, elevation, and proximity to legal border crossings may have played a role in the number of migration incidents. For example, certain regions may be so uninhabited that migrant-related deaths simply go unreported. This may indicate that certain zones along the US-Mexico border are simply going unwatched. In addition, the fact that the majority of incidents occurred close to exit ports suggests that illegal migration is occurring right next to the legal border zones, which could indicate that these areas need to be watched more heavily.

In our second question, we looked to determine how different variables help to make predictions about total dead/missing. We found that our most successful model predicted log(totaldeadmissing) as a function of childrenper + femalesper + factor(sourcequality) + factor(new_region) + factor(month) + childrenper*femalesper + childrenper*factor(sourcequality), with an adjusted R-squared of 0.8087 and an RMSE of 46.87 on the training set and 29.29 on the test set. Our model was moderately successful, but would have fared better if we had a more complete dataset with fewer missing and misleading values. In the future, a response variable with more variation should be used because, in our case, the total dead/missing was almost always equal to 1, which means a model could simply guess a value of 1 and usually be correct. Since we have geographic coordinates for each incident, and we observed a pattern between climate and migration in Question 1, we could combine our dataset with another dataset that contains information about weather and climate and use those variables as predictors.

Overall, migration has long been a subject of debate in American society. Our data analysis unveils the reality of migration and provides an opportunity for the public to form fair judgements about, and empathy toward, migrants.