INTRODUCTION

Sharing an interest in healthcare and health insurance, Group 11 chose to work with a collection of data published by the Centers for Disease Control and Prevention (CDC). In an era of continual political polarization and turmoil in the United States, access to healthcare remains an ongoing, important issue. The 2007-2008 housing market collapse and the subsequent worldwide economic recession also marked a point of severe financial strain for Americans. Given concerns that bubble-and-burst cycles are to be expected in America’s economy, understanding this aspect of the past could bring to light the implications of new economic difficulties in the future. Group 11 therefore delved deeper into the health outcomes and health insurance related factors contained in this collection of CDC datasets.

The first question Group 11 posed about the CDC data is “Can income, employment status, and vaccination status predict whether individuals have health insurance?” It is generally thought that those with hefty salaries have no issues with access to health insurance, while a lack of health coverage can be devastating to individuals who must face large medical bills for themselves or for loved ones. Group 11 aimed to see whether this presumed link between income and coverage holds true in the data. Access to health insurance is an ongoing socioeconomic issue within the United States and has had large implications for policy-based decisions. For example, the Affordable Care Act of 2010 strives to provide health coverage to those who may not have access due to their economic standing. Group 11 looked to explore whether factors such as income level, vaccination status, and employment status can predict health insurance coverage status, and how modeling results could inform the necessity of low-income health insurance coverage plans.

The second question Group 11 explored is “Can insurance coverage and vaccination status predict the proportions of self-rated general health responses?” This question examines whether individuals who have insurance and are vaccinated report higher self-rated well-being and general health. If so, the finding would support the necessity for individuals to have access to vaccinations and healthcare, and it could further support the political initiatives established by the Affordable Care Act.

Various institutions and organizations could find the implications of these questions important to their own decision making and policy shaping. The CDC, the owner of the data, is a government agency that strives to protect the American people’s public health through the prevention of disease, injury, and disability. Both of Group 11’s questions were intended to produce tangible results that could motivate action to better protect Americans through the implementation of health-related policies. Furthermore, decisions within the US domestic healthcare spectrum may influence adjacent decisions made on an international scale.

DATA

Group 11 chose to analyze data from the Behavioral Risk Factor Surveillance System (BRFSS), a system of surveys conducted by the CDC. The data was collected and published on an annual basis; Group 11 analyzed data from 2001-2010 to observe differences between pre- and post-recession modeling. The data contains more than 400,000 observations each year and provides insight into health risk factors and outcomes. According to the CDC’s website, BRFSS is the “largest continuously conducted health survey system in the world.” Group 11 chose to analyze this data due to its magnitude and the modern relevance of healthcare-related concerns within policy decision making.

The data was collected by state health departments through phone surveys using standardized questionnaires. The sampling process involved random selection of phone numbers from each area code. Each selected phone number could be called up to 15 times in an attempt to receive a response. Once an eligible household was contacted, interviewers conducted the survey based on a set of predetermined questions. Standard core questions were asked each year and were included in each state’s questionnaire. A separate set of questions was asked every other year. Additionally, states could include optional modules or add their own questions to the survey.

Income	n
<$10k	174895
<$15k	185045
<$20k	241882
<$25k	297822
<$35k	397294
<$50k	493485
<$75k	496565
More than $75k	705077
Don’t Know/Not Sure	219491
Refused	256300

Employment Status	n
Employed for wages	1589197
Self-employed	303628
Out of work for >1 year	68715
Out of work for <1 year	89490
Homemaker	274988
Student	71389
Retired	847465
Unable to work	212716
Refused	10902

Insurance Coverage?	n
Yes	3058404
No	402443
Don’t Know/Not Sure	5761
Refused	3290

The “INCOME2” variable represents annual household income from all sources. Recorded values between 1-8 reflect the different income brackets shown in the table above (from lowest to highest), and Group 11 re-coded these responses as factors to aid in data visualization. An individual’s employment status is found in the “EMPLOY” variable, and its response categories can be seen in the table above. Group 11 also re-coded these values from recorded numbers to the employment status that each response represents. The “HLTHPLAN” variable reflects whether an individual has any type of healthcare coverage. The question asked in the survey specifies that the coverage can include health insurance, prepaid plans, or government plans like Medicare. An observed value of 1 represents an individual who responded “YES” to having coverage, while a value of 2 represents an individual who responded “NO.” Other responses were filtered out within the modeling process for the sake of clarity.
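The re-coding and filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Group 11’s actual code: the function name `recode_and_filter` and the sample respondents are hypothetical, and only the 1 = “YES” / 2 = “NO” mapping comes from the description of “HLTHPLAN.”

```python
# Labels for HLTHPLAN as described in the text: 1 = YES, 2 = NO.
# Any other recorded code is treated as a non-answer and dropped.
HLTHPLAN_LABELS = {1: "YES", 2: "NO"}


def recode_and_filter(records, variable, labels):
    """Keep records whose code for `variable` appears in `labels`,
    replacing the numeric code with its text label."""
    kept = []
    for record in records:
        code = record.get(variable)
        if code in labels:
            recoded = dict(record)  # copy so the input is not modified
            recoded[variable] = labels[code]
            kept.append(recoded)
    return kept


# Hypothetical respondents, invented purely for illustration.
sample = [
    {"id": 1, "HLTHPLAN": 1},
    {"id": 2, "HLTHPLAN": 2},
    {"id": 3, "HLTHPLAN": 9},
    {"id": 4, "HLTHPLAN": 7},
]

cleaned = recode_and_filter(sample, "HLTHPLAN", HLTHPLAN_LABELS)
print(cleaned)
```

The same helper could be reused for any variable by passing a different label dictionary, which keeps the code-to-label mapping in one visible place.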

Pneumonia Shot?	n
Yes	975096
No	2237507
Don’t Know/Not Sure	224513
Refused	5030

Self-Rated General Health	n
Excellent	660207
Very Good	1107639
Good	1042752
Fair	450411
Poor	194977
Don’t Know/Not Sure	7710
Refused	6182

Flu Shot?	n
Yes	1411659
No	2017598
Don’t Know/Not Sure	11504
Refused	4029

The pneumonia shot variable was renamed to “PNEUVAC3” to be consistent throughout all 10 years; the question asked each year was whether the individual had ever received a pneumonia vaccine. The 1 values were re-coded to reflect a response of “YES,” the 2 values were re-coded to reflect a response of “NO,” and the 7 and 9 values were re-coded as “DON’T KNOW/NOT SURE” and “REFUSED” respectively. This coding is consistent with how the other variables were recorded in the data codebook. The “FLUSHOT” variable was named differently in different years, so Group 11 renamed it to be consistent across all years. The question asked each year is the same: the individual is asked if they have received a flu vaccine within the past 12 months. A value of 1 represents a response of “YES,” while a value of 2 represents a response of “NO.” As with other variables, Group 11 filtered out other responses (such as those who refused to answer) for the modeling process. The “GENHEALTH” variable represents an individual’s self-rated health. The respondent is asked whether their general health is Excellent (value of 1), Very Good (value of 2), Good (value of 3), Fair (value of 4), or Poor (value of 5). Individuals who responded “DON’T KNOW/NOT SURE” were recorded as a value of 7, and individuals who refused were recorded as a value of 9. Group 11 chose to analyze responses between 1-5 for modeling.

The figure above provides a visualization for two key variables - flu vaccination status and self-rated health. The purpose of the figure is to provide a visualization of the distribution of the two variables; in general, most of the responses are either “Very Good” or “Good.” Additionally, the figure reflects how flu vaccination status varies within each response class. Group 11’s modeling will continue to explore these differences and attempt to provide predictive value.
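The observation that most responses are “Very Good” or “Good” can be checked directly against the Self-Rated General Health table. The sketch below uses the counts from that table, restricted to the five substantive responses (values 1-5); the helper name `proportions` is an assumption introduced for illustration.

```python
# Counts copied from the Self-Rated General Health table,
# excluding "Don't Know/Not Sure" and "Refused".
GENERAL_HEALTH_COUNTS = {
    "Excellent": 660207,
    "Very Good": 1107639,
    "Good": 1042752,
    "Fair": 450411,
    "Poor": 194977,
}


def proportions(counts):
    """Convert raw counts into shares of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = proportions(GENERAL_HEALTH_COUNTS)
for label, share in shares.items():
    print(f"{label:<10} {share:.3f}")

# The two most common responses, in descending order of share.
top_two = sorted(shares, key=shares.get, reverse=True)[:2]
print(top_two)
```

Together, the two most common classes account for a little over 60% of the valid responses in the pooled table.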

RESULTS

To explore how well income, employment, and vaccination status predict health insurance status, Group 11 first chose logistic regression, which models the log-odds of the outcome. This approach was chosen because the response is a binary variable (a person who has a health plan versus a person who does not). Original attempts to construct a classification model using k-Nearest Neighbors failed because of the limited number of variables available, which resulted in numerous ties in predictions. While k-Nearest Neighbors does not rely on distributional assumptions, the dominance of binary variables in the dataset still produced ties, even when the number of binary variables was cut down in different iterations of the model. The data was then split from the original 2001-2010 range into 2001-2006 and 2007-2010 partitions, with the 2007/2008 recession as the dividing point. This allowed for comparison between the models themselves and how they differed in predicting the dynamics of insurance amidst economic instability. After the glm model was constructed, the following analytic steps were taken. First, ANOVA was used to verify the significance of the variables within the model, as well as any variability that would impair the model. Next, an ROC curve, which plots the true positive rate against the false positive rate, was used to assess overall discrimination. The ROC curve measures how well a model can distinguish between two groups, in this case those who have insurance and those who do not. The area under the ROC curve was about 0.80. Finally, k-fold cross-validation was implemented on out-of-sample data and a confusion matrix was created to see how well the model predicted insurance.
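Because the area under the ROC curve is easy to confuse with classification accuracy, a small sketch may help separate the two. The code below computes the area as the fraction of (insured, uninsured) pairs in which the insured respondent receives the higher predicted score, and compares it with accuracy at a 0.5 cutoff. The labels and scores are toy values invented for illustration, not model output from this paper, and the function names are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison:
    the share of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, cutoff=0.5):
    """Share of cases classified correctly at a fixed cutoff."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy data (hypothetical): 1 = has a health plan, 0 = does not.
toy_labels = [1, 1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2]

toy_auc = auc(toy_labels, toy_scores)
toy_accuracy = accuracy(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}, accuracy at 0.5 = {toy_accuracy:.3f}")
```

The two numbers differ on the same data because the area summarizes ranking quality across all cutoffs, while accuracy depends on one chosen cutoff and on the class balance.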

Pre-Recession Confusion Matrix

	0	1
0	1894	1738
1	12841	103983

Post-Recession Confusion Matrix

	0	1
0	1184	955
1	7176	69969

After the construction of the best model, Group 11 found that the main limitation of the model came from the original data source. The data was heavily skewed in regard to insurance responses, with about 85% of respondents having a health plan. While this is not at all unexpected, it does limit the number of uninsured cases available to the model. Thus, the confusion matrices featured strong specificity (.89 for the pre-recession model and .90 for the post-recession model), but considerably weaker sensitivity (.53 for both). Despite stratifying the samples used for the test and train sets, the model did not demonstrate improved sensitivity. This unfortunately hinders what could have been the most valuable use of this model. Even though the Chi-squared test used for the analysis of variance for the model indicated significance, this evidently only extends to true positives. Interestingly, being a student had no significant effect on whether an individual had health insurance. An evaluation of the coefficients in the plot shows this dominance in the data. Income is evidently the strongest predictor, with less variation at higher income brackets. Those with both the pneumonia and flu vaccines were more consistently predicted correctly. Using the pneumonia vaccine as a descriptive proxy for older individuals did not reveal an isolated impact: both vaccines seemingly had the same effect, even though the pneumococcal vaccine is intended for older individuals. On the other hand, those with neither vaccine had the greatest variation, with much more uncertainty in the odds calculated by the model. One unexpected trend was that those who were unable to work tended to have insurance and were well predicted as such, which suggests that other parameters may be interacting to make this possible. Why this group was better predicted than those who were employed remains unexplained within the current scope.
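The per-class rates behind a confusion matrix can be recomputed from the counts shown above. The sketch below makes one loud assumption: the matrices do not label which axis holds actual and which holds predicted values, so the code simply reports, for each row, the share of that row falling on the diagonal. If rows are actual classes these are per-class recall values; if rows are predicted classes they are predictive values instead. The function names are hypothetical.

```python
# Counts copied from the two confusion matrices above.
# Axis meaning (actual vs. predicted) is NOT stated in the source.
PRE_RECESSION = [[1894, 1738], [12841, 103983]]
POST_RECESSION = [[1184, 955], [7176, 69969]]


def row_rates(matrix):
    """For each row, the share of that row's cases on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Share of all cases that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for name, matrix in [("pre", PRE_RECESSION), ("post", POST_RECESSION)]:
    rates = row_rates(matrix)
    print(
        f"{name}: row 0 rate = {rates[0]:.3f}, "
        f"row 1 rate = {rates[1]:.3f}, "
        f"accuracy = {overall_accuracy(matrix):.3f}"
    )
```

Under this row-wise reading the rates come out close to the values quoted in the text, and the gap between the two rows makes the effect of the class imbalance easy to see: overall accuracy is driven almost entirely by the much larger insured group.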

Despite these setbacks, Group 11 compared the models prepared for the pre-recession and post-recession data. The pre-recession model was used to predict on post-recession data, which yielded surprising results. Both models predicted with nearly identical accuracy, indicating that the model trained on pre-recession data was neither worse nor better at predicting post-recession trends than a model trained on post-recession data. This suggests either that the model was not complex enough to identify differences in conditions between the two time periods, or that any effects the recession may have had on insurance access involve variables beyond what is available in the BRFSS dataset. The lack of a strong effect for the flu shot variable, along with the connection of the pneumonia vaccine to older age groups, may also indicate that these variables are not especially effective predictors despite the prior ANOVA evaluation.

For Question 2, Group 11 used a proportional odds logistic regression model because it is well suited for predicting an ordered multi-class response variable. The model assumes that the dependent variable, an individual’s general health, cannot be perfectly predicted from the independent variables (whether or not a person has insurance or has been vaccinated). Instead, the model provides a probability for each response class for a given predictor case. The predictors included both the flu shot and the healthcare variables, so the predictor cases consisted of the combinations of possible outcomes of the two variables (such as having both healthcare and having received a flu shot). Before setting up the model, Group 11 filtered the data to include only YES/NO responses, since refusals, unsure responses, and missing responses could not be factored into the modeling. After using the model to produce distributions over the response classes for the 4 predictor cases using train data, Group 11 compared each distribution to the true observed ratios using test data. The following figures illustrate the proportions of self-rated health responses by insurance coverage and flu vaccination status.
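To make concrete how a proportional odds model turns two binary predictors into a probability for each of the five health classes, here is a minimal sketch. It uses one common parameterization, in which the cumulative probability of being at or below class j is logistic(threshold_j - linear predictor). All thresholds and coefficients below are hypothetical placeholders chosen for illustration; they are not the fitted values from Group 11’s model.

```python
import math

CLASSES = ["Excellent", "Very Good", "Good", "Fair", "Poor"]

# Hypothetical parameters, NOT fitted values from the paper.
THRESHOLDS = [-1.5, 0.0, 1.4, 2.8]  # must be increasing
COEFFICIENTS = {"flu_shot": 0.2, "health_plan": -0.4}


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_probabilities(flu_shot, health_plan):
    """Probability of each ordered class for one predictor case.

    Cumulative probabilities P(Y <= j) share the same coefficients
    across classes (the proportional odds assumption); class
    probabilities are differences of adjacent cumulative values.
    """
    eta = (COEFFICIENTS["flu_shot"] * flu_shot
           + COEFFICIENTS["health_plan"] * health_plan)
    cumulative = [logistic(t - eta) for t in THRESHOLDS] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# The four predictor cases: every combination of the two variables.
cases = [(1, 1), (1, 0), (0, 1), (0, 0)]
distributions = {case: class_probabilities(*case) for case in cases}

for (flu, plan), probs in distributions.items():
    row = ", ".join(f"{c}: {p:.3f}" for c, p in zip(CLASSES, probs))
    print(f"flu_shot={flu}, health_plan={plan} -> {row}")
```

Each predictor case yields a full distribution over the five classes, which is what gets compared against the observed proportions in the test data.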

Residuals

Shot	Plan	Excellent	Very Good	Good	Fair	Poor
Given Flu Shot	Have Health Plan	0.001347060	0.004643182	-0.013907301	0.0005459819	0.007371078
Given Flu Shot	No Health Plan	0.025849017	-0.008427303	-0.002136959	-0.0082628654	-0.007021890
Not Given Flu Shot	Have Health Plan	-0.001835404	0.001868956	0.005921270	-0.0030665992	-0.002888222
Not Given Flu Shot	No Health Plan	0.008378982	-0.030162439	0.027414591	0.0090729097	-0.014704044

The bars in the figures represent the actual vs. predicted responses of how individuals described their overall health. Overall, the model predicted ratios of self-rated health responses very well for each predictor case. Interestingly, a higher ratio of respondents with health insurance who did not receive a flu shot believe their overall health is “Excellent” when compared to individuals with coverage who did receive a flu shot. A similar trend was observed within responses for individuals without coverage; the trend can be seen in both actual and predicted values. However, one of the largest residuals was observed within the no coverage/received flu shot predictor class - a higher proportion of actual responses were “Excellent” compared to the predicted proportion, while lower actual proportions were seen within the “Very Good” and “Fair” responses. Within the no coverage/no flu shot predictor class, larger residuals can be seen in the “Very Good” and “Good” responses; the proportion for “Very Good” was overestimated while the proportion for “Good” was underestimated. A key takeaway from the modeling is that individuals who received a flu shot in the past 12 months have worse self-rated health, for both coverage and no coverage classes.
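The residual table can be summarized programmatically. The sketch below copies the values from the Residuals table and assumes each residual is the actual proportion minus the predicted proportion, which matches the over- and underestimation described above. Since both the actual and predicted values are distributions over the same five classes, each row of residuals should sum to roughly zero; the code checks this and reports the largest residual in each predictor case. The helper name `largest_residual` is an assumption.

```python
CLASSES = ["Excellent", "Very Good", "Good", "Fair", "Poor"]

# Values copied from the Residuals table (assumed actual - predicted).
RESIDUALS = {
    ("Given Flu Shot", "Have Health Plan"):
        [0.001347060, 0.004643182, -0.013907301, 0.0005459819, 0.007371078],
    ("Given Flu Shot", "No Health Plan"):
        [0.025849017, -0.008427303, -0.002136959, -0.0082628654, -0.007021890],
    ("Not Given Flu Shot", "Have Health Plan"):
        [-0.001835404, 0.001868956, 0.005921270, -0.0030665992, -0.002888222],
    ("Not Given Flu Shot", "No Health Plan"):
        [0.008378982, -0.030162439, 0.027414591, 0.0090729097, -0.014704044],
}


def largest_residual(values):
    """Return (class label, residual) for the largest absolute residual."""
    index = max(range(len(values)), key=lambda i: abs(values[i]))
    return CLASSES[index], values[index]


summary = {}
for case, values in RESIDUALS.items():
    label, value = largest_residual(values)
    summary[case] = (label, value, sum(values))
    print(f"{case}: largest = {label} ({value:+.4f}), "
          f"row sum = {sum(values):+.1e}")
```

The two insured cases have noticeably smaller residuals than the two uninsured cases, which is consistent with the insured groups being far larger and therefore better estimated.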

Group 11 believes that these findings reveal more surprising information about mindsets of individuals who choose to receive flu vaccinations than about health insurance coverage. Going into the modeling, Group 11 expected general health scores to be worse for individuals who do not have access to insurance coverage; this expectation can be seen when comparing classes with the same vaccination status but different insurance coverage. However, Group 11 was surprised to learn that individuals who did not receive a flu vaccination had higher self-rated health scores for both insured and non-insured classes. A possible explanation for this phenomenon is that individuals who choose to not receive a flu shot believe that they are healthy enough to not need the vaccine. Another possible explanation is that individuals who make an effort to get a flu vaccine are more conscious about their health, leading them to give a more conservative estimate of their overall self-rated health. However, the findings of Group 11’s modeling bring up important policy implications for both insurance coverage and for vaccinations which will be explored further within the conclusion.

CONCLUSION

Throughout the modeling of BRFSS data, Group 11 sought to analyze the implications and predictors of healthcare coverage. The goal of Question 1 was to predict healthcare coverage status based on inputs of employment and income, while the goal of Question 2 was to use healthcare coverage along with flu vaccination status to predict distributions of self-rated health. For Question 1, a key finding was that the modeling had similar accuracy when used on both pre- and post-recession data, potentially requiring more complex analysis into the relationship between economic recession and health. Additionally, modeling found that while including vaccination status made predictions more accurate, income remains the strongest predictor for whether an individual has insurance coverage. Modeling for Question 2 was effective at predicting distributions within each class - analysis of each distribution found that for both groups with and without insurance coverage, individuals who received flu shots rated their perceived general health lower than their unvaccinated counterparts. Additionally, the class with the lowest proportion of “Poor” health responses was the group with insurance coverage who did not receive a flu vaccine, while the class with the highest proportion of “Poor” responses was the group without insurance coverage who did receive a flu shot. Individuals who have healthcare coverage and received a flu shot had a higher actual proportion of “Poor” responses compared to individuals without healthcare coverage and without a flu shot. This difference is incredibly surprising when considering that these groups should hypothetically be the healthiest and least healthy, respectively. The findings reflect stark differences in self-rated health between different predictor classes, prompting further research on the implications of self-rated health scores.

To apply the modeling more holistically, it would be necessary to find a dataset in which many of the variables observed were collected for the entire population. Many of the questions in the original datasets, such as those pertaining to which doctors individuals see, multiple resources, evaluation of physical and mental symptoms as pertaining to major disorders, and descriptors of employment, were only asked of very small subsets of respondents. Generally, these more nuanced questions were only asked of around 1,000 individuals or fewer, even if they could have been asked of the entire population. This was by study design; if one question received a certain response, e.g. “No”, then several other questions would never have been asked. The inherently skewed distribution of the health plan responses may necessitate treating each response group as a separate population in order to tease out associations with greater clarity. Predicting multiclass variables such as the reason for lack of coverage would necessitate much more complex classification models, such as decision trees or lazy learning-oriented approaches. k-Nearest Neighbors attempts were completely unusable given the skewed proportions, which is why other methods should be sought out.

When looking at the policy implications of Group 11’s modeling, there is not a single path forward. Although predicting whether an individual has insurance coverage is tricky, both employment and income do appear to have a relationship with the probability of having access to coverage. Thus, it is critical not only to improve access to health insurance, but also to work on increasing both the levels and quality of employment, leading to better income and ability to afford healthcare. Additionally, there seems to be a relationship, although limited, between health insurance and vaccination status (both for flu and pneumonia vaccines). Thus, to further improve accessibility to vaccinations, expansion of healthcare coverage should be prioritized in the context of health policy. However, although the health benefits of vaccines are scientifically backed, individuals who do not receive vaccines may not believe they are important, since respondents who did not receive a flu vaccination had higher perceptions of their own general health. Because of these differences explored in the second model, it is critical to prioritize health education, both within the education system and within the adult population. Further research should be conducted on societal perceptions of vaccines and their relationship to overall health. Ultimately, it is critical for government bodies to continue research on health risk factors and outcomes, as determinants of health often impact nearly every facet of government and social policy.

Final Paper: Healthcare CDC Data Exploration

STOR 320.02 Group 11

December 05, 2019