
Learning From Health
A data science project that leverages CDC data to develop predictive models for income based on health statistics like diabetes and cholesterol, exploring the correlation between health disparities and economic status.
Role
- Data Scientist
Timeline
- January 2022
Technologies
- Python
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
Tools
- Jupyter Notebook
Background
Recent trends show an inverse correlation between obesity rates and household income, influenced by geographical, cultural, and racial factors. This project sought to quantify and model the connection between income and general physical health.
Solution
We hypothesized that income could be predicted from health conditions and geographic location. Using CDC data, we developed a random forest regressor model that confirmed this hypothesis, demonstrating a strong predictive link between health indicators and income levels.
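The modeling idea can be illustrated with a minimal sketch. This is not the project's actual code: the column names (diabetes_pct, high_cholesterol_pct, obesity_pct, region, median_income), the synthetic data, and the hyperparameters are all assumptions made up for illustration, standing in for the CDC data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def make_synthetic_health_data(n_rows=500, seed=0):
    """Build a made-up dataset; all columns and coefficients are hypothetical."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "diabetes_pct": rng.uniform(5, 20, n_rows),
        "high_cholesterol_pct": rng.uniform(20, 45, n_rows),
        "obesity_pct": rng.uniform(15, 45, n_rows),
        "region": rng.choice(["northeast", "south", "midwest", "west"], n_rows),
    })
    # Income falls as the health indicators rise, plus random noise.
    noise = rng.normal(0, 2000, n_rows)
    df["median_income"] = (
        90000
        - 1500 * df["diabetes_pct"]
        - 400 * df["high_cholesterol_pct"]
        - 300 * df["obesity_pct"]
        + noise
    )
    return df


def fit_income_model(df, seed=0):
    """Fit a random forest regressor and return it with its held-out R^2."""
    # One-hot encode the geographic feature so the model can use it.
    X = pd.get_dummies(df.drop(columns="median_income"), columns=["region"])
    y = df["median_income"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    # For regressors, score() returns the coefficient of determination (R^2).
    return model, model.score(X_test, y_test)


data = make_synthetic_health_data()
model, r2 = fit_income_model(data)
print(f"held-out R^2: {r2:.3f}")
```

Scoring on a held-out split rather than on the training rows is a deliberate choice here, since a random forest can fit its training data almost perfectly.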
Process
- Identified existing background research, found relevant datasets from the CDC, and defined a specific research question.
- Performed extensive data cleaning and pre-processing to prepare the data for analysis and modeling.
- Carried out Exploratory Data Analysis (EDA), developed and compared models, and interpreted the results in a final report.
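The cleaning and EDA steps above might look roughly like the following sketch. The raw table, its column names, and its values are invented for illustration and do not come from the CDC datasets; the cleaning rules shown (normalizing headers, coercing to numeric, dropping duplicates and incomplete rows) are one reasonable set of choices, not necessarily the ones used in the project.

```python
import pandas as pd

# Hypothetical raw extract with messy headers, a text placeholder,
# a missing value, and a duplicated row.
raw = pd.DataFrame({
    "State ": ["A", "B", "C", "D", "E", "E", "F"],
    "Diabetes (%)": ["10.1", "12.4", "n/a", "8.3", "14.0", "14.0", "9.5"],
    "High Cholesterol (%)": ["30.2", "35.0", "33.1", "27.9", "38.4", "38.4", "29.0"],
    "Median Income": ["61000", "52000", "58000", None, "47000", "47000", "64000"],
})


def clean(df):
    """Return a tidy copy: snake_case headers, numeric columns, no gaps."""
    out = df.copy()
    # Normalize headers, e.g. "Diabetes (%)" becomes "diabetes".
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    numeric_cols = out.columns.drop("state")
    # Unparseable entries such as "n/a" become NaN instead of raising.
    out[numeric_cols] = out[numeric_cols].apply(pd.to_numeric, errors="coerce")
    out = out.drop_duplicates().dropna(subset=numeric_cols)
    return out.reset_index(drop=True)


tidy = clean(raw)

# EDA: pairwise Pearson correlations between the numeric columns.
corr = tidy.drop(columns="state").corr()
print(corr.round(2))
```

The resulting correlation matrix is the kind of table that Seaborn's heatmap function can render directly for a visual overview.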
Final Product

Impact
- Developed a random forest regressor model that predicts income from health data, with a reported score of 99.62% (described as accuracy; for a regressor this is most likely the R² score).
- Demonstrated a strong, near-linear relationship between true income and the income predicted by the model.
- Successfully conducted an end-to-end data science project, from data collection and cleaning to analysis and interpretation.
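As a rough illustration of how the fit between true and predicted income can be quantified, the sketch below computes the R² score, the Pearson correlation, and the mean absolute error. The income values are hypothetical and are not the project's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted incomes for five held-out rows.
y_true = np.array([42000, 51000, 58000, 63000, 75000], dtype=float)
y_pred = np.array([43000, 50000, 59500, 62000, 74000], dtype=float)

# R^2: share of the variance in true income explained by the predictions.
r2 = r2_score(y_true, y_pred)

# Pearson correlation: how close the true-vs-predicted points lie to a line.
pearson = np.corrcoef(y_true, y_pred)[0, 1]

# Mean absolute error, in the same units as income.
mae = mean_absolute_error(y_true, y_pred)

print(f"R^2 = {r2:.4f}, Pearson r = {pearson:.4f}, MAE = {mae:.0f}")
```

Reporting an error in income units alongside R² helps, because a high R² alone does not say how large a typical miss is.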
Reflection
- Most health conditions analyzed showed positive correlations with each other and a negative correlation with income, confirming initial assumptions.
- A major limitation of the dataset is that it excludes uninsured people. Including them would likely skew the results further, which makes this a key consideration for future work.