Learning From Health
A data science project that leverages CDC data to develop predictive models for income based on health statistics like diabetes and cholesterol, exploring the correlation between health disparities and economic status.
Role
Data Scientist
Timeline
January 2022
Technologies
- Python
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
Tools
- Jupyter Notebook
Background
Learning From Health is a data science project that analyzes the relationship between health indicators and household income using CDC datasets. The project aimed to quantify how health conditions such as diabetes and cholesterol correlate with economic status. I worked as part of a team to develop predictive models that estimate income levels from health statistics and geographic factors.
Solution
We developed a random forest regressor model using Python and scikit-learn to predict household income from health indicators and location data. The model achieved 99.62% accuracy and demonstrated a strong predictive relationship between health conditions and income levels. I contributed to data preprocessing, feature engineering, and model evaluation. The analysis revealed significant correlations between various health conditions and economic status, with geographic location serving as an important predictive factor.
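The modeling approach described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the column names (`diabetes_pct`, `high_cholesterol_pct`, `region`, `median_income`), the coefficients used to generate the data, and the hyperparameters are all assumptions made for the example. The write-up does not say which metric the 99.62% figure refers to, so the sketch reports R² and mean absolute error, two common choices for a regression model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CDC data: every column name and number is hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "diabetes_pct": rng.uniform(5, 20, n),
    "high_cholesterol_pct": rng.uniform(20, 45, n),
    "region": rng.choice(["Northeast", "South", "Midwest", "West"], n),
})
region_offset = df["region"].map(
    {"Northeast": 8000, "South": -5000, "Midwest": 0, "West": 6000}
)
df["median_income"] = (
    90000
    - 1500 * df["diabetes_pct"]
    - 400 * df["high_cholesterol_pct"]
    + region_offset
    + rng.normal(0, 3000, n)
)

# One-hot encode the location column so the forest can split on it.
X = pd.get_dummies(df.drop(columns="median_income"), columns=["region"], dtype=float)
y = df["median_income"]

# Hold out 20% of the rows so the scores reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)

# Impurity-based importances give a rough view of which inputs drive predictions.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(f"R^2 on held-out data: {r2:.3f}")
print(f"Mean absolute error: {mae:.0f}")
print(importances)
```

Scoring on a held-out split matters here: a random forest can fit its training rows almost perfectly, so a score computed on the training data would overstate how well the model generalizes.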
Process
I began by researching existing literature and identifying relevant CDC datasets for the analysis. I performed extensive data cleaning to handle missing values, outliers, and inconsistent formatting across multiple data sources. I conducted exploratory data analysis using Pandas, Matplotlib, and Seaborn to visualize relationships and identify key features. I developed and compared multiple regression models, ultimately selecting a random forest approach for its performance. The final step involved interpreting results and documenting findings in a comprehensive report.
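Two of the steps above, cleaning and model comparison, might look something like the sketch below. The raw table, its column names, the 0 to 100 validity rule for percentages, and the choice of linear regression as the baseline are all assumptions for illustration; the project's actual sources and candidate models are not listed here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# --- Cleaning: hypothetical raw extract with typical problems ---
raw = pd.DataFrame({
    "State": [" ohio", "Texas ", "texas", "Maine", "OHIO", "Utah"],
    "diabetes_pct": ["10.5", "12.1", None, "9.8", "110.0", "8.9"],
    "median_income": [58000, 61000, 60500, np.nan, 57500, 71000],
})

clean = raw.copy()
# Normalize inconsistent formatting in the location column.
clean["State"] = clean["State"].str.strip().str.title()
# Parse numeric strings; anything unparseable becomes NaN.
clean["diabetes_pct"] = pd.to_numeric(clean["diabetes_pct"], errors="coerce")
# Treat percentages outside 0..100 as data-entry errors (an assumed rule).
clean.loc[~clean["diabetes_pct"].between(0, 100), "diabetes_pct"] = np.nan
# Drop rows without a target, then fill missing features with the median.
clean = clean.dropna(subset=["median_income"])
clean["diabetes_pct"] = clean["diabetes_pct"].fillna(clean["diabetes_pct"].median())

# --- Model comparison: synthetic data with a nonlinear signal ---
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 3))
y = (
    50000
    + 20000 * np.sin(3 * X[:, 0])
    - 15000 * X[:, 1] * X[:, 2]
    + rng.normal(0, 1000, 300)
)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
# Mean 5-fold cross-validated R^2 for each candidate.
scores = {
    name: cross_val_score(est, X, y, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}
best = max(scores, key=scores.get)

print(clean)
print(scores, "->", best)
```

Cross-validation is used for the comparison so that each candidate is scored on rows it did not train on, which keeps a flexible model from winning only because it memorized the data.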
Impact
The random forest model predicted income with 99.62% accuracy, demonstrating a strong correlation between health indicators and economic status. The analysis showed how health disparities vary across income levels and geographic regions, adding to our understanding of the relationship between socioeconomic factors and public health outcomes. The findings highlight the importance of considering health and economic data together when addressing public health challenges.
Reflection
This project improved my skills in data preprocessing and feature engineering with real-world datasets. I learned to navigate the challenges of working with public health data, including handling missing values and ensuring data quality. The analysis revealed important limitations, particularly that the dataset excluded uninsured populations, which could affect model generalizability. If I continued this work, I would explore additional data sources to address these limitations and validate the model's performance on more diverse populations.