Data Science for Global Health: Using Machine Learning for Sustainable Food Security

Challenge


Climate change, shifting soil conditions, and population growth are making it increasingly difficult to secure equitable global access to nutritious food. Addressing this challenge requires identifying which crops can grow productively in each region, in support of both food security and long-term health outcomes. This project explored how machine learning can analyze agricultural and environmental data to recommend optimal crops and inform strategies for reducing hunger and improving global health.

“How might we use data to guide crop decisions that support higher yields and better global nutrition?”

Methods


Literature Review – Researched agricultural and environmental studies to understand how crop selection affects food security and health outcomes.


Exploratory Analysis – Used R (ggplot2, dplyr, cluster) for data cleaning, visualization, and descriptive analytics, including K-cluster analysis to segment regions by soil, climate, and yield conditions.


Predictive Modeling – Tested multiple machine learning approaches in R, including Regression Trees, Boosting, Bagging, Random Forest, and Neural Networks. In a comparative evaluation, Random Forest emerged as the best-performing model for predicting crop yield potential.


Prescriptive Analytics – Used Python to run non-linear optimization models that translated predictive outputs into actionable crop recommendations, balancing productivity with risk.
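
The regional segmentation described in the exploratory step can be illustrated with a small k-means example. This is a minimal sketch under stated assumptions, not the project's code: the project did this step in R, whereas the sketch uses plain Python; it assumes that "K-cluster analysis" refers to k-means; and the region names and feature values (soil pH, annual rainfall, mean temperature) are invented for illustration.

```python
import random
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit variance so no feature dominates."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical regions: (soil pH, annual rainfall in mm, mean temperature in °C).
regions = {
    "region_1": [6.5, 420.0, 27.0],
    "region_2": [6.8, 390.0, 29.0],
    "region_3": [7.1, 450.0, 28.0],
    "region_4": [5.4, 1150.0, 18.0],
    "region_5": [5.6, 1230.0, 17.0],
    "region_6": [5.2, 1100.0, 19.0],
}

features = standardize(list(regions.values()))
labels, centers = kmeans(features, k=2)
clusters = dict(zip(regions, labels))
print(clusters)
```

Standardizing first matters here because rainfall in millimetres would otherwise swamp soil pH in the distance calculation. In practice the number of clusters would be chosen with a diagnostic such as an elbow plot rather than fixed at 2.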

Insights


Food demand is outpacing supply. The global population is projected to reach 9.7 billion by 2050, and nearly 9% of people are already undernourished.


Regional data matters. Cluster analysis revealed distinct groupings of regions by soil and climate conditions, emphasizing how crop planning must be tailored to local conditions.


Climate change is shrinking yields. Heat stress and shorter growing seasons threaten crop production, and their effects vary by region. Traditional crop-planning tools are falling short because current crop simulation models rarely account for stressors such as extreme weather or pests.


Data analytics can improve outcomes. Prior research suggests that data-driven crop management can raise yields by more than climate change alone reduces them.


Random Forest was the strongest performer. The model proved more accurate than regression trees, boosting, bagging, or neural networks, demonstrating its effectiveness for forecasting crop yields under varying conditions.
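
A comparative evaluation of this kind can be sketched as k-fold cross-validation, scoring each candidate model by root-mean-square error (RMSE) on held-out data. The sketch below is illustrative only and rests on several assumptions: it is written in plain Python rather than the R used in the project, the two candidates (a single regression stump and a bagged ensemble of stumps) are simple stand-ins for the models actually compared, and the rainfall, temperature, and yield data are synthetic. Its scores say nothing about the project's results.

```python
import random
import statistics


def fit_stump(X, y):
    """Depth-1 regression tree: pick the single split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = statistics.fmean(left), statistics.fmean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: predict the overall mean
        mean_y = statistics.fmean(y)
        return lambda row: mean_y
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def fit_bagged_stumps(X, y, n_models=25, seed=0):
    """Bagging: average stumps trained on bootstrap resamples of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: statistics.fmean(m(row) for m in models)


def cv_rmse(fit, X, y, k=5, seed=0):
    """k-fold cross-validated RMSE for a fit(X, y) -> predict(row) learner."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        sq_errors += [(model(X[i]) - y[i]) ** 2 for i in fold]
    return (sum(sq_errors) / len(sq_errors)) ** 0.5


# Synthetic data: yield depends on rainfall (mm) and temperature (°C), plus noise.
rng = random.Random(42)
X = [[rng.uniform(200, 1200), rng.uniform(10, 35)] for _ in range(120)]
y = [0.004 * r - 0.05 * (t - 22) ** 2 + rng.gauss(0, 0.3) for r, t in X]

scores = {
    "single stump": cv_rmse(fit_stump, X, y),
    "bagged stumps": cv_rmse(fit_bagged_stumps, X, y),
}
best_model = min(scores, key=scores.get)
for name, rmse in scores.items():
    print(f"{name}: RMSE = {rmse:.3f}")
print("lowest cross-validated RMSE:", best_model)
```

Every model is scored on the same folds, so differences in RMSE reflect the models rather than the data split. The same harness extends to any learner that exposes a fit function returning a predictor.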

The Random Forest model identified the top-performing crop varieties, and non-linear optimization then produced the recommended planting ratio. Variety V95 received the largest allocation of farmland (25%), demonstrating the system’s ability to distinguish subtle differences between varieties and turn them into concrete recommendations.
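
One way to turn predicted yields into a planting ratio is a mean-variance style allocation: maximize expected yield minus a penalty on yield variance, with farmland shares that are non-negative and sum to 1. The sketch below solves this quadratic (hence non-linear) problem by projected gradient ascent in plain Python. It is an assumption-laden illustration rather than the project's formulation: the objective, the risk_aversion parameter, the assumption that variety yields are independent, and the variety names and numbers are all hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # the last index meeting the condition fixes the shift
    return [max(x - theta, 0.0) for x in v]


def objective(w, mean_yield, yield_sd, risk_aversion=1.0):
    """Expected yield minus a variance penalty (yields assumed independent)."""
    expected = sum(wi * m for wi, m in zip(w, mean_yield))
    variance = sum((wi * s) ** 2 for wi, s in zip(w, yield_sd))
    return expected - risk_aversion * variance


def allocate(mean_yield, yield_sd, risk_aversion=1.0, step=0.1, iters=500):
    """Projected gradient ascent starting from equal shares.

    The step should stay below 1 / (2 * risk_aversion * max(sd) ** 2) so that
    every iteration improves the objective.
    """
    n = len(mean_yield)
    w = [1.0 / n] * n
    for _ in range(iters):
        grad = [m - 2.0 * risk_aversion * wi * s ** 2
                for m, wi, s in zip(mean_yield, w, yield_sd)]
        w = project_to_simplex([wi + step * g for wi, g in zip(w, grad)])
    return w


# Hypothetical inputs: predicted mean yield and its standard deviation per variety.
varieties = ["variety_A", "variety_B", "variety_C", "variety_D"]
predicted_yield = [6.1, 5.9, 5.8, 5.5]
yield_sd = [1.2, 0.8, 0.6, 0.5]

shares = allocate(predicted_yield, yield_sd)
for name, share in zip(varieties, shares):
    print(f"{name}: {share:.1%}")
```

Raising risk_aversion shifts land toward varieties with steadier predicted yields, while setting it to zero puts all land in the variety with the highest predicted mean. A production model would likely add constraints such as per-variety caps, which a general-purpose solver handles more conveniently than this hand-rolled loop.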

Outcomes


Integrated climate and soil data into a predictive model to identify not only the highest-performing crops for limited farmland, but also those that are most resilient under unstable environmental conditions.


Built and validated a Random Forest model that accurately estimated regional crop yields across varying conditions.


Applied the model to a target farm and identified a crop combination made up of the varieties expected to be most productive and resilient in that environment.

By applying machine learning to environmental and weather data, this project demonstrated a reproducible framework for identifying optimal and resilient crop varieties to address resource constraints and global nutrition needs.

© 2025 All Rights Reserved