# California Food Affordability Analysis

## Overview
This project analyzes food affordability across places in California and investigates how affordability varies with socioeconomic, geographic, and demographic context. The motivation is to understand where affordability burdens are highest and which contextual predictors are most informative, using interpretable statistical tests and simple regression models.
The analysis is organized into four research questions (RQs), implemented as notebooks. Across the project we emphasize reproducibility by (1) using a shared utilities module, (2) saving outputs to disk, and (3) including unit tests for utility functions.
## Dataset
We use the Food Affordability indicator published by the California Department of Public Health on the State of California Open Data portal. The dataset describes the average cost of a nutritious market basket relative to income for female-headed households with children, reported for California and multiple geographic levels (regions/counties/places).
- Source: https://
- Temporal coverage: 2006–2010
- Publisher: California Department of Public Health
## Project Website

The project's JupyterBook website can be accessed here.
## Repository Structure
- `main.ipynb`: Main narrative notebook (project summary + key results)
- `EDA.ipynb`: Exploratory analysis (tables/figures and initial patterns)
- `model_rq1.ipynb`: RQ1 modeling (baseline vs. ridge; small interpretable feature sets)
- `model_rq2.ipynb`: RQ2 inference (one-way ANOVA + Tukey HSD)
- `model_rq3.ipynb`: RQ3 exploration (summaries/visualizations by key groupings)
- `model_rq4.ipynb`: RQ4 modeling (preprocessing + evaluation variants)
- `data/`: Input dataset
- `figures/`: Generated figures
- `outputs/`: Generated tables/metrics
- `pdf_builds/`: PDF exports of the analysis notebooks
- `_build/`: MyST build artifacts
- `utils/`: Utility module(s)
- `utils/model_utils.py`: Shared modeling/stats helpers
- `utils/tests/`: Unit tests (pytest)
- `environment.yml`: Reproducible environment specification
- `conftest.py`: Pytest configuration for imports
- `Makefile`: Common automation commands
- `myst.yml`: MyST configuration
- `_toc.yml`: JupyterBook table of contents
- `references.bib`: Bibliography file for citations
- `project-description.md`: Short project description / write-up used for the site
- `ai-documentation.txt`: AI usage documentation (prompts/tools used, per course policy)
- `LICENSE`: Repository license terms
## Setup and Installation

Clone this repository:

```bash
git clone https://github.com/UCB-stat-159-f25/final-group18
cd final-group18
```

Create and activate the environment, then register the Jupyter kernel:

```bash
mamba env create -f environment.yml --name stat159-env
conda activate stat159-env
python -m ipykernel install --user --name stat159-env --display-name "IPython - stat159-env"
```

## Usage
### Run notebooks
Open JupyterLab and run:
- `EDA.ipynb`
- `model_rq1.ipynb`
- `model_rq2.ipynb`
- `model_rq3.ipynb`
- `model_rq4.ipynb`
- `main.ipynb`
Figures and tables are written to `figures/` and `outputs/`.
### Automation (MyST)

Build the MyST site / configured exports:

```bash
myst build
```

Build PDF exports:

```bash
myst build --pdf
```

## Package Structure (`utils/model_utils.py`)
The `model_utils` module provides small reusable helpers used across RQ1–RQ4.
### Metrics

- `rmse(y_true, y_pred)`: Computes the root mean squared error (RMSE) between true and predicted values.
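For illustration, a minimal sketch of such a helper (the project's actual implementation lives in `utils/model_utils.py` and may differ in detail):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

For example, `rmse([1.0, 2.0], [1.0, 4.0])` evaluates to `sqrt((0 + 4) / 2) ≈ 1.414`.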
### RQ1: Interpretable regression modeling

- `make_ridge_model(feature_list, X_ref)`: Builds a RidgeCV regression pipeline that one-hot encodes categorical predictors, scales numeric predictors, fits on `log1p(y)`, and returns predictions on the original scale via `expm1`.
- `eval_model_rq1(name, model, X_train, X_test, y_train, y_test)`: Fits a model and returns a metrics dictionary (`RMSE`, `MAE`, `R2`) labeled with `name`.
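A hedged sketch of how a pipeline like `make_ridge_model` could be assembled with scikit-learn (the function name and the column-splitting heuristic here are illustrative, not the project's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_ridge_model_sketch(feature_list, X_ref):
    """RidgeCV pipeline: one-hot encode categoricals, scale numerics,
    fit on log1p(y), and map predictions back with expm1."""
    cat = [c for c in feature_list if X_ref[c].dtype == object]
    num = [c for c in feature_list if c not in cat]
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat),
        ("num", StandardScaler(), num),
    ])
    pipe = Pipeline([
        ("pre", pre),
        ("ridge", RidgeCV(alphas=np.logspace(-3, 3, 13))),
    ])
    # Fit on log1p(y); predictions are inverted back via expm1.
    return TransformedTargetRegressor(regressor=pipe,
                                      func=np.log1p, inverse_func=np.expm1)
```

The `TransformedTargetRegressor` wrapper is what keeps the log-scale fit transparent: callers still pass `y` on the original scale and receive predictions on the original scale.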
### RQ2: Group mean inference

- `one_way_anova(df, target, group, min_group_size=5)`: Runs a one-way ANOVA comparing mean `target` across levels of `group`, excluding groups with fewer than `min_group_size` observations. Returns `{F, p_value, n_groups, n_total}`.
- `tukey_hsd(df, target, group, alpha=0.05)`: Runs Tukey's HSD post-hoc pairwise comparisons for `target` across `group`. Returns a tidy summary table as a DataFrame.
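A minimal sketch of the ANOVA helper using `scipy.stats.f_oneway` (the actual helper may differ; a Tukey HSD counterpart would typically wrap `statsmodels`' `pairwise_tukeyhsd`):

```python
import pandas as pd
from scipy import stats

def one_way_anova_sketch(df, target, group, min_group_size=5):
    """One-way ANOVA on `target` across levels of `group`,
    dropping groups with fewer than `min_group_size` observations."""
    samples = [g[target].to_numpy()
               for _, g in df.groupby(group)
               if len(g) >= min_group_size]
    F, p = stats.f_oneway(*samples)
    return {"F": float(F), "p_value": float(p),
            "n_groups": len(samples),
            "n_total": sum(len(s) for s in samples)}
```

Filtering small groups before the test keeps the group means from being dominated by noisy, sparsely observed levels.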
### RQ4: Preprocessing + evaluation utilities

- `make_ohe(dense)`: Creates a version-compatible `OneHotEncoder` with `handle_unknown="ignore"` (works across scikit-learn versions).
- `split_cols(X, feature_list)`: Splits the provided feature list into categorical vs. numeric columns using pandas dtypes from `X`.
- `make_preprocessor(feature_list, X_ref, dense=False, scale_num_for_linear=True)`: Builds a `ColumnTransformer` that one-hot encodes categorical columns and optionally scales numeric columns (useful for linear models).
- `wrap_log1p(model_pipeline)`: Wraps an estimator/pipeline with a `log1p` target transform (and `expm1` inverse) using `TransformedTargetRegressor`.
- `eval_model_rq4(model, Xtr, Xte, ytr, yte)`: Fits a model and returns `(metrics, predictions)`, where `metrics` includes `RMSE`, `MAE`, and `R2`.
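The version-compatibility issue `make_ohe` addresses is the scikit-learn 1.2 rename of `OneHotEncoder`'s `sparse` parameter to `sparse_output`. One plausible way to handle it, together with a `wrap_log1p`-style wrapper (both sketches, assuming this is the compatibility gap the helpers target):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import OneHotEncoder

def make_ohe_sketch(dense=False):
    """OneHotEncoder that ignores unseen categories, handling the
    sparse -> sparse_output rename in scikit-learn 1.2."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=not dense)
    except TypeError:  # scikit-learn < 1.2
        return OneHotEncoder(handle_unknown="ignore", sparse=not dense)

def wrap_log1p_sketch(model_pipeline):
    """Fit the wrapped estimator on log1p(y); invert predictions with expm1."""
    return TransformedTargetRegressor(regressor=model_pipeline,
                                      func=np.log1p, inverse_func=np.expm1)
```

With `handle_unknown="ignore"`, categories seen only in the test split encode to an all-zero row instead of raising at predict time.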
## Testing

Run tests from the repository root:

```bash
pytest -q
```

## License
This project is licensed under the BSD 3-Clause License.
## Additional Information
See `project-description.md` for the assignment specification and `main.ipynb` for the final narrative results.
- Lee, S., Zhang, D. Y., Chung, H., Leng, D., Jimmy Butler, & sdny2 berkeley. (2025). UCB-stat-159-f25/final-group18: v1.0.1 – Final Project Release. Zenodo. https://doi.org/10.5281/zenodo.17970438