Modeling Food Access Inequality in Los Angeles

How can one determine which socioeconomic indicators are most related to food insecurity in order to identify at-risk communities?

Naman Casas

December 12, 2021

Introduction

The United States Department of Agriculture defines a food desert as a geographic area in which the majority of the population does not have access to healthy, affordable food options (povertyusa.org). In an urban zone like Los Angeles, California, a food desert is more specifically a 1-mile radius in which the majority of the population lacks access to affordable, healthy food options (povertyusa.org). Food deserts are also defined by the socioeconomic status of the area, specifically a poverty rate over 20% (ers.usda.org). When faced with low food access, people are pushed toward unhealthy options, especially fast food. Even when healthy food options are reintroduced through grocery infrastructure, researchers find that affected populations continue to rely on the cheaper fast-food options because they carry a lighter economic burden (National Research Council 2009). Ultimately, communities that live in food deserts are afflicted with high rates of cardiovascular disease, diabetes, and other diet-related diseases (National Research Council 2009).

In large cities, food deserts exacerbate these problems to an even greater degree because higher population density places a far larger number of people in zones of low food access. In Los Angeles specifically, 1 in 10 people experience food insecurity following a spike in the insecurity rate during the height of the COVID-19 pandemic (Miller 2021). Furthermore, only 1 in 4 of these food-insecure households were enrolled in the country's Supplemental Nutrition Assistance Program (Miller 2021). Most communities in Los Angeles rely heavily on local corner stores for grocery runs. Although convenient, these corner stores carry heavy stocks of cheap yet unhealthy food and drink options, which become the first choice for shoppers in nearly all low-income populations (Los Angeles Food Policy Council 2017). Because food deserts are clearly tied to socioeconomic indicators, understanding which indicators are most related to the existence of food deserts, and to what degree, could be used to determine the communities at the highest risk of food insecurity.

To model the spatial relationship between socioeconomic/demographic indicators and food insecurity, this analytical methodology uses correlational regression analysis. It applies exploratory regression, generalized linear regression, local bivariate relationships, and geographically weighted regression to determine the ideal socioeconomic/demographic indicators for modeling food insecurity, how these indicators relate to the dependent variable of insecurity, and which spatial model is best suited to predicting food insecurity metrics. Grocery store data was pulled from the County of Los Angeles open data site, and the demographic information came from Esri's US Census demographic data, acquired from the Los Angeles GeoHub site. This analytical workflow walks through four topics: how food inequality is defined through grocery store density and how demographic information is tied to each grocery store in Los Angeles; how to determine the ideal socioeconomic/demographic indicators for regression analysis through observation of linear relationships and exploratory regression analysis; the results of both Generalized Linear Regression and Geographically Weighted Regression on food access; and, finally, which regression tool best modeled food insecurity in the City of Los Angeles.

Data Wrangling

Filtering & Geocoding

Because the grocery store dataset is formatted as a .csv file, the first step of this methodology was to geocode the addresses available in the dataset. This plots every LA County grocery store and restaurant in the dataset as a point on the map scene. Next, I selected from this point dataset by its city and facility type attributes to filter the data down to a point dataset of only grocery stores within our spatial extent, the City of Los Angeles.
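The attribute filter described above can be sketched outside of ArcGIS Pro as a plain CSV filter. This is only an illustration: the column names `CITY` and `FACILITY_TYPE`, the category labels, and the sample rows are assumptions, not the actual schema of the County of Los Angeles dataset.

```python
import csv
import io

# Hypothetical sample rows; the real County dataset's columns and labels may differ.
SAMPLE_CSV = """FACILITY_ID,FACILITY_NAME,CITY,FACILITY_TYPE
1,Corner Market,LOS ANGELES,GROCERY STORE
2,Taco Stand,LOS ANGELES,RESTAURANT
3,Valley Grocer,BURBANK,GROCERY STORE
4,Eastside Foods,Los Angeles,Grocery Store
"""


def filter_grocery_stores(csv_text, city="LOS ANGELES", facility_type="GROCERY STORE"):
    """Keep only rows whose city and facility type match (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row["CITY"].strip().upper() == city
        and row["FACILITY_TYPE"].strip().upper() == facility_type
    ]


stores = filter_grocery_stores(SAMPLE_CSV)
print([row["FACILITY_ID"] for row in stores])
```

Normalizing case and whitespace before comparing guards against inconsistent data entry, which is common in address-level open data.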

Defining Food Access & Explanatory Demographics

Created a 1-mile buffer around each point in the final grocery store dataset produced by the filtering & geocoding workflow above.
Used Spatial Join to join the original grocery store point data to the buffer zones produced in the previous step (match option = completely contains). The resulting Join Count field is the number of grocery store points that fall in each buffer zone.
Used the Join Field tool to transfer the Join Count values produced in the previous step to the original grocery store point dataset. Joined using the shared FACILITY ID field.
Used Spatial Join to join the demographic information from an LA Census dataset to the grocery store point dataset produced above. Set the Match Option to intersect, which joins the demographic information from each census polygon to any grocery store points that fall within it. This consolidates both the dependent variable (Join Count field) and the explanatory variables (LA Census demographic data) into one dataset.
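The buffer and spatial join steps above amount to counting, for each grocery store, how many stores lie within one mile of it. A minimal sketch of that count follows, using great-circle distance in place of ArcGIS buffers. The coordinates are made up, and the sketch assumes each store counts itself, since a store's own point falls inside its own buffer.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def join_counts(stores, radius_miles=1.0):
    """For each store ID, count the stores (itself included) within the radius."""
    counts = {}
    for store_id, (lat, lon) in stores.items():
        counts[store_id] = sum(
            1
            for other_lat, other_lon in stores.values()
            if haversine_miles(lat, lon, other_lat, other_lon) <= radius_miles
        )
    return counts


# Hypothetical store locations, not taken from the dataset.
stores = {
    "A": (34.0500, -118.2500),
    "B": (34.0550, -118.2500),  # roughly a third of a mile north of A
    "C": (34.2000, -118.5000),  # many miles away from A and B
}
print(join_counts(stores))
```

This pairwise loop is quadratic in the number of stores, which is workable for a few thousand points; a spatial index would be the usual choice for larger datasets.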

Regression Methodology

Exploratory Regression

**III. Scatterplot Matrix of Explanatory Variables & Corresponding Pearson's R Value**

Before conducting any regression modeling, I first had to determine which demographic indicators to use as explanatory variables in the regression model. By plotting all demographic indicators against my food inequality metric in a Scatter Plot Matrix, I observed the linear relationships between variables (bottom left of matrix) and quantified the strength of those relationships using the Pearson's R metric (top right of matrix). From here I used the exploratory regression tool to determine my list of explanatory variables for modeling my Join Count dependent variable. Of the 7 attributes chosen, Unemployment Rate, Diversity Index, Median Household Income, Per Capita Income, and Median Home Value were determined to be the best explanatory variables, with an Adjusted R-Squared value of ~0.32 (see table IV below).
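For reference, the Pearson's R value shown in the scatter plot matrix can be computed by hand as the covariance of two variables divided by the product of their standard deviations. The sketch below uses made-up values for unemployment rate and join count purely to illustrate the calculation; they are not drawn from the study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (hypothetical unemployment rates and join counts).
unemployment = [4.0, 6.0, 8.0, 10.0, 12.0]
join_count = [9, 7, 6, 4, 2]
print(round(pearson_r(unemployment, join_count), 3))
```

A value near -1 or 1 signals a strong linear relationship, while a value near 0 signals that a straight line describes the pair poorly, which is what makes the metric useful for screening candidate explanatory variables.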

**IV. Results of Exploratory Regression** (see top row/OBJECTID 44 for ideal regression results)

Generalized Linear Regression (GLR)

**V. Generalized Linear Regression Tool Setup**

Setting up this GLR model takes two steps: first deciding on the dependent and explanatory variables, then choosing the model type to run. The dependent variable for this GLR model is my join count variable, which indicates the density of grocery stores within a one-mile radius of the chosen grocery store. The explanatory variables for this model were determined in the exploratory regression analysis conducted previously, which found the ideal variables to be unemployment rate, diversity index, median household income, per capita income, and median home value. Finally, I chose a Continuous (Gaussian) model type for this GLR analysis, as a test run using the Poisson model type produced less accurate results.
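A Continuous (Gaussian) model corresponds to ordinary least squares regression. As a rough illustration of what such a fit and its Adjusted R-Squared diagnostic involve, the sketch below fits a single explanatory variable with the closed-form least squares solution. The study's model uses five explanatory variables and ArcGIS Pro's own implementation, so this is a simplified stand-in with made-up data.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(observed, predicted):
    """Share of variation in the observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_explanatory):
    """Penalize R-squared for the number of explanatory variables used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_explanatory - 1)


# Made-up data: per capita income (thousands of dollars) vs. join count.
income = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
join_count = [3.0, 4.0, 4.0, 7.0, 6.0, 9.0]

intercept, slope = fit_simple_ols(income, join_count)
predicted = [intercept + slope * x for x in income]
r2 = r_squared(join_count, predicted)
print(round(adjusted_r_squared(r2, n_obs=len(income), n_explanatory=1), 3))
```

The adjustment matters when comparing candidate models with different numbers of explanatory variables, because plain R-squared never decreases when a variable is added.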

Local Bivariate Relationships

**VI. Local Bivariate Relationship Tool Setup**

Next, I used the Local Bivariate Relationships tool to identify any local relationships between each of my explanatory variables (listed below) and my grocery store density dependent variable. I used the default settings of 90% confidence and 199 permutations; however, I increased the number of neighbors from the default of 30 to 100, because any lower value would cause multicollinearity in my GWR analysis.

Unemployment Rate
Diversity Index
Median Household Income
Per Capita Income
Median Home Value

Geographically Weighted Regression (GWR)

**VII. Geographically Weighted Regression Tool Setup**

To set up the GWR tool, I chose the same join count attribute, which expresses relative grocery store density, as the dependent variable and the same explanatory variables previously used in the GLR model. The Model Type remained Continuous (Gaussian) to maintain consistency across both models. To define neighborhoods for the regression model, I chose a user-defined number of neighbors and used the maximum number of neighbors possible (100 neighbors), as this allowed the model to avoid failure due to multicollinearity. Given the large count and density of the dataset (4,000+ grocery stores in the City of Los Angeles), this neighborhood size should not be so large as to water down the local spatial relationships in play. Finally, I chose a Bisquare local weighting scheme to avoid influence from more distant features. In this methodology I assume that once the distance to a grocery store exceeds a preferred distance, those searching for food will weigh other food options. Furthermore, after multiple trials with both Gaussian and Bisquare weighting schemes, Bisquare produced a more accurate model.
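The practical difference between the two weighting schemes is easiest to see in the kernel functions themselves. The sketch below uses common textbook forms of the bisquare and Gaussian kernels; the exact formulas ArcGIS Pro applies may differ in detail. It also assumes that, with a fixed number of neighbors, the bandwidth at each location is the distance to the k-th nearest feature. The distances are made up.

```python
import math


def bisquare_weight(distance, bandwidth):
    """Bisquare kernel: weight falls to exactly zero at and beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1 - (distance / bandwidth) ** 2) ** 2


def gaussian_weight(distance, bandwidth):
    """Gaussian kernel (one common form): weight decays but never reaches zero."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def adaptive_bandwidth(distances, n_neighbors):
    """Assumed adaptive bandwidth: distance to the k-th nearest feature."""
    return sorted(distances)[n_neighbors - 1]


# Hypothetical distances (miles) from one regression point to nearby stores.
distances = [0.2, 0.5, 0.9, 1.4, 2.0, 3.5]
bandwidth = adaptive_bandwidth(distances, n_neighbors=4)

for d in distances:
    print(d, round(bisquare_weight(d, bandwidth), 3), round(gaussian_weight(d, bandwidth), 3))
```

Because the bisquare kernel assigns zero weight past the bandwidth, features outside the neighborhood cannot influence the local regression at all, which matches the assumption that distant grocery stores stop being relevant to shoppers.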

Results & Implications

Generalized Linear Regression (GLR)

**I. GLR Standardized Residual Values**

GLR in ArcGIS Pro produced both a map of residuals and a statistical summary of the model, including regression diagnostics. In the residual map on the right, concentrated zones of under- and over-prediction can be observed: over-prediction is found primarily in the southernmost extension of the city and to the west at Los Angeles International Airport, whilst the primary zone of under-prediction lies directly west of Downtown Los Angeles near MacArthur Park. The histogram below, which compares the distribution of residuals to a normal distribution, indicates that the GLR model tends to over-predict grocery store density, as the peak residual value falls between -0.6 and -0.4. The model's Adjusted R-Squared value indicates that around 32% of the variation in the dependent variable across the study area is accounted for by this model. In addition, the statistically significant Jarque-Bera statistic confirms the non-normal distribution of residuals, whilst the statistical significance of the Koenker (BP) statistic indicates nonstationarity, priming the data for a Geographically Weighted Regression.
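The Jarque-Bera diagnostic referenced above tests residual normality through skewness and kurtosis. A minimal sketch of the standard statistic follows, with made-up residuals rather than the model's actual output. Under normality the statistic approximately follows a chi-square distribution with 2 degrees of freedom, so values above roughly 5.99 are significant at the 5% level.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4)."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (population form, dividing by n).
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


# Hypothetical residuals: mostly small negative values with a long positive tail,
# loosely mimicking a model that over-predicts most features.
residuals = [-0.6, -0.5, -0.5, -0.4, -0.4, -0.3, -0.2, 0.1, 0.8, 2.0]
print(round(jarque_bera(residuals), 3))
```

A skewed residual distribution such as this one inflates the skewness term, which is the pattern a histogram peaking below zero with a long right tail would produce.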

**II. Distribution of Standardized Residual Values** (Left) & **III. GLR Results Summary & Diagnostics** (Right)

Local Bivariate Relationships

**III. Median Home Value LBR**

The outputs of a Local Bivariate Relationships analysis are a map of the initial dataset, symbolized with unique colors, indicating the relationship between the explanatory variable and dependent variable at each point (Figure III), as well as a statistical summary of the results of the LBR (Figure IV). The presence of blue, orange, and yellow values in spatial clusters across the study area (see Figure III) indicates more complex spatial relationships than purely linear ones. This is observed in all of the LBR output maps. Furthermore, to confirm the presence of these relationships, the LBR categorical summaries (see Figure IV) count the number of features in the dataset that reflect each type of relationship. In each LBR summary, the majority of relationships are classified as concave, convex, or undefined complex, which justifies the transition to Geographically Weighted Regression, as it can better model these complex, non-linear relationships.

**IV. Local Bivariate Relationship Summaries** (Join Count vs. Explanatory Variables)

Geographically Weighted Regression (GWR)

**V. Distribution of GWR Standardized Residual Values**

Figure VI below illustrates the evident difference in the distribution of standardized residual values across the study area: areas with clusters of under-prediction in the GLR model show a balance of low and high standardized residual values in the GWR model. This is, in fact, observable across the study area. Figure V confirms this observation, as its count of standardized residual values closely follows a normal distribution, with the most common value being the mean of zero and only slight under-prediction.

**VI. GLR Standardized Residual Values** (left pane) vs. **GWR Standardized Residual Values** (right pane)

Figure VII below is the output raster produced by the GWR analysis, which illustrates the spatial relationship between the diversity index explanatory variable and the relative grocery store density dependent variable. The GWR tool produces a raster output for each explanatory variable, as well as for the intercept, which visualizes the spatial variation in the relationship between the dependent and explanatory variables in play. In this diversity index output raster, the area of highest spatial variation (the most positive red values neighboring the most negative green values) is centered around and to the west of downtown Los Angeles, whilst the rest of the study area remains fairly homogeneous. This pattern of the highest relational variation occurring around downtown Los Angeles is, to one degree or another, observable in all output rasters, with the primary difference between rasters being the magnitude of the peak values of each spatial relationship.

**VII. Example Diversity Index GWR Output Raster**

The GWR model's R-Squared value, which measures goodness of fit, indicates that the model explains 97% of the variation in the dependent variable across the study area, a drastic 65-percentage-point improvement over the GLR model. To confirm the GWR model's superiority over its GLR counterpart, we must look to the AICc diagnostic. The GWR AICc value of 34,000 is smaller than the GLR AICc value of 47,000, which confirms that the GWR model is a better fit for the grocery store data in play. Ultimately, this evidence suggests that the GWR model can successfully model the chosen metric of food access, grocery store density.
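To make the AICc comparison concrete, the sketch below computes the general corrected Akaike Information Criterion for a Gaussian model from its residual sum of squares. It is an assumption-laden illustration: the residual sums of squares, observation count, and parameter count are made up, and GWR implementations typically substitute an effective number of parameters derived from the local fits, so this will not reproduce the values ArcGIS Pro reports.

```python
import math


def gaussian_aicc(residual_sum_squares, n_obs, n_params):
    """AICc = AIC + 2k(k + 1) / (n - k - 1), with AIC = 2k - 2 * log-likelihood.

    Uses the maximized Gaussian log-likelihood, where the error variance
    is estimated as RSS / n.
    """
    sigma2 = residual_sum_squares / n_obs
    log_likelihood = -0.5 * n_obs * (math.log(2 * math.pi * sigma2) + 1)
    aic = 2 * n_params - 2 * log_likelihood
    return aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)


# Hypothetical models fit to the same 100 observations with the same parameter count.
aicc_worse_fit = gaussian_aicc(residual_sum_squares=100.0, n_obs=100, n_params=6)
aicc_better_fit = gaussian_aicc(residual_sum_squares=50.0, n_obs=100, n_params=6)
print(aicc_better_fit < aicc_worse_fit)
```

Only differences in AICc between models fit to the same dependent variable are meaningful; the lower value indicates the preferred model, and the gap between the two reported values here is large.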

Conclusion/Future Iterations

Despite the statistical success of the final GWR model for food access, this methodology was neither free of error nor without room for improvement. First, to improve my regression modeling in future iterations, I would look for a wider variety of demographic indicators to act as explanatory variables. Although the indicators I chose for this methodology were ultimately highly successful in predicting relative grocery store density within my 1-mile buffer radius, initial runs of the GWR analysis with smaller neighborhood sizes resulted in failed trials due to multicollinearity. Certain variable combinations, such as Median Household Income and Per Capita Income, could be the cause of this multicollinearity, and their redundancy would ultimately be detrimental to the accuracy of the Geographically Weighted Regression model. Furthermore, the only warning message received for my final GWR output was that one explanatory variable used in the model expressed enough linear relationships among the study area data points to not be useful for local weighted regression. Ultimately, updating the list of demographic indicators of inequality used as explanatory variables should prove useful for improving the GWR methodology.

Although refining my regression methodology would improve future iterations, the largest improvement would come from reworking how the dependent variable of food access is derived and expressed. This project defines food access as the density of grocery stores within a one-mile radius of each existing grocery store in the City of Los Angeles. A more accurate quantification of food access would reflect the qualities of low food access as defined by the USDA and would instead calculate the percentage of the city's population within a 1-mile radius of a grocery store that also had access to a vehicle. The resulting metric could then be graded to produce a food access statistic to be used as the dependent variable for future model iterations. Furthermore, this metric would account for the affordability and healthfulness of the products provided in each grocery store, as well as the relative wealth of each community in question. Although the model put together in this workflow properly explained the food access metric I produced, to maximize this model's accuracy I must produce dependent and explanatory variables that properly capture the information and metrics they are intended to portray. Finally, because these regression models are correlational, future iterations of this spatial question could turn to predictive modeling to help prevent food access inequality by identifying areas in the City of Los Angeles to prime for grocery store expansion.