Project 4: Spatial Machine Learning Workflow

SSCI - 575, Spatial Data Science

Asjad Asif Jah

November 28, 2021

Introduction

Problem

Median household income information is collected by the American Community Survey (ACS), conducted by the United States Census Bureau. This project evaluates the relationship between median household income per census tract and internet speed variables.

Research Question

The hypothesis tested in this project is that median household income per census tract in California correlates significantly with internet speed variables. If the hypothesis holds, internet speed variables can be used to predict the median household income per census tract.

Data

1. ACS Median Household Income Variables - Boundaries

The first dataset used was from the Living Atlas, which contains information regarding the median household income per census tract. This dataset was subset to the state of California, and only the Median Household Income in past 12 months (inflation-adjusted dollars to last year of 5-year range) variable was used. The following map shows the dataset.

ACS Median Income Data of CA by Census Tract -- Symbolized by the median household income

2. Ookla Global Fixed Network Performance Shapefiles

The second dataset used was from Ookla's open data initiative (https://www.ookla.com/ookla-for-good/open-data), which contains information regarding the speed tests performed by consumers worldwide. The dataset is available in shapefile format, and the results are represented as averages in zoom level 16 Web Mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). The dataset contains download speed, upload speed, latency, number of tests, and number of devices per tile. If no test has been done in an area, the tile is not included in the shapefile. The data is published quarterly, so I merged the data for the last four quarters, consisting of the last quarter of 2020 and the first three quarters of 2021. This dataset was also subset to the state of California, and three variables, avg_d_kbps, avg_u_kbps, and avg_lat_ms, were used. The following map shows the dataset.

Ookla Global Fixed Network Performance Data of CA merged for the past four quarters -- Symbolized by the average download speed
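The tile size quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the standard Web Mercator tiling scheme in which the world is split into 2^zoom tiles per axis; the function name and the 37 degree latitude (an illustrative value for California) are assumptions and not part of the project. It gives roughly 611 meters at the equator, in line with the approximate figure quoted above, and shows that a tile covers less ground at California's latitude.

```python
import math

# Sphere radius used by Web Mercator (EPSG:3857), in meters.
EARTH_RADIUS_M = 6378137.0


def tile_edge_m(zoom, latitude_deg=0.0):
    """Approximate ground length (m) of one tile edge at a given latitude."""
    circumference = 2 * math.pi * EARTH_RADIUS_M
    edge_at_equator = circumference / (2 ** zoom)
    # Web Mercator stretches features away from the equator, so the same
    # tile covers less ground by a factor of cos(latitude).
    return edge_at_equator * math.cos(math.radians(latitude_deg))


print(round(tile_edge_m(16), 1))        # tile edge at the equator
print(round(tile_edge_m(16, 37.0), 1))  # illustrative latitude for California
```

Because the ground size of a tile depends on latitude, the number of tiles that fall inside a census tract of a given area is somewhat higher in California than the equatorial figure would suggest.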

Approach

After merging, projecting, and clipping the Ookla shapefiles to California and clipping the median household income dataset to California, a spatial join was done on the median household income data and the Ookla data to bring all the data to the census tract level. The average upload speed, download speed, and latency of the Ookla tiles were added to the census tracts they intersect. Next, some exploratory data analysis was done by running local summary statistics to explore the local relationships of the variables. In the end, a combination of non-spatial and spatial machine learning models was run, and the results were compared.

Methods

Following is the flowchart of the complete workflow. The components of the analysis workflow are explained further down.

Before the analysis could begin, the data had to be brought into a suitable format, a step known as data wrangling. The ACS income data was already prepared: it was projected, and the income variable was at the census tract level. The Ookla data, however, is published per quarter, so the four quarterly shapefiles were merged into a single dataset covering the last year. Because the Ookla data was in a geographic coordinate system (GCS), it was then projected to the same coordinate system as the ACS income data. Both datasets were then clipped to California. To bring the Ookla data to the census tract level, a spatial join was done using the intersect option, so that each census tract received the values of every Ookla tile that intersects it. The merge rule for avg_d_kbps, avg_u_kbps, and avg_lat_ms was set to mean, and their data type was set to double. The merge rule for tests and devices was set to sum. Once the spatial join was completed, the dataset was ready for the machine learning models. The following map shows the spatially joined dataset.

ACS Income data joined with the Ookla data -- Symbolized by the average download speed
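The merge rules described above can be illustrated without any GIS software. This is a minimal sketch in plain Python, not the spatial join tool itself: the tile values, tract IDs, and the list of tile-to-tract matches are all hypothetical, and the intersect test is assumed to have been done already. Speeds and latency are averaged per tract, while tests and devices are summed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tile records (values are made up for illustration).
tiles = {
    "t1": {"avg_d_kbps": 100000, "avg_u_kbps": 10000, "avg_lat_ms": 12, "tests": 5, "devices": 3},
    "t2": {"avg_d_kbps": 200000, "avg_u_kbps": 20000, "avg_lat_ms": 8, "tests": 10, "devices": 4},
    "t3": {"avg_d_kbps": 50000, "avg_u_kbps": 5000, "avg_lat_ms": 30, "tests": 2, "devices": 1},
}

# Hypothetical intersect matches: tile t2 straddles both tracts.
matches = [("tract_A", "t1"), ("tract_A", "t2"), ("tract_B", "t2"), ("tract_B", "t3")]

MEAN_FIELDS = ("avg_d_kbps", "avg_u_kbps", "avg_lat_ms")  # merge rule: mean
SUM_FIELDS = ("tests", "devices")                         # merge rule: sum


def join_tiles_to_tracts(matches, tiles):
    """Aggregate tile attributes to tracts using the mean and sum merge rules."""
    grouped = defaultdict(list)
    for tract_id, tile_id in matches:
        grouped[tract_id].append(tiles[tile_id])

    joined = {}
    for tract_id, records in grouped.items():
        # Means are stored as floats, mirroring the double data type.
        row = {f: float(mean(r[f] for r in records)) for f in MEAN_FIELDS}
        row.update({f: sum(r[f] for r in records) for f in SUM_FIELDS})
        joined[tract_id] = row
    return joined


joined = join_tiles_to_tracts(matches, tiles)
for tract_id, row in joined.items():
    print(tract_id, row)
```

Note that tile t2 contributes to both tracts, so its tests and devices are counted twice across the joined table. This is a direct consequence of the intersect option.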

Now that the data was prepared, I did some exploratory data analysis by running Optimized Hot Spot Analysis on the dependent variable and on each predictor variable to see if there were any clusters of high and low values. First, I ran the hot spot analysis on the median income data. It showed hot spots in the Bay Area and Los Angeles and cold spots in the Central Valley, which was expected. The result of the hot spot analysis is shown in the map below.

Result of the Optimized HotSpot Analysis for the median income variable

Next, I ran the hot spot analysis on the internet download speed, which showed similar hot spots in the Bay Area and Los Angeles and cold spots in the rural parts of CA, which was expected. The result of the hot spot analysis is shown in the map below.

Result of the Optimized HotSpot Analysis for the download speed variable

Next, I ran the hot spot analysis on the internet upload speed, which showed results similar to those for the download speed, although the cold spots were less widespread across CA. The result of the hot spot analysis is shown in the map below.

Result of the Optimized HotSpot Analysis for the upload speed variable

Lastly, I ran the hot spot analysis on the internet latency. Because higher latency indicates lower-quality internet, the pattern is reversed: the hot spots are in the rural parts of CA, and the cold spots are in the Bay Area and Los Angeles. The result of the hot spot analysis is shown in the map below.

Result of the Optimized HotSpot Analysis for the latency variable
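Optimized Hot Spot Analysis is built on the Getis-Ord Gi* statistic, which compares the sum of values in a feature's neighborhood with what would be expected from the global mean. The following is a minimal sketch of that statistic with binary weights, not the tool itself: the tool also chooses the analysis scale and corrects for multiple testing, both omitted here. The values and the neighbor lists are hypothetical.

```python
import math


def gi_star(values, neighbors):
    """Return a Getis-Ord Gi* z-score for every feature.

    values:    one attribute value per feature.
    neighbors: neighbors[i] lists the indices near feature i (binary
               weights). Feature i is always added to its own neighborhood.
    A neighborhood must not contain every feature, otherwise the
    denominator is zero.
    """
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - x_bar ** 2)

    scores = []
    for i in range(n):
        hood = set(neighbors[i]) | {i}
        w_sum = len(hood)  # with binary weights, sum(w) == sum(w^2)
        local_sum = sum(values[j] for j in hood)
        numerator = local_sum - x_bar * w_sum
        denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Hypothetical example: six features along a line, high values at one end.
values = [10, 9, 8, 2, 1, 1]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

z_scores = gi_star(values, neighbors)
print([round(z, 2) for z in z_scores])
```

Large positive z-scores mark hot spots (clusters of high values) and large negative z-scores mark cold spots, which is how the maps above are symbolized.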

After the exploratory data analysis, it was evident that there was spatial autocorrelation in the data, i.e., high and low values were clustered and showed similar patterns across variables. The first machine learning model I ran was Generalized Linear Regression (GLR), a non-spatial supervised learning model. I ran GLR first because it is the simplest model, and if a problem can be solved with a simple model, there is no need for a more complex one. Before GLR can be run, it is necessary to check that there is no multicollinearity among the predictor variables. I checked the correlations and found no significant relationship among the three Ookla predictor variables, so all of them could be used. The following scatter plot matrix shows this:

Scatter plot matrix showing correlation among the predictor variables
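The same check can be expressed numerically as pairwise Pearson correlations between the predictors. This is a minimal sketch, assuming each predictor is available as a list with one value per census tract; the sample values below are hypothetical and are not taken from the project data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_pairs(columns):
    """Return {(name_a, name_b): r} for every pair of columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical per-tract values for the three predictors.
predictors = {
    "avg_d_kbps": [100000, 150000, 200000, 250000, 300000],
    "avg_u_kbps": [20000, 10000, 30000, 15000, 25000],
    "avg_lat_ms": [30, 12, 25, 10, 20],
}

for pair, r in correlation_pairs(predictors).items():
    print(pair, round(r, 3))
```

Coefficients close to +1 or -1 between two predictors would suggest that one of them is redundant and should be dropped before fitting the regression.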

I modeled the relationship between median income and internet speed variables with a supervised learning method because the ground-truth values (median income per census tract) are available. The task of the model is to learn how the internet speed variables relate to the outcome variable, i.e., the median income. The model's performance is assessed using the Adjusted R-Squared value, which tells us how much of the variance in median income is explained by the model.
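For reference, R-Squared and Adjusted R-Squared can be computed from observed and predicted values as follows. This is a minimal sketch of the standard formulas; the income values and predictions are hypothetical, and the function names are my own.

```python
def r_squared(observed, predicted):
    """Share of the variance in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-Squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical median incomes and model predictions for five tracts.
observed = [40000, 55000, 70000, 90000, 120000]
predicted = [50000, 52000, 75000, 85000, 110000]

r2 = r_squared(observed, predicted)
print(round(r2, 4))
print(round(adjusted_r_squared(r2, n_obs=len(observed), n_predictors=3), 4))
```

With thousands of census tracts and only three predictors, the adjustment is small, so the two values are nearly identical in practice.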

After running GLR, I ran the Geographically Weighted Regression (GWR) model, which is also a supervised machine learning method but, unlike GLR, is spatially explicit. The decision to run GWR was made after reviewing the results of the GLR model, which are discussed in the results section. The results of GWR are also discussed in the results section.
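The idea behind GWR is to fit a separate regression at every location, weighting nearby observations more heavily than distant ones. The following is a minimal sketch of that idea with a single predictor and a fixed Gaussian kernel. It is not the GWR tool used in this project: the kernel, the bandwidth, the data, and the function name are all assumptions made for illustration.

```python
import math


def gwr_local_fits(coords, x, y, bandwidth):
    """Fit y = intercept + slope * x separately at every location.

    Each local fit is a weighted least-squares regression in which nearby
    observations get more weight (fixed Gaussian kernel).
    """
    fits = []
    for cx, cy in coords:
        weights = [
            math.exp(-0.5 * (math.hypot(px - cx, py - cy) / bandwidth) ** 2)
            for px, py in coords
        ]
        w_sum = sum(weights)
        x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
        y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
        s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
        s_xy = sum(
            w * (xi - x_bar) * (yi - y_bar)
            for w, xi, yi in zip(weights, x, y)
        )
        slope = s_xy / s_xx
        fits.append((y_bar - slope * x_bar, slope))
    return fits


# Hypothetical data: ten locations on a line; the true slope is 1 in the
# western half and 3 in the eastern half.
coords = [(float(i), 0.0) for i in range(10)]
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5, 3, 6, 9, 12, 15]

fits = gwr_local_fits(coords, x, y, bandwidth=1.5)
print(round(fits[0][1], 2), round(fits[9][1], 2))  # local slopes at the two ends
```

A single global regression would report one compromise slope for this data, while the local fits recover a different slope at each end of the study area. This is the kind of spatially varying relationship that GWR is designed to capture.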

Finally, I ran the Forest-based Classification and Regression (FBCR) model, which is a non-spatial supervised machine learning model. The decision to run FBCR was made after seeing the good R-Squared value of the GWR model. This is discussed in the results section, along with the results of FBCR, which conclude the analysis.

Results

The result of the first model, Generalized Linear Regression (GLR), is shown in the map below:

Result of the Generalized Linear Regression (GLR) model

On the map, we can see many dark green census tracts, where the standardized residual is greater than 2.5 standard deviations. This means that the model underestimates the median income in many census tracts, i.e., the model estimates the median income to be lower than the actual value. It is also worth taking a look at the diagnostics returned by the model. They are shown below:

Diagnostics returned by the Generalized Linear Regression (GLR) model

First, looking at the Adjusted R-Squared value, we can clearly see that the model is not doing well. R-Squared quantifies how much of the total variation in the actual data is explained by the regression model. In this result, a value of 0.016871 shows that only about 1.7% of the variation in the actual data is explained by the regression model, which is not a good score.

Next, let us take a look at the Joint F-Statistic, a hypothesis test of whether the regression model coefficients are jointly significant, which tells us whether the model as a whole is significant. In this result, a P-value of 0 indicates that the null hypothesis can be rejected and the regression model coefficients are significant.

Next, let us look at the Joint Wald Statistic, which also tests the overall significance of the model and remains reliable when the Koenker (BP) Statistic is significant, as it is here. In this result, a P-value of 0 indicates that the null hypothesis can be rejected, so the predictor variables are jointly significant and were kept.

Next, let us take a look at the Koenker (BP) Statistic, a hypothesis test of whether the relationship between the predictors and the dependent variable is consistent across the study area, or in other words, whether the variance of the errors (residuals) is constant. In this result, a P-value of 0 indicates that the null hypothesis can be rejected, so the relationship is not stationary. Since this data is spatial, this hints that the relationship varies from place to place, which cannot be modeled with a generalized linear regression model and would require a spatially explicit model.

Lastly, let us take a look at the Jarque-Bera Statistic, which tests whether the residuals are normally distributed by checking the skewness and kurtosis of the residual distribution. Non-normal residuals indicate that the model is biased towards either underestimation or overestimation, neither of which is desirable. In this result, a P-value of 0 indicates that the null hypothesis can be rejected, so the residuals are not normally distributed. Since this data is spatial, one possible cause is spatial structure in the data that a generalized linear regression model cannot capture, which again points to a spatially explicit model.
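The Jarque-Bera statistic combines the skewness and the excess kurtosis of the residuals into a single number, JB = n/6 * (S^2 + K^2/4), which is compared against a chi-square distribution with two degrees of freedom. The following is a minimal sketch of that calculation; the residual values are hypothetical, and the chi-square approximation is only reliable for large samples such as the thousands of census tracts used here.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, approximate p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n

    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0

    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # For a chi-square distribution with 2 degrees of freedom, the
    # upper-tail probability is exactly exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical residuals: a symmetric set and a strongly skewed set.
symmetric = [-2, -1, 0, 1, 2]
skewed = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]

jb_sym, p_sym = jarque_bera(symmetric)
jb_skew, p_skew = jarque_bera(skewed)
print(round(jb_sym, 4), round(p_sym, 4))
print(round(jb_skew, 2), p_skew)
```

A very small p-value, as in the skewed example and in the GLR diagnostics above, means the residuals depart clearly from a normal distribution.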

Given these GLR results, I next ran the Geographically Weighted Regression (GWR) model. The result of GWR is shown in the map below:

Result of the Geographically Weighted Regression (GWR) model

Compared to the result of GLR, we can see that there are not as many dark green census tracts, showing GWR is doing better than GLR. Let us take a look at the diagnostics returned by GWR. They are shown below:

Diagnostics returned by the Geographically Weighted Regression (GWR) model

Looking at the Adjusted R-Squared value, we can see that GWR is performing much better than GLR. In this result, a value of 0.6789 shows that almost 68% of the variation in the actual data is explained by the GWR model, which is a good score compared to GLR. However, it also indicates that the relationship is not perfectly linear even at the local level, so I decided to explore another model.

As the last model, I ran the Forest-based Classification and Regression (FBCR). The result of FBCR is shown in the map below:

Result of the Forest-based Classification and Regression (FBCR) model

Compared to the result of GWR, the result of FBCR does not look better. Let us take a look at the diagnostics returned by FBCR. They are shown below:

Diagnostics returned by the Forest-based Classification and Regression (FBCR) model

Looking at the R-Squared values, we can see that FBCR is performing worse. The R-Squared value for the training data is 0.91, but for the validation data it is 0.04, which is poor. Such a large gap indicates that the model overfits the training data and does not generalize to unseen census tracts. So, it is evident that FBCR cannot be used to describe the relationship.

Discussion

The results of GLR and FBCR show that the internet speed variables explain very little of the variation in median income. Although the R-Squared value of the spatially explicit model, GWR, was considerably higher than those of the non-spatial models, it is still not enough to establish a strong linear relationship between the variables, even locally. However, there is a weak relationship, and a prediction based on it would likely be better than a random guess. I tried FBCR to see if a non-linear model could explain the relationship, but its validation result was worse, which could be due to the small sample size of the internet variables.

Keeping the above in view, the answer to my research question is that there is no significant correlation between median household income per census tract in California and internet speed variables. Hence, internet speed variables cannot be used to predict the median household income per census tract.

The lack of internet data in many places may account for some of the shortcomings of this approach. With few speed tests in a census tract, the aggregated internet speed values may not be representative, which can throw off the results. Another possible issue is the resolution of the Ookla data. As mentioned in the data section, the internet speed variables are aggregated to zoom level 16 Web Mercator tiles. When this data was joined with the census tracts, there were many places where the same tile, and therefore the same internet speed values, contributed to multiple census tracts. This issue is shown in the image below:

Issues with resolution of Ookla data -- The black lines are the outlines of the zoom level 16 Web Mercator tiles where the internet speed variables are aggregated, and the purple polygons are census tracts near downtown LA

It is difficult to say exactly how these shortcomings can be alleviated. However, since the hot spot analysis showed clear spatial autocorrelation in the variables, it is possible that a more complex model could explain the relationship. Another option would be neural networks; although they are non-spatial models, they can be useful in modeling complex relationships.

Acknowledgments

This Story Map was created to satisfy the project requirements of SSCI 575: Spatial Data Science, taught by Prof. Orhun Aydin.