That Sinking Feeling

Land Subsidence and Groundwater in the San Joaquin Basin, CA


Introduction

Groundwater is an important resource in the San Joaquin Valley (i.e. Central Valley), California. The unique thing about groundwater, outside of it be a relatively abundant resource, is that it has the tendency to hold grains of rock and sediment part, making the ground elevation inflate. The extraction of water (and other fluids) from the subsurface eliminates the holding of such pore spaces open. And since air, but not water, can be compressed, this results in the rock grains and sediment moving closer together. Collapsing such pore space results in the ground above sinking. This is known as land subsidence.

 In California, large areas of land subsidence are well documented in the first half of the 20th century. Most of California’s subsidence is a result of excessive groundwater extraction. Completion of California's State and Federal water projects that bring water from California's wet north to its dry south allowed some groundwater aquifers to recover, and subsidence decreased in these areas. Subsidence, particularly in the San Joaquin Valley, continues today at nearly historically high rates of more than 1 ft/yr (USGS 2021).

So what are some of the explanatory variables that may link groundwater depth to subsidence? This study aims to establish such associations. But before looking at the variables of this study, we need to understand some of the physical phenomena associated with land subsidence.


Figure 1. Blue represents water within the pore space of the earth material (e.g. rock, sediment, or soil). A) Pore space shown in a well sorted earth material. B) Pore space in a poorly sorted earth material. C) Pore space in a poorly sorted earth material with natural mineral precipitation blocking the pore space flow paths.

The percentage of pore (open) space in rock, sediment, and soil is called porosity. Porosity is dependent on the size and shape of the grains, how well they are sorted, and how tightly they are packed. Porosity determines how much groundwater can be stored within the rock, sediment, or soil as shown in Figure 1.

The ground sinks when water is pumped from the pore space faster than natural recharge can replace it. In particular, this occurs when sediment is unconsolidated and has high clay content. Figure 2 exhibits land subsidence between 1965 and 2017 near Merced, CA.

California's San Joaquin Basin has an annual overdraft of ~1.5 million acre-feet (PPIC 2020).


Figure 2. Photos courtesy United States Geological Survey (USGS). Photos were taken just south of Merced in December 2017 show how land has dropped 8.6 feet between 1965 and 2016. The rate of subsidence was even greater from 1988 to 2016 when 6.2 feet of the loss in elevation occurred.

Study Area

Figure 3. San Joaquin Valley groundwater basins that are at the heart of this study. Note that these are the key areas in the Valley with land subsidence associated with groundwater extraction.

The San Joaquin Basin is a Mediterranean environment with long, hot, and dry summers and wet winters. It lies north of the Transverse Range and south of the Sacramento Valley. The San Joaquin Basin produces close to 13% of the United States’ agricultural products which include grapes, raisins, cotton, almonds, citrus, and a plethora of vegetables (CDFA 2010). Each of these crops are water intensive, so pumping groundwater in the Central Valley to water crops is not uncommon. The image to the left shows the approximate location of the San Joaquin Valley and the subbasins which are the areas of interest (AOI) for this study.

Data Table

Table 1 outlines the datasets utilized, inputs, criterion each was used for, and the source of the datasets. Most of these datasets have a genesis from the California Sustainable Groundwater Management Act (SGMA) Data Viewer. In the instance of this study, most data were combined via spatial joins in ArcGIS Pro for subsequent processing using R statistical libraries.

Table 1. Data table containing columns for dataset names, criterion these datasets were used for, and the name of the agency/source each dataset was derived from.


Methodology

The Kolmogorov-Smirnov test (KS test) was used to test for normality associated with the subsidence data in the San Joaquin Valley. Note that the p-value is less than 0.05 implying the data are not normally distributed and require a transformation before proceeding with further data analyses. The KS test was run for subsidence, groundwater depth, and area (the key dependent and independent variables in this study).

Upon assessment of the distribution of the key variables, exploratory data analysis (EDA) was conducted to further assess the condition of the data and their implication on subsidence. This includes a scatterplot of subsidence over groundwater, a scatterplot of well total depth over groundwater depth, boxplots of groundwater depth and subsidence per subbasin, and even a 3D plot of groundwater depth throughout the San Joaquin Basin.

Exploratory spatial data analysis (ESDA) was then performed on the same datasets. This included Neighbor Relationship Exploration, Moran's I for Spatial Autocorrelation, apply spatial weights for Moran's I, and a hypothesis test to assess if subsidence rates are randomly distributed. These were conducted at the well level and at the subbasin level, afterwhich a Monte Carlo Simulation was conducted and a Moran's I Plot generated (Wang et al 2022).

Chu et al (2020) and others (Faunt et al 2016 and Poland 1972) have all suggested that a root-mean-square-errors (RMSEs) can be used to estimate non-linear and spatially varying subsidence. Such an attempt was on the focus of this study, but rather, an attempt at fitting a linear regression model was made by EDA, ESDA, and the fitting of residuals through Spatial Lag and Spatial Error Models.


Results and Discussion

This study has assessed the use of SLM and SEM in relation to residuals for datasets related to landsubsidence in the Central Valley, CA. Groundwater depth, well depth, and basin area are all independent variables that have impact on the dependent variable of land subsidence. When looking at the influence of spatial lag and spatial error, both are present across the variables. This implies that several things need to be executed to handle the introduction of biases due to spatial autocorrelation within this study. While the overall performance of each spatial model does not vary much from the previous non-spatial model, the results do show further evidence of both spatial lag and spatial error biases introduced into the study. An additional thing to note about non-spatial vs. spatial model assessment is that when incorporated SEM and SLM, even the values associated with the Global Moran Statistic greatly improved, but still demonstrated strong positive values alluding to positive spatial autocorrelation.  

Exploratory Data Analysis

ArcGIS Web Application


Exploration of the distribution of the data and variables was done with descriptive statistics. This provided insight to the sample size, minimum, mean, median, maximum, and standard deviation. The outputs resulted in scatterplots, histograms, and boxplots for the primary variables of the study.

Figures 4, 5, and 6 all show EDA results from the datasets utilized in this study. Figure 4 is a great graphic on how groundwater depth and total well depth track eachother. The central tendancy for these data tends to be around 350' below ground surface (bgs). The central tendancy of the total well depth tends to be around 680'.

Figure 4. Scatterplot of total well depth over groundwater depth. Based on this plot, there are several points that are outliers. Note how groundwater depth and total depth of the wellbore tend to be concentrated between 0-500’ (groundwater depth) and 0-1500’ (total depth). This connection implies that most groundwater extraction is relatively shallow and thus subsidence may be more prominent at shallow depths. It is also cost effective to drill shallower wells.

KS Test for Normality

The Kolmogorov-Smirnov test (KS test) was used to test for normality associated with the subsidence data in the San Joaquin Valley. Note that the p-value is less than 0.05 implying the data are not normally distributed and require a transformation before proceeding with further data analyses. Table 2 outlines KS test results for subsidence, groundwater depth, and area (the key dependent and independent variables in this study).

Table 2. KS test for normality associated with the subsidence data in the San Joaquin Valley. Note that the p-value is less than 0.05 implying the data are not normally distributed and require a transformation before proceeding with further data analyses.

Figure 5. (left) Histogram of subsidence values in the San Joaquin Basin. Note the right skewed tendency that alludes to a non-normal distribution. (center) Histogram of groundwater depth data in the San Joaquin Basin. Again, a slight left skew exists for these data and ultimately required a log transformation for normalization. (right) Histogram of well depth in the San Jaquin Basin. Note the left skew and non-normal distribution of these well depths.

Figure 6. (left) Boxplot of groundwater depth data per subbasin within the San Joaquin Valley. Note how the Panoche Valley subbasin’s top of groundwater is much deeper than the rest of the larger basin. This is likely attributed to is topographic location in the Diablo Range, as well as the fact that most geologic strata dip into the valley (i.e. dip SE. (right) Boxplot of subsidence data per subbasin with the San Joaquin Valley. Note how the Tule and Kaweah subbasins have a larger range of subsidence. Most subsidence values for the San Joaquin Valley appear to be between -0.5 and 0.25, where are likely linked to subsidence from groundwater extraction and uplift from groundwater recharge and plate tectonics.

Figure 7. (left) Histogram of log transformed groundwater depth data in the San Joaquin Valley. A log transformation made these data have a central tendency to be around 5.75 while still being heavily tailed to the left. (right) Histogram of log transfomred maximum subsdience between 2010 - 2021. The resulting distributions shows that even a standard log transform does not yield a normal distribution. Further work needs to be done to allow further work with non-linear, non-normally distributed data.

Neighborhood Relationship Exploration and Moran's I

Table 3. Neighborhood assessment for subbasins in the study's AOI.

An output of all contiguous polygons that share just edges (as based on rook’s case) yielded Table 3. This ultimately lets one know the number of neighboring polygons. This level of aggregation greatly limits n in the study--even to the point that the number of values to assess is not statistically significant.

Table 4. Neighborhood assessment for Thiessen polygons in the study's AOI.

In the end this was also assessed for each Thiessen ploygon within the dataset (~2914 wells vs. 11 subbains). The neighborhood relationship for each well's Thiessen polyon is shown in Table 4. Figure 8 shows the neighborhood relationship among individual wells through the San Joaquin Basin. The average number of links greatly limits the neighborhood association at this scale of assessment.

Figure 8. The black dots represent wells within their respective subbasins and the red lines denote the neighbors of each well. When viewed as this level, it is easier to see that some wells have neighbors that are several 1000’s of feet away and show the spatial continuity (spatial autocorrelation even) of these sandy-gravel aquifers. Aggregating these data at a basin level makes it more difficult to observe the interconnection of each well within the aquifer.

Moran's I

Here the data tend to plot in Quadrant I, as displayed in Figure 9. This implies high values associated with high values (or a high-high scenario). At the same time, three different groupings may be observed: one that is in the range of 0-500’, another that is between 1000-1500’, and one that is >1500’. In looking at this, and being familiar with the San Joaquin Basin, one might infer that the larger values are once again associated with the Panoche Valley subbasin. This may also be wherein large averages tend to pull the surrounding neighbors up. It may also be noted that those very few values that plot in Quadrant II are likely outliers that result from the average of neighbors being much higher and cause some spatial randomness therein (again, this is in essence ssociated with the spatial lag).

Figure 9. Moran plot of groundwater depth per well's Thiessen polygon in the San Joaquin Basin.

When a Global Moran's I statistic is deteremined for land subsidence per subbasin, Figure 10 is created. Here the data tend to plot in Quadrant III, which would imply spatially clustered low values associated with similarly low values (or a low-low scenario). One might want to be careful in interpreting this plot though, as negative (or low values) imply larger amounts of subsidence. Those values in Quadrant I (high-high) are closer to being outliers as there are few subbasins that have seepage at the surface of their groundwater aquifers. However, the values are rather dispersed noting a range of subsidence but in clusters. This too makes sense as groundwater extraction tends to occur inside sandy-gravel aquifers within the San Joaquin Valley. The spatial extent of these aquifers have more or less been mapped and are targets for drilling more groundwater production wells. What does this result in? Further subsidence and a drive to push more values to be plotted in Quadrant III.

Figure 10. Moran plot of subsidence per subbasin in the San Joaquin Basin.

Table 5 shows the Moran I statistic on a per well basin throughout the basin. With a statistically significant p-value, these data are spatially random but with a positive Moran I stastistic, also demonstrates strong positive spatial autocorrelation. In this case, similar values are clustered close together. This too makes geological sense. Similar water depths in a region will be clustered as will vertical changes due to subsidence.

Table 5. Output of Moran’s I test under normality. Here the Moran’s I statistic is positive and suggests that the data (in this case the wells and the basins that they reside in) are spatially clustered. Had the Moran’s I statistic been negative, then there would have been competitive clustering (wherein values repel each other).

Monte Carlo Simulation

With the results from the Global Moran's I run, a Monte Carlo simulation approach yielded varying results. Figure 9 is taking a high-level look at aggregated data and has summed up the amount of subsidence between 2010 and 2021. This results in a “maximum subsidence” map for the same select subbasins. However, this does not come without risk. To further demonstrate the dangers of aggerating these data, Figure 5 shows three snapshots of the permutations when the number of simulations was set to 999.

Figure 9. (left) Observed subbasins as they undergo random testing associated with Monte Carlo Simulation. Again, the changes from one moment to the next, although random, show how drastically values can change for subsidence when aggregated at the subbasin level. (right) Density plot of permutations at the subbasin level. This is associated with subsidence values averaged within the subbasins. Note that this plot implies a negative Moran’s I statistic which suggests a dispersed pattern or random pattern. In this case, the Moran’s statistic was both negative and the values negative (as subsidence is).

Each of the figures in this section have specifics mentioned about them in their accompany text boxes; however, the underlying message is clear: there is spatial clustering at the well level, as designated by the Moran’s statistic as well as the associated p-values. At the subbasin level, observing spatial clustering and accompanying neighbors can be much more difficult. This is a result of MAUP and needs to be addressed before a basin-level study can be improved.

Table 5. Monte-Carlo Simulation of Moran I results. The low p-value shows positive spatial autocorrelation that tends to be spatially random.

Output of Moran’s I test under Monte Carlo Simulation. Here the Moran’s I statistic is even more positive and comes close to being spatially random. Which also makes sense given that each basin may extract a certain amount of water from the subsurface in efforts to constrain land subsidence.

OLS Model

There were three variables that could possibly have impact on this study’s regression model: groundwater depth, well (total) depth, and area (in square miles). These all have some potential impact on making subsidence and soil/sediment compaction calculations. The results of regression model fitting of each of these variables is shown in Table 6.

Table 6. Regression coefficients associated with groundwater depth, well depth, and area as they may relate to land subsidence. Here it would appear that all three variables are impactful when it comes to subsidence. Groundwater depth may have the least impact, but still has a p-value < 0.05 making it statistically significant and should be kept as one of the regression coefficients.

Assessment of Residuals

Both the graphical assessment of the residuals (Figure 12) as well as the results from the Brush-Pagan test (Table 7) demonstrate that there is a significant level of heteroscedasticity. In the case of the BPTest, the p-value is less than the chose significant level of 0.05. As shown on the right side of Figure 12, the data are not completely random as they do not demonstrate an equal distribution of points through the range of X values on the plot. Additionally, the “best-fit” line is non-linear (i.e. it shows character as it attempts to fit the residuals). Ultimately, these results show that we are dealing with non-normal data that cannot fit a linear trend.

Figure 12. (right) Residuals vs Fitted Values plot. This plot demonstrates the relationship between residual values and estimated responses (or fitted values) and was used to assess non-linearity, unequal error variances, and outliers within the dataset. (left) Residual regression model diagnostics demonstrating fit. This model uses all three variables in assessing land subsidence. The Q-Q plot shows that this particular regression model still underpredicts values, but it does better at forming closure in the lower quantiles (i.e. the regression line is closer to fitting the observed data).

Based on the p-value of the BP test, there is apparent heteroscedasticity. Heteroscedasticity often occurs when there is a large difference among the sizes of the observations. In this case, water depth and subsidence levels. As one considers how to adjust, transform, and rectify issues of heteroscedasticity using mean values associated with the residuals. However, this is seldom encouraged and is often thought of as a last result when all other transformations fail. The thought here would be to next try something like a Box-Cox transformation whereby variables are made to approximate a Gaussian Distribution (Sakia 1992).

Table 7. Showing that heteroscedasticity is present in the dataset.

SEM or SLM - that is the question!

In the case of the spatial lag model, Table 9 shows the model performance. Once again two of the three variables are statistically significant. The AIC value is higher than the SEM, which implies that SEM is the better of the two models (albeit both are related to each other). The SLM model was run in conjunction with the Global Moran’s I by which the data show spatial autocorrelation in the dependent variable. The spatial lag model must then be used to assess such autocorrelation while also lending insight to fitting an OLS or MLE model without biases. As such, coefficient sizes may not be properly represented in the model and the influence of errors may be greatly underestimated (Mitchell and Griffin 2021).

The lag value within this model, represented by Rho, shows statistical significance across all tests. This would imply that the spatial lag is not accounted for through the groundwater depth, well depth, and area variables of the model. This means that a spatial lag on the dependent variables is in fact needed. Thus, the spatial error model is needed to help control spatial autocorrelation. The assumption of spatial dependence is confirmed (and also well explained within the realm of geology, but again, that is not the focus of this particular study).

In the case of the spatial error model it is able to calculate, or estimate, a unique value for every applied parameter within the model (Chi and Zhu 2019). It should also be noted that groundwater depth and area tend to be statistically significant and likely have spatial dependence throughout the San Joaquin Basin. Meanwhile, well depth is not statistically significant implying that there is no spatial dependence and is random in the context of the spatial error model. As this study does not look at which portion of the Tulare Formation (i.e. upper or lower Tulare; which is the difference between higher TDS in the upper Tulare and lower TDS, or fresher water, in the lower Tulare) each water production well is completed in, it makes sense that well depth does not have a direct influence on this model and remains spatially random. Table 6 (left) shows the results from SEM.

The last thing to note is that the Lambda value of the SEM is positive and statistically significant. This also implies that there is a need to control for spatial autocorrelation within the model and alludes to spatial dependence (Chi and Zhu 2019). Again, geology and completion depth may be able to explain this but were not within the current scope of this study.

Table 8 (left) Output of the SEM model. Herein the three variables of well depth, groundwater depth (GSE), and area exhibit both statistical significance and the lack thereof espectively. Also noted are the AIC values, that are relatively low, and Lambda values. Both implying the need to control for spatial autocorrelation. Table 9 (right) Output of the SLM model. Herein the three variables of well depth, groundwater depth (GSE), and area exhibit both statistical significance and the lack thereof respectively. It is interesting to see how with the SLM groundwater depth tends to be statistically in significant.

Conclusions

This study intended to find predictors for land subsidence among groundwater depth, well depth, and area. While the regression model generated in this study shows a low adjusted R squared value, the SEM models show an improvementon the standard regression model. Groundwater depth, well depth, and area are all independent variables that have impact on the dependent variable of land subsidence. When looking at the influence of spatial error across variables, there is the need to handle the introduction of biases due to spatial autocorrelation within this study. . An additional thing to note about non-spatial vs. spatial model assessment (or SEM) is that when incorporating SEM, even the values associated with the Global Moran Statistic greatly improved, but still demonstrated strong positive values alluding to positive spatial autocorrelation. Such a correlation makes sense as deeper wells and higher drawdown are linked, resulting in greater land subsidence in a given area. Ultimately, this means the variables identified in this study do infact have an impact in predicting land subsidence.

Limitations

Until spatial autocorrelation better accounted for and the spatially dependent variables are better transformed (perhaps through a Box-Cox transformation) both non-spatial and spatial models will likely continue to perform the same. The final step that will greatly help in better handling spatial autocorrelation, and what was hinted at throughout this study, would be to introduce several additional variables including surface water infiltration index (for aquifer and pore space recharge), but also available pore volumes (as these are directly tied to the area but also would add a volumetric component to the models), and finally, screened intervals between the upper and lower Tulare Formation. Each of these could assist in better model fitting and prediction for spatial autoregressive models associated with land subsidence.

References

California Department of Food and Agriculture (CDFA). 2010. “California Agriculture Statistical Report 2008-2009.”

Chi, G. and Zhu, J. 2019. Spatial Regression Models for Social Sciences. Thousand Oaks, CA: SAGE Publications

Chu, H. J., Ali, M.Z., and Burbey, T. 2020. Development of groundwater-drawdown function to estimate spatially varying land subsidence: a case study of the Choshui River basin, Taiwan. Journal of Hydrology: Regional Studies. Vol. 35, 100808

Faunt, C.C., Sneed, M., Traum, J., and Brandt, J.T., 2016. Water availability and land subsidence in the Central Valley, California, USA. Hydrogeol. J. 24, 675-68

Mitchell, A. The ESRI Guide to GIS Analysis, Volume 1. 2nd Ed. ESRI Press, 2020 Mitchell, A. and Griffin, L.S. The ESRI Guide to GIS Analysis, Volume 2. 2nd Ed. ESRI Press, 2021

Poland, J.F., 1972. Land subsidence in the western states due to ground-water overdraft. Water Resource. Bulletin. 8, 118-13

Sakia, R. M. 1992. "The Box–Cox transformation technique: a review", The Statistician, 41 (2): 169–178

Sustainable Groundwater Management Act (SGMA) Data Viewer. Accessed March 5, 2022: https://sgma.water.ca.gov/webgis/?appid=SGMADataViewer#currentconditions

USGS. 2022. “Areas of Land Subsidence in California.” Accessed March 5, 2022: https://ca.water.usgs.gov/land_subsidence/california-subsidence-areas.html

Wang, Z., Liu, Y., Zhang, Y., Lui, Y.F., Wang, B., and Zhang, G. 2022. Spatially varying relationships between land subsidence and urbanization: a case study in Wuhan, China. Remote Sensing, 2022, 14, 92

Figure 1. Blue represents water within the pore space of the earth material (e.g. rock, sediment, or soil). A) Pore space shown in a well sorted earth material. B) Pore space in a poorly sorted earth material. C) Pore space in a poorly sorted earth material with natural mineral precipitation blocking the pore space flow paths.

Figure 2. Photos courtesy United States Geological Survey (USGS). Photos were taken just south of Merced in December 2017 show how land has dropped 8.6 feet between 1965 and 2016. The rate of subsidence was even greater from 1988 to 2016 when 6.2 feet of the loss in elevation occurred.

Figure 3. San Joaquin Valley groundwater basins that are at the heart of this study. Note that these are the key areas in the Valley with land subsidence associated with groundwater extraction.

Table 1. Data table containing columns for dataset names, criterion these datasets were used for, and the name of the agency/source each dataset was derived from.

Figure 4. Scatterplot of total well depth over groundwater depth. Based on this plot, there are several points that are outliers. Note how groundwater depth and total depth of the wellbore tend to be concentrated between 0-500’ (groundwater depth) and 0-1500’ (total depth). This connection implies that most groundwater extraction is relatively shallow and thus subsidence may be more prominent at shallow depths. It is also cost effective to drill shallower wells.

Table 2. KS test for normality associated with the subsidence data in the San Joaquin Valley. Note that the p-value is less than 0.05 implying the data are not normally distributed and require a transformation before proceeding with further data analyses.

Table 3. Neighborhood assessment for subbasins in the study's AOI.

Table 4. Neighborhood assessment for Thiessen polygons in the study's AOI.

Figure 8. The black dots represent wells within their respective subbasins and the red lines denote the neighbors of each well. When viewed as this level, it is easier to see that some wells have neighbors that are several 1000’s of feet away and show the spatial continuity (spatial autocorrelation even) of these sandy-gravel aquifers. Aggregating these data at a basin level makes it more difficult to observe the interconnection of each well within the aquifer.

Figure 9. Moran plot of groundwater depth per well's Thiessen polygon in the San Joaquin Basin.

Figure 10. Moran plot of subsidence per subbasin in the San Joaquin Basin.

Table 5. Output of Moran’s I test under normality. Here the Moran’s I statistic is positive and suggests that the data (in this case the wells and the basins that they reside in) are spatially clustered. Had the Moran’s I statistic been negative, then there would have been competitive clustering (wherein values repel each other).

Table 5. Monte-Carlo Simulation of Moran I results. The low p-value shows positive spatial autocorrelation that tends to be spatially random.

Table 6. Regression coefficients associated with groundwater depth, well depth, and area as they may relate to land subsidence. Here it would appear that all three variables are impactful when it comes to subsidence. Groundwater depth may have the least impact, but still has a p-value < 0.05 making it statistically significant and should be kept as one of the regression coefficients.

Figure 12. (right) Residuals vs Fitted Values plot. This plot demonstrates the relationship between residual values and estimated responses (or fitted values) and was used to assess non-linearity, unequal error variances, and outliers within the dataset. (left) Residual regression model diagnostics demonstrating fit. This model uses all three variables in assessing land subsidence. The Q-Q plot shows that this particular regression model still underpredicts values, but it does better at forming closure in the lower quantiles (i.e. the regression line is closer to fitting the observed data).

Table 7. Showing that heteroscedasticity is present in the dataset.