School Quality and Housing Prices in L.A. County

Does public school quality affect local housing prices in Los Angeles County?

View of downtown Los Angeles in bright daylight with the Griffith Observatory in the foreground.

Introduction

"Location, location, location!"—it's the central tenet of real estate, and for good reason! But what truly makes one location stand out from another in a city as diverse as Los Angeles? While the LA housing market has earned its reputation as one of the priciest in the nation, we're curious about what sets certain neighborhoods apart, beyond just geography. Could it be the quality of nearby schools, especially high schools, that draws people in? We believe that the allure of a top-notch school district plays a significant role in homebuyers' decisions, and that the quality of these schools often reflects the income level of the surrounding area. This, in turn, could be a key factor in driving up home prices. To put our theory to the test, we've mapped the data to uncover just how much influence these factors have on determining the value of a home.

Data Preparation

Public School Data

First, we needed public school data. We obtained testing data for public schools in the State of California from the California Assessment of Student Performance and Progress (https://caaspp-elpac.ets.org/caaspp/researchfiles/sb_ca2023_all_csv_v1.zip). The documentation is available here: https://caaspp-elpac.ets.org/caaspp/ResearchFileListSB?ps=true&lstTestYear=2023&lstTestType=B&lstCounty=00&lstDistrict=00000

We started by importing the data into RStudio, formatted the variable names in upper camel case, and defined the column data types in a dataframe. We also added metadata to the dataframe identifying when and where the data was accessed.

# Load the necessary libraries
library(readr)     # read_delim()
library(rio)       # import()/export()
library(tidyverse) # dplyr, stringr, ggplot2, and friends
library(janitor)   # clean_names()
library(stringr)   # str_pad() (also attached by tidyverse)
library(dplyr)     # data wrangling (also attached by tidyverse)

# Start with the Test Data from https://caaspp-elpac.ets.org
# Import full data set for California and clean names
StatewideFile <- read_delim("Data/sb_ca2023_all_csv_v1.txt", 
                            delim = "^", escape_double = TRUE, na = "NA",
                            col_types = cols(`County Code` = col_number(), 
                                             `District Code` = col_number(), `Total Tested at Reporting Level` = col_number(), 
                                             `Total Tested with Scores at Reporting Level` = col_number(), 
                                             `Mean Scale Score` = col_number(), 
                                             `Percentage Standard Exceeded` = col_number(), 
                                             `Percentage Standard Met` = col_number()), 
                            trim_ws = TRUE) |>
                  clean_names("upper_camel")

# Adding metadata as attributes
attr(StatewideFile, "source") <- "https://caaspp-elpac.ets.org/caaspp/researchfiles/sb_ca2023_all_csv_v1.zip"
attr(StatewideFile, "accessed_on") <- "2024-07-15"

# Accessing metadata
StatewideFile_source <- attr(StatewideFile, "source")
StatewideFile_accessed_on <- attr(StatewideFile, "accessed_on")

cat("Source:", StatewideFile_source, "\n")
cat("Accessed on:", StatewideFile_accessed_on, "\n")

We focused exclusively on public high schools in Los Angeles County, filtering the dataset to include only math scores, as we hypothesized that high-performing schools would excel in STEM programs. We then selected the relevant variables and removed observations identified as state, counties, and school districts to concentrate on school-level data. All student demographics were included without exclusions based on race, gender, or other factors, ensuring a comprehensive analysis. Charter schools were excluded because we wanted to focus on traditional public high schools only. Schools missing test scores were removed from the dataset, as these observations would not contribute to our analysis. We also ensured that all School Codes were consistently 7 digits long to accurately construct the CDSCode—a unique 14-digit code that identifies schools within the California Department of Education’s system. Finally, the CDSCode was prepared for a subsequent join operation with another dataframe later on.

# Get only the LA county school quality data from the Statewide File
DataSchoolQuality <- StatewideFile |>  
  filter(TestId == 2) |> # for math test scores
  select(CountyCode, DistrictCode, SchoolCode, 
         StudentGroupId, TestType,
         TotalTested = TotalTestedAtReportingLevel,
         Grade, MeanScaleScore, 
         PercStdMet = PercentageStandardMet, TypeID = TypeId) |>  
  filter(SchoolCode != 0) |> # excludes state, counties, and districts
  filter(StudentGroupId == 1) |>  # all student demographics
  filter(Grade == 11) |> # highschool test results only
  filter(TypeID == 7) |>  # 7 = School, 9 = Direct Funded Charter School, 10 = Locally Funded Charter School
  filter(!is.na(MeanScaleScore)) |> # drop missing test scores
  filter(CountyCode == 19) |>  # 19 = L.A. County
  mutate(SchoolCode = str_pad(SchoolCode, width = 7, pad = "0")) |>  # Ensure 7-digit SchoolCode
  mutate(CDSCode = paste0(CountyCode, DistrictCode, SchoolCode))

Next, we obtained the addresses of the schools from the California Department of Education (https://www.cde.ca.gov/schooldirectory/report?rid=dl1&tp=txt). The documentation for the dataset is available here: https://www.cde.ca.gov/ds/si/ds/fspubschls.asp.

As before, we imported the data, formatted the data types, and converted the variable names to upper camel case. Again, we added metadata to the dataframe identifying when and where the data was accessed.

# Import full data set for California and clean names
PubSchls <- read_delim("Data/pubschls.txt", 
                       delim = "\t", escape_double = FALSE, 
                       col_types = cols(OpenDate = col_datetime(format = "%m/%d/%Y"), 
                                        Latitude = col_double(), 
                                        Longitude = col_double(), 
                                        LastUpDate = col_datetime(format = "%m/%d/%Y")), 
                       na = "NA", trim_ws = TRUE) |>
            clean_names("upper_camel")

# Adding metadata as attributes
attr(PubSchls, "source") <- "https://www.cde.ca.gov/schooldirectory/report?rid=dl1&tp=txt"
attr(PubSchls, "accessed_on") <- "2024-07-23"

# Accessing metadata
PubSchls_source <- attr(PubSchls, "source")
PubSchls_accessed_on <- attr(PubSchls, "accessed_on")

cat("Source:", PubSchls_source, "\n")
cat("Accessed on:", PubSchls_accessed_on, "\n")

Next, we selected the variables of interest and renamed SOC and SOCType (after conversion to upper camel case), as well as Lat and Long (short for latitude and longitude). We filtered the address data to include only high schools, K-12 schools, and "High Schools in 1 School District (Public)." We also excluded a small number of homes and hospitals that operate as high schools, reasoning that these non-traditional institutions are unlikely to attract enough home buyers to influence property values in their immediate vicinity.

# Next we want only high schools and not other types of schools (like continuation schools)
# Filter the dataframe based on the "SOC" column values, excluding hospitals and homes
DataPubSchlsAdr <- PubSchls |>
               select(CDSCode = CdsCode, StatusType, County, School,
                      Street, City, Zip, SOC = Soc, SOCType = SocType, EdOpsCode, 
                      EdOpsName, Virtual, Lat = Latitude, Long = Longitude) |>
               filter(SOC %in% c(65, 66, 67)) |> # keep only high schools, K-12 schools, and high schools in 1 district
               filter(EdOpsCode != "HOMHOS") # exclude homes and hospitals

Finally, we performed a join operation to combine the testing data with the school address data. An inner join was chosen so that any unmatched records would be dropped automatically. The combined data was exported from RStudio as a CSV file and then imported as a layer into ArcGIS Online ("LayerSchoolQuality").

# Join the test data and address data
DataSchoolFinal <- inner_join(DataSchoolQuality, DataPubSchlsAdr, 
                              by = join_by(CDSCode == CDSCode))

# Export the combined file
export(DataSchoolFinal, file = "Data/DataSchoolFinal.csv")

Housing Data

Housing data was provided to us by Dr. Carsten Lange in a CSV file with 4,200 recently sold houses in Los Angeles County. A map layer was created from this file after it was imported into ArcGIS Online ("DataHouses4200").

Median Income Data

Median income data was provided by Esri using ArcGIS Online's "Living Atlas" feature. The data includes the median household income in the United States categorized by state, county, and census tract boundary. We filtered the map to show only census tracts in Los Angeles County. The map below shows the data classified into five classes along natural breaks in median income.

Median Household Income Classified By Census Tract Boundaries
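
For reference, a natural-breaks (Jenks) classification like the one ArcGIS applies here can also be computed in R with the classInt package. The sketch below is only illustrative; the income values are made up and do not come from the Living Atlas layer.

# Illustrative sketch: five-class natural-breaks (Jenks) classification in R
# (the classInt package and the income values below are assumptions for illustration)
library(classInt)

TractIncome <- c(32000, 45000, 58000, 61000, 75000, 90000, 110000, 145000, 210000)
Breaks <- classIntervals(TractIncome, n = 5, style = "jenks")
Breaks$brks # the class boundaries that would be used to shade the map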

Importing the Data into a Map

Importing Housing Data and School Data

Before we could analyze the relationship between home prices and school quality, we imported both layers into ArcGIS Online by searching for and adding them through "My Organization". The previously prepared school quality data was saved as "LayerSchoolQuality".

Importing School Quality Layer into ArcGIS

The housing data provided by Dr. Lange was named "DataHouses4200".

Importing Recently Sold Houses into ArcGIS

Combining Housing Data and Median Income

To estimate household income for each house in our data, we assigned the median income of its corresponding census tract. This approach provides a general sense of neighborhood income based on location. We used ArcGIS Online's "Overlay Layers" tool in the "Manage Data" section under the Analysis tab to accomplish this.

Overlay Features to combine housing and median income data

Each large blue circle now indicates a house with the corresponding median household income.
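
For readers who prefer scripting, this overlay step could in principle be reproduced in R with the sf package. The sketch below is not what we ran (we used ArcGIS Online); the file names, the houses' coordinate column names, and the MedianHouseholdIncome field name are all assumptions.

# A rough R/sf equivalent of ArcGIS Online's "Overlay Layers" step (sketch only;
# file names, coordinate columns, and the income field name are assumptions)
library(sf)
library(rio)
library(dplyr)

# House points built from the provided CSV (assumed "Long"/"Lat" columns)
Houses <- import("Data/DataHouses4200.csv") |>
  st_as_sf(coords = c("Long", "Lat"), crs = 4326)

# Census tract polygons carrying median household income (assumed file/field)
Tracts <- st_read("Data/MedianIncomeTractsLA.geojson") |>
  st_transform(4326)

# Point-in-polygon join: each house inherits the median income of its tract
HousesWithIncome <- st_join(Houses,
                            Tracts |> select(MedianHouseholdIncome),
                            join = st_within)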

Combining Housing Data and School Data

We had both the house and school data as layers but needed to match each house with its closest school to analyze their relationship. We used the "Find Closest" feature in the "Use Proximity" section of the Analysis tab, assigning each house to a school based on line distance.

Assigning each house to the school with the shortest line distance

Each pink dot represents a school, and each line represents the straight-line distance from a house to its closest school.

Having established a relationship between each house and its closest school, we joined the tables to create a new dataset that includes details on each house and its nearest school. We accomplished this using the "Join Features" tool in the "Summarize Data" section of the Analysis tab. The house and median income layer served as the Target Layer, while the connecting lines layer, which contains the distance from each house to its nearest school, was the Join Layer. The common field used for matching in both tables was the Assessor's Identification Number (AIN), a unique identifier for properties. This process resulted in a final dataset that includes house and income information, along with details about the closest school and its quality.

Combining housing, median income, and school data

Each blue dot now includes house information as well as school information.
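
The "Find Closest" and "Join Features" steps above could likewise be sketched in R with sf, continuing from the HousesWithIncome object in the earlier sketch. The school coordinates come from the Lat/Long columns exported in DataSchoolFinal.csv; everything else here is illustrative rather than the workflow we actually used.

# A rough R/sf sketch of matching each house to its nearest school
# (continues the earlier sketch; not the ArcGIS Online workflow we used)
library(sf)
library(rio)
library(dplyr)

Schools <- import("Data/DataSchoolFinal.csv") |>
  st_as_sf(coords = c("Long", "Lat"), crs = 4326)

# Index of the nearest school for every house
NearestIdx <- st_nearest_feature(HousesWithIncome, Schools)

# Attach the nearest school's attributes plus the straight-line distance (meters)
HousesWithSchools <- HousesWithIncome |>
  mutate(SchoolName = Schools$School[NearestIdx],
         SchoolMeanScaleScore = Schools$MeanScaleScore[NearestIdx],
         SchoolPercStdMet = Schools$PercStdMet[NearestIdx],
         DistanceToSchool = as.numeric(
           st_distance(HousesWithIncome, Schools[NearestIdx, ], by_element = TRUE)))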

Exporting the Final Dataset

With the final layer complete, we exported the dataset as an Excel file called "HousesWithIncomeAndSchoolQualityYA.xlsx". This dataset will be imported into RStudio in the next step.

Visualizing the Complete Dataset

First, we visualized the complete dataset on the map. The map below shows that expensive houses cluster along the coast, in some foothill communities, and in the valleys, while lower-cost houses tend to sit farther inland, in the inner city, the Inland Empire, Lancaster, and Palmdale. Generally, more expensive houses (more red) are associated with higher-quality schools (also more red) and lie in higher-median-income tracts (more green). The next steps attempt to quantify the average effect of each attribute on house price.

Map showing recently sold houses, median income, and school quality

Regression Analysis with RStudio

Importing the Data Into RStudio

The first step was to import the complete dataset, format the column names as upper camel case, select the variables of interest, and check for missing values. To address potential heteroskedasticity driven by high-priced outliers, we applied a logarithmic transformation to the Price variable, creating a new variable, lPrice. Seven observations were missing median income values; these houses were dropped because they could not be used in the regression. The resulting dataset was saved to a new dataframe.

# Load necessary libraries
library(tidyverse)
library(tidymodels)
library(rio)
library(janitor)
library(ggplot2)
library(corrplot)

# Import the dataset and clean column names
DataHousesSchools <- import("Data/HousesWithIncomeAndSchoolQualityYA.xlsx") |>
  clean_names("upper_camel") |>
  mutate(lPrice = log(Price)) |> # log transform of the Price
  select(AIN = Ain, HouseStreet = Street, HouseCity = City, Price, lPrice, 
         Bedrooms, Bathrooms = BathRooms, Sqft,
         MedianHouseholdIncome = MedianHouseholdIncomeInPast12MonthsInflationAdjustedDollarsToLastYearOf5YearRange, 
         SchoolMeanScaleScore = LayerSchoolQualityMeanScaleScore, 
         SchoolPercStdMet = LayerSchoolQualityPercStdMet,
         CDSCode = LayerSchoolQualityCdsCode, 
         SchoolName = LayerSchoolQualitySchool, 
         SchoolStreet = LayerSchoolQualityStreet, 
         SchoolCity = LayerSchoolQualityCity) 

# Look for missing values in the dataset. Found 7 missing median income.
DataHousesSchools |> 
  summarise_all(~ sum(is.na(.)))

# Remove rows with missing values
DataHousesSchoolsClean <- DataHousesSchools |> drop_na(MedianHouseholdIncome) 

We examined the histogram plots of Price and lPrice to determine the most appropriate variable for analysis. As anticipated, Price exhibited significant right skewness, largely influenced by the presence of very expensive homes. In contrast, the distribution of lPrice was much closer to a normal distribution. Therefore, we opted to use the log-transformed version of house prices (lPrice), a common practice in housing market analyses. This transformation helps mitigate issues of heteroskedasticity in the regression residuals, leading to more reliable and interpretable results.

Comparison of the Distributions of Price and lPrice
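
The comparison above could be produced with a few lines of ggplot2; the sketch below (bin count chosen for illustration) plots both distributions as facets.

# Sketch: compare the distributions of Price and lPrice side by side
# (the bin count is an illustrative choice)
library(dplyr)
library(tidyr)
library(ggplot2)

DataHousesSchoolsClean |>
  select(Price, lPrice) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Distributions of Price and lPrice") +
  theme_minimal()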

Examining the Correlation Matrix for Multicollinearity

We examined the correlation matrix to assess multicollinearity among our key variables: log Price, Square Footage (Sqft), Median Household Income, and School Mean Scale Score. Specifically, we wanted to ensure that no pairs of variables were highly correlated (i.e., correlation coefficients greater than 0.7), which could indicate multicollinearity and potentially distort the results of our regression analysis. This step was crucial to validate the independence of our predictors, ensuring the robustness of the subsequent regression analysis.

# Calculate the correlation matrix
correlation_matrix <- cor(DataHousesSchoolsClean[, 
                                                 c("lPrice", "Sqft", 
                                                   "MedianHouseholdIncome", 
                                                   "SchoolMeanScaleScore")], 
                          use = "complete.obs")

# Visualize the correlation matrix with values included
corrplot(correlation_matrix, method = "color", 
         tl.col = "black", tl.srt = 45, 
         addCoef.col = "black", number.cex = 0.7)

Correlation plot of lPrice, Square Footage, Median Household Income, and School Mean Scale Score

The strongest correlation was observed between median household income and house price. All pairwise correlations among the predictors were below the 0.7 threshold, so multicollinearity does not appear to be a concern.
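
As a complementary check (not part of the original workflow), variance inflation factors could be computed on a plain lm fit; VIF values well below the common threshold of 5 would reinforce the conclusion from the correlation matrix. The car package used below is an extra dependency.

# Optional complementary multicollinearity check using variance inflation factors
# (the car package is an additional assumption, not used elsewhere in this analysis)
library(car)

LmCheck <- lm(lPrice ~ SchoolMeanScaleScore + Sqft + MedianHouseholdIncome,
              data = DataHousesSchoolsClean)
vif(LmCheck) # values near 1 indicate little multicollinearity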

Defining the Model using TidyModels

We proceeded by splitting the dataset into training and testing subsets to evaluate the accuracy and generalizability of our predictive model. This step is crucial for assessing how well the model performs on unseen data, preventing overfitting and ensuring reliable predictions. We used a 70/30 split: 70% of the data was allocated to the training set, which the model uses to learn the relationships between variables, and 30% was set aside as the testing set, which we use later to evaluate the model's predictive accuracy. The split was stratified by lPrice to maintain consistent distributions across both subsets. To ensure reproducibility of the split, we set a random seed.

# Split into training and test dataset. Set seed for reproducibility.
set.seed(876)
Split7030 <- initial_split(DataHousesSchoolsClean, prop = 0.7, strata = lPrice)
DataTrain <- training(Split7030) # 70% goes to training
DataTest <- testing(Split7030) # 30% goes to testing

Next, we defined the model’s structure using Tidymodels, a robust framework for building and evaluating machine learning models in R.

Recipe Definition: We created a recipe that specifies how the data should be preprocessed before modeling. In our case, the recipe indicates that the target variable is the log-transformed home price (lPrice), and the predictors are the mean school scale score (SchoolMeanScaleScore), square footage (Sqft), and median household income (MedianHouseholdIncome). This step allows for consistent data preparation, ensuring that all transformations are applied uniformly across the training and testing datasets.

# Define the recipe for Tidymodels
RecipeHousesSchools <- recipe(lPrice ~ SchoolMeanScaleScore + Sqft + 
                              MedianHouseholdIncome, data = DataTrain) 

Model Definition: We specified a linear regression model (linear_reg) for predicting the log-transformed home price. Linear regression was chosen for its simplicity and interpretability, making it a good starting point for understanding the relationships between the predictors and the target variable.

# Define the model for Tidymodels
ModelDesignHousesSchools <- linear_reg() |> 
                            set_engine("lm") |> 
                            set_mode("regression")

Workflow Definition: Finally, we combined the recipe and model into a workflow. The workflow ensures that all preprocessing steps defined in the recipe are applied to the data before fitting the model.

# Define the workflow for Tidymodels and fit to the training data
WFModelHousesSchools <- workflow() |>  
                        add_recipe(RecipeHousesSchools) |> 
                        add_model(ModelDesignHousesSchools) |> 
                        fit(DataTrain)

Regression Analysis Results

We ran the regression model on the training dataset and examined the results. We anticipated that all of our explanatory variables—School Mean Scale Score, Square Footage, and Median Household Income—would be positively correlated with Price. This expectation is based on the intuition that higher values of these variables would generally correspond to an increase in home prices in the surrounding area.

# View the results of the regression on the training set
ResultsTidy <- tidy(WFModelHousesSchools)
print(ResultsTidy)

Regression results based on the training dataset

The regression results confirmed our assumptions, showing that each of the explanatory variables positively correlates with home prices. Additionally, we were able to assess the statistical significance of each variable’s contribution to the model. The p-values indicate that all variables are highly significant predictors of home prices, with very low p-values far below the typical threshold of p = 0.05.

The coefficients (in the estimate column) suggest that, on average, holding other factors constant:

  • A one-unit increase in the School Mean Scale Score is associated with an increase in home price of approximately 0.145%. 
  • An additional square foot of living space is associated with an increase in home price of approximately 0.0296%.
  • A $10,000 increase in Median Household Income corresponds to an increase in home price of approximately 4.08% (see the short calculation sketch below).
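
Because the outcome is the log of price, a coefficient beta translates into a percentage effect of roughly 100 * beta * delta for a change of delta in a predictor; the exact effect is 100 * (exp(beta * delta) - 1). The quick check below uses the rounded coefficients implied by the bullets above, not the exact regression estimates.

# Approximate vs. exact percentage effects in a log-linear model
# (coefficients below are rounded values implied by the bullets above,
# not the exact estimates from the regression output)
BetaSchool <- 0.00145  # per point of School Mean Scale Score
BetaSqft   <- 0.000296 # per square foot
BetaIncome <- 4.08e-06 # per dollar of median household income

ApproxPct <- function(beta, delta) 100 * beta * delta            # rule of thumb
ExactPct  <- function(beta, delta) 100 * (exp(beta * delta) - 1) # exact for a log outcome

ApproxPct(BetaSchool, 1)     # ~0.145%
ApproxPct(BetaSqft, 1)       # ~0.0296%
ApproxPct(BetaIncome, 10000) # ~4.08%
ExactPct(BetaIncome, 10000)  # ~4.16%; approximation and exact effect diverge slightly for larger changes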

Next, we looked at our predictions side by side with our test data to see how well the model predicted "lPrice" in the test data.

# Evaluate the model on test dataset. Compare prediction with actual.
DataTestWithPred <- augment(WFModelHousesSchools, DataTest) |> 
                    select(lPrice, .pred, .resid, HouseStreet)

# View the first 10 values
head(DataTestWithPred, n = 10)

Test dataset augmented with predictions from the model, first ten observations

In the results above, "lPrice" is the actual log-transformed sale price of each house in the test dataset, ".pred" is the model's prediction of lPrice, and ".resid" is the residual (the difference between the actual and predicted lPrice). We included the street address so that we could look up individual houses online when evaluating outliers.

After fitting the model, we evaluated its performance using the test dataset to see how well it predicts home prices. We used three common metrics: Root Mean Square Error (RMSE), R-squared (R²), and Mean Absolute Error (MAE). Each of these metrics provides a different perspective on the model’s accuracy:

# Evaluate model performance
Metrics <- DataTestWithPred |> 
           metrics(truth = lPrice, estimate = .pred)
print(Metrics)

Model metrics including root mean square error, R-squared, and mean average error

  • Root Mean Square Error (RMSE): The RMSE value is approximately 0.345. This metric measures the average magnitude of the prediction errors, giving more weight to larger errors. A lower RMSE indicates a better fit, meaning the model’s predictions are closer to the actual prices. In our case, the RMSE being noticeably larger than the MAE tells us that there are some large errors (outliers) in the data. This is expected, since the dataset contains a relatively small number of very expensive homes, which is why we log-transformed the prices in the first place.
  • R-squared (R²): The R² value is approximately 0.457. This metric represents the proportion of variance in the home prices that is explained by the model. An R² value of 0.457 suggests that about 45.7% of the variability in home prices can be explained by just the three variables in our model. While this indicates that the model captures some of the relationship between the predictors and the target variable, there is still a significant portion of variance that is not accounted for by the model. Other factors affecting home price could be demographics, local amenities, proximity to the beach, or many other reasons that folks purchase homes.
  • Mean Absolute Error (MAE): The MAE value is approximately 0.270. This metric represents the average absolute difference between the predicted and actual (log) home prices. Like RMSE, a lower MAE indicates better model accuracy, but MAE is less sensitive to large errors because it is a linear measure that weights all errors equally, regardless of their magnitude. (The recomputation sketch below makes the RMSE/MAE contrast concrete.)
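
To make the contrast between RMSE and MAE concrete, the same three metrics can be recomputed directly from the residuals in DataTestWithPred; the values should match the yardstick output above (yardstick's R-squared is the squared correlation between actual and predicted values).

# Recompute the metrics by hand from the augmented test data
DataTestWithPred |>
  summarise(RMSE = sqrt(mean(.resid^2)), # squaring makes large errors weigh more heavily
            Rsq  = cor(lPrice, .pred)^2, # squared correlation of actual vs. predicted
            MAE  = mean(abs(.resid)))    # every error weighs the same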

In our case, with RMSE at approximately 0.345 and MAE at approximately 0.270, the RMSE is a little bit larger than the MAE. This difference suggests that while the model performs reasonably well on average, there are some instances where the prediction errors are large, indicating the presence of outliers or situations where the model’s predictions are far off from the actual values. In particular, the model may not predict the sale price of very expensive houses very well.

We looked further at basic summary statistics of the testing data to make better sense of the model performance.

summary(DataTestWithPred)

Summary statistics of the test dataset

The log-transformed prices range from approximately 12.26 to 14.86. These values are on the log scale and indicate the spread and central tendency of the log-transformed prices.

The predicted log prices are, on average, slightly higher than the actual log prices. A small offset like this is common, and the predicted values remain close to the actual log prices, suggesting that the model's predictions are reasonably accurate. We evaluate the predictions for systematic bias below.

Residuals show the differences between the actual and predicted log prices. The median residual is very close to zero, indicating that the predictions are generally accurate. The range of residuals from -1.264 to 1.109 shows that while most predictions are close to the actual values, there are some cases with larger errors.

To further understand the error in our model we plotted the residuals vs. the predicted values.

# Plot residuals
ggplot(DataTestWithPred, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Predicted Values",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()

The primary goal of plotting residuals against the predicted values is to assess the performance and assumptions of the model, particularly the assumption of homoskedasticity (constant variance of errors) and to detect any patterns in the residuals. Ideally, the residuals should be randomly scattered around zero with no discernible pattern. If the residuals fan out (i.e., their variance increases or decreases as the predicted values increase), this indicates heteroskedasticity, suggesting that the model’s error variance is not constant.

In our case, the residuals do not fan out and appear randomly distributed, indicating that their variance is roughly uniform across the range of predicted values. This suggests that the log transform substantially reduced heteroskedasticity, so the model assumptions are better met, which supports more reliable coefficient estimates and statistical inferences. Predictions made by the model are likely to be consistent across different levels of the predictor variables; in particular, the model does not appear to systematically over- or underestimate house prices.

Further analysis of the residuals reveals that they are indeed approximately normally distributed.

# Histogram of residuals
ggplot(DataTestWithPred, aes(x = .resid)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Frequency") +
  theme_minimal()
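
As an additional check that is not part of the original workflow, a normal Q-Q plot of the residuals could be examined; points falling close to the reference line support approximate normality.

# Optional extra check: normal Q-Q plot of the residuals
ggplot(DataTestWithPred, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "Normal Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()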

Final Thoughts

This study demonstrated that school quality, median household income, and the square footage of a house are significant predictors of house prices. By incorporating just these three variables, we were able to develop a model that provides a reasonably accurate estimate of house prices. The statistical significance of each variable underscores their importance in the housing market, particularly within the context of Los Angeles County.

The integration of spatial data through ArcGIS was instrumental in accurately linking housing data with corresponding school quality metrics and neighborhood income levels. By visualizing these attributes on a map, we gained valuable insights into the geographic distribution of these factors and their influence on home prices. This spatial approach not only enhanced our analysis but also demonstrated the practical application of geographic information systems in real estate research.

Overall, the study confirms that high-quality schools and affluent neighborhoods contribute positively to home values, while also validating the effectiveness of using a log-transformed price variable to mitigate issues such as heteroskedasticity. The findings highlight the importance of considering both physical and socioeconomic factors in real estate valuation.
