Introduction

With a population of approximately 39 million, California continues to symbolize opportunity and innovation, attracting people from across the nation and around the world¹. Regions such as the Central Valley, Bay Area, Inland Empire, Los Angeles, and Orange County have experienced notable population surges. However, only five cities within these regions have seen substantial increases, bypassing more traditional urban hubs². This trend coincides with skyrocketing housing costs: San Francisco's average home value, for instance, has climbed to an astonishing $1,262,234³.

Historically, California has been one of the most desirable places to live, largely due to its economic opportunities. With the largest GDP in the United States (and the fifth largest globally, at $3.9 trillion in 2023⁴), the state draws individuals seeking diverse job prospects across industries such as technology, entertainment, and agriculture. Yet, these opportunities come at a cost: California faces a high cost of living (HCOL) and a severe housing shortage, particularly in its urbanized areas. As the demand for housing in cities continues to outpace supply, more residents are turning to suburban areas, reshaping the state's housing landscape and intensifying debates over affordability, equity, and sustainable development.

Becky Nicolaides's The New Suburbia: How Diversity Remade Suburban Life in Los Angeles after 1945 (2023) provides valuable historical context for this shift. According to Nicolaides, Los Angeles adopted a "suburban identity" by the 1940s to accommodate a growing population fueled by WWII and the Baby Boom. City leaders throughout the county pursued decentralized development, creating what Nicolaides describes as "borderlands, streetcar suburbs, edge cities, and corporate suburbs" (Nicolaides, 2023)⁵. These initiatives left a blueprint for accommodating those who sought proximity to urban centers without incurring the high costs associated with city living. Today, much of California's population resides in suburban areas, commuting to nearby urban centers for work, entertainment, and other activities.

The significance of this research lies in its practical application: with housing affordability becoming an increasingly urgent issue, this project provides insight into navigating California's complicated housing market. By examining geographical and economic data, this project aims to pinpoint locations where buying and owning a home is possible without breaking the bank, while also examining current housing market trends.

Question

How accurately can housing prices in California be predicted based on geographical factors and regional population dynamics?

Load packages

Here we load the packages tidyverse, tidymodels, olsrr, conflicted, scales, and maps to handle the cleaning, visualization, modeling, and prediction aspects of the data.

#install.packages("maps")
library(tidyverse)
library(tidymodels)
library(olsrr)
library(conflicted)
library(scales)
library(maps)

conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")

The Data

The data we're examining for this question is an Updated California Housing Prices dataset, which contains information gathered from the US Census Bureau covering 2009-2020. The data contains 11 variables: name, total_population, total_households, average_household_income, average_house_age, total_rooms, total_bedrooms, median_house_value, longitude, latitude, and ocean_proximity. The original data card can be found on kaggle.com.

Data Import

We import the data using read_csv(). The data can be found under the data directory of this repository.
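
As a minimal sketch (assuming the CSV sits at the same path that is re-used later in the analysis for the Random Forest model), the import looks like this:

# Read the updated California housing data from the repository's data directory
ca_housing <- read_csv("data/california_housing_updated.csv")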

Here are a few summary statistics of the original data import.

summary(ca_housing)
##      name           total_population total_households average_household_income
##  Length:25591       Min.   :    0    Min.   :   0.0   Min.   :     0          
##  Class :character   1st Qu.: 1009    1st Qu.: 348.0   1st Qu.: 66178          
##  Mode  :character   Median : 1421    Median : 475.0   Median : 89021          
##                     Mean   : 1537    Mean   : 512.0   Mean   : 92005          
##                     3rd Qu.: 1940    3rd Qu.: 640.5   3rd Qu.:115610          
##                     Max.   :39373    Max.   :7133.0   Max.   :212500          
##                                                                               
##  average_house_age  total_rooms         total_bedrooms   median_house_value  
##  Min.   : 0.00     Min.   :-666666666   Min.   :   0.0   Min.   :-666666666  
##  1st Qu.:21.60     1st Qu.:      1908   1st Qu.: 376.0   1st Qu.:    298400  
##  Median :26.90     Median :      2639   Median : 514.0   Median :    481400  
##  Mean   :26.51     Mean   :  -3388202   Mean   : 555.3   Mean   : -62367652  
##  3rd Qu.:32.10     3rd Qu.:      3634   3rd Qu.: 694.0   3rd Qu.:    735150  
##  Max.   :45.50     Max.   :     46497   Max.   :7960.0   Max.   :   2000001  
##                    NA's   :34                                                
##    longitude         latitude     ocean_proximity   
##  Min.   :-124.2   Min.   :32.55   Length:25591      
##  1st Qu.:-121.6   1st Qu.:33.93   Class :character  
##  Median :-118.6   Median :34.26   Mode  :character  
##  Mean   :-119.5   Mean   :35.54                     
##  3rd Qu.:-118.0   3rd Qu.:37.68                     
##  Max.   :-114.3   Max.   :41.75                     
## 
ca_housing
## # A tibble: 25,591 × 11
##    name           total_population total_households average_household_income
##    <chr>                     <dbl>            <dbl>                    <dbl>
##  1 Alameda County             1713              644                  150520.
##  2 Alameda County             1322              630                  167222.
##  3 Alameda County              940              417                  171948.
##  4 Alameda County             1043              413                  150914.
##  5 Alameda County             1206              423                  167677.
##  6 Alameda County             1318              775                  122429.
##  7 Alameda County             1076              590                   79110.
##  8 Alameda County             1458              631                  143221.
##  9 Alameda County             1544              706                  124129.
## 10 Alameda County              885              443                  118708.
## # ℹ 25,581 more rows
## # ℹ 7 more variables: average_house_age <dbl>, total_rooms <dbl>,
## #   total_bedrooms <dbl>, median_house_value <dbl>, longitude <dbl>,
## #   latitude <dbl>, ocean_proximity <chr>

Data Wrangling

# Dropping all the rows that contain the value -666666666
ca_housing <- ca_housing[!apply(ca_housing == -666666666, 1, any), ]
skimr::skim(ca_housing)
Data summary
Name ca_housing
Number of rows 23175
Number of columns 11
_______________________
Column type frequency:
character 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 11 22 0 58 0
ocean_proximity 0 1 6 10 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_population 0 1 1558.52 712.18 26.00 1035.00 1447.00 1962.00 9793.00 ▇▂▁▁▁
total_households 0 1 518.68 225.45 18.00 355.00 480.00 645.50 3465.00 ▇▂▁▁▁
average_household_income 0 1 95206.56 32698.05 12563.83 69637.62 92172.39 117990.33 212500.00 ▂▇▇▂▁
average_house_age 0 1 26.70 7.56 3.50 21.70 27.10 32.20 45.50 ▁▃▇▇▂
total_rooms 0 1 3018.50 1405.95 73.00 2013.00 2747.00 3744.00 16777.00 ▇▃▁▁▁
total_bedrooms 0 1 561.24 244.37 18.00 382.00 519.00 698.00 3577.00 ▇▂▁▁▁
median_house_value 0 1 630683.07 410368.30 9999.00 355700.00 532300.00 785650.00 2000001.00 ▆▇▃▁▁
longitude 0 1 -119.51 1.97 -124.16 -121.72 -118.66 -117.95 -114.35 ▂▆▅▇▁
latitude 0 1 35.56 2.08 32.55 33.93 34.27 37.69 41.75 ▇▁▅▂▁

Some of the columns in the dataset (e.g. average_household_income and median_house_value) contain the placeholder value -666666666, which marks missing data; we drop those rows above, as the placeholder values would throw off any EDA and analysis performed.

mean_income <- mean(ca_housing$average_household_income, na.rm = TRUE)

avg_income_distribution <- ggplot(ca_housing, aes(x = average_household_income)) +
  geom_histogram(fill="#a6bddb", color="#e9ecef", alpha=0.9) +
  geom_vline(aes(xintercept = mean_income), 
             color = "black", linetype = "dashed", size = 1, alpha = 0.8) +
  labs(title = "Distribution of Average Household Income",
       x = "Average Household Income",
       caption = paste("Mean Income:", round(mean_income, 2))) +
  theme_minimal()

show(avg_income_distribution)

In this visualization, we see that average household income is roughly symmetric with a slight right skew. The mean household income across all data points in the dataset is $95,206.56.

mean_house <- mean(ca_housing$median_house_value, na.rm = TRUE)

avg_value_distribution <- ggplot(ca_housing, aes(x = median_house_value)) +
  geom_histogram(fill="#a6bddb", color="#e9ecef", alpha=0.9) +
  geom_vline(aes(xintercept = mean_house), 
             color = "black", linetype = "dashed", size = 1, alpha = 0.8) +
  labs(title = "Distribution of Median House Value",
       x = "Median House Value",
       caption = paste("Median House Value:", round(mean_house, 2))) +
  theme_minimal()

show(avg_value_distribution)

In this visualization, we see that the distribution of median house values has a heavy right skew, with a noticeable cluster of values at the dataset's maximum of $2,000,001. The mean of the median house values is $630,683.07, showing that houses in California tend to fall on the pricier side.

mean_house <- mean(log(ca_housing$median_house_value), na.rm = TRUE)

avg_value_distribution <- ggplot(ca_housing, aes(x = log(median_house_value))) +
  geom_histogram(fill="#a6bddb", color="#e9ecef", alpha=0.9) +
  geom_vline(aes(xintercept = mean_house), 
             color = "black", linetype = "dashed", size = 1, alpha = 0.8) +
  labs(title = "Distribution of Median House Value",
       x = "Median House Value",
       caption = paste("Median House Value:", round(mean_house, 2))) +
  theme_minimal()

show(avg_value_distribution)

When we log-transform the median house value, we get a much more normal distribution, which is easier to work with.

ocean_prox_avg_house_income <- ggplot(ca_housing, aes(x = ocean_proximity, y = average_household_income, fill = ocean_proximity)) +
  geom_boxplot() +
  labs(title = "Ocean Proximity and Average Household Income",
       y = "Average household income",
       fill = "Ocean proximity") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

show(ocean_prox_avg_house_income)

In this visualization, we can see that average household incomes are higher for areas closer to the ocean, although both the Inland and <1H Ocean categories still remain relatively high. It should be noted that the Inland category has a significant number of outliers, suggesting a skew in the distribution of incomes across the ocean-proximity groups.

To get a better view of how ocean proximity and average household income vary geographically, here are the data points plotted onto a map of California:

ca_map <- map_data("state") |>
  subset(region == "california")
ggplot() +
  geom_polygon(data = ca_map, aes(x = long, y = lat, group = group),
               fill = "white", color = "gray50") +
  geom_point(data = ca_housing,
             aes(x = longitude, y = latitude, color = average_household_income,
                 size = total_households, alpha = total_population)) +
  scale_size_area(max_size = 1) +
  scale_color_viridis_c(
    name = "Average Household Income",
    labels = dollar_format(),
    option = "plasma"
  ) +
  labs(
    title = "Average Household Incomes in California",
    x = "Longitude",
    y = "Latitude",
    size = 'Total Households',
    alpha = 'Total Population'
  ) +
  theme_minimal() +
  coord_fixed(
    xlim = c(-125, -113),
    ylim = c(32, 43),
    ratio = 1.3
  )

As shown above, average household income tends to decrease the farther inland one goes (in terms of longitude) and to increase closer to the coast. This is largely because most of California's high-paying jobs are situated along the coast: Silicon Valley, Hollywood, and the growing biomedical industry in San Diego.
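
As a quick, informal check of this claim (a sketch using only the columns already in ca_housing), we can compare mean incomes across the ocean_proximity categories:

# Mean household income by ocean proximity, highest first
ca_housing |>
  group_by(ocean_proximity) |>
  summarise(mean_income = mean(average_household_income, na.rm = TRUE),
            n_rows = n()) |>
  arrange(desc(mean_income))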

Additionally, here is a plot of median house values in California:

ggplot() +
  geom_polygon(data = ca_map, aes(x = long, y = lat, group = group),
               fill = "white", color = "gray50") +
  geom_point(data = ca_housing,
             aes(x = longitude, y = latitude, color = median_house_value,
                 size = total_households, alpha = total_population)) +
  scale_size_area(max_size = 1) +
  scale_color_viridis_c(
    name = "Median House Value",
    labels = dollar_format(),
    option = "plasma"
  ) +
  labs(
    title = "Median House Values in California",
    x = "Longitude",
    y = "Latitude",
    size = 'Total Households',
    alpha = 'Total Population'
  ) +
  theme_minimal() +
  coord_fixed(
    xlim = c(-125, -113),
    ylim = c(32, 43),
    ratio = 1.3
  )

Similar to the map of average household incomes, median house values increase significantly closer to the coast. This suggests that median house value and average household income are correlated, given their similar geographic patterns.

ca_housing_cor <- cor(ca_housing |> dplyr::select_if(is.numeric))
corrplot::corrplot(ca_housing_cor, tl.cex = 0.5)

The correlation plot above further supports this: among the numeric predictors, average_household_income shows the strongest correlation with median_house_value, suggesting that the more a household earns annually, the pricier its home tends to be.
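
For reference, the single coefficient behind this claim can be pulled directly from the cleaned data (a quick sketch):

# Pairwise correlation between income and house value
cor(ca_housing$average_household_income, ca_housing$median_house_value,
    use = "complete.obs")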

top_counties_by_house_value <- ca_housing |>
  filter(median_house_value >= 0) |>
  group_by(name) |>
  summarise(
    total_county_value = sum(median_house_value, na.rm = TRUE),
    average_county_house_value = mean(median_house_value, na.rm = TRUE),
    total_county_houses = n()
  ) |>
  arrange(desc(average_county_house_value)) |>
  slice_head(n = 10)

# Print the results
print(top_counties_by_house_value)
## # A tibble: 10 × 4
##    name            total_county_value average_county_house…¹ total_county_houses
##    <chr>                        <dbl>                  <dbl>               <int>
##  1 San Mateo Coun…          605041480               1281868.                 472
##  2 San Francisco …          705052059               1272657.                 554
##  3 Marin County             202481021               1191065.                 170
##  4 Santa Clara Co…         1261735439               1151218.                1096
##  5 Alameda County           846174813                819937.                1032
##  6 Santa Cruz Cou…          109165500                785363.                 139
##  7 Santa Barbara …          207231515                770377.                 269
##  8 Orange County           1359221055                753866.                1803
##  9 Napa County               77606403                739109.                 105
## 10 Contra Costa C…          476905204                701331.                 680
## # ℹ abbreviated name: ¹​average_county_house_value
top_counties_by_house_value |>
  ggplot(aes(x = reorder(name, average_county_house_value), 
             y = average_county_house_value,
             fill = name)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = dollar(average_county_house_value)), 
            hjust = -0.1, 
            size = 3) +
  coord_flip() +
  labs(
    title = "Top 10 Counties with Highest Housing Value",
    x = "County",
    y = "Average House Value"
  ) +
  theme_classic() +
  theme(legend.position = "none",
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.title.y = element_blank()) +
  expand_limits(y = max(top_counties_by_house_value$average_county_house_value) * 1.1)

In the visualization, the top 10 counties with the highest average house values are displayed; based on this display, San Mateo County has the highest average value at approximately $1,281,868. San Mateo County sits on the San Francisco Peninsula in the Bay Area, just south of San Francisco; cities in the county include Daly City, South San Francisco, Redwood City, and East Palo Alto.

Analysis

In our analysis, we fit both a linear regression model and a random forest model, then compare their results to determine which more accurately predicts the price of a house in California given our dataset.

# Refactoring county name and ocean_proximity 
ca_housing <- ca_housing |>
  mutate(across(c(name, ocean_proximity), as.factor)) 

ca_housing <- ca_housing |> 
  mutate(median_house_value = log(median_house_value))

#splitting the data into training and testing sets
ca_split <- rsample::initial_split(data = ca_housing, prop = 2/3)
train_ca <- rsample::training(ca_split)
test_ca <- rsample::testing(ca_split)

# Recipe for linear model
housing_rec <- recipe(train_ca) |>
    update_role(everything(), new_role = "predictor") |>
    update_role(median_house_value, new_role = "outcome") |>
    step_dummy(name, ocean_proximity, one_hot = TRUE) |>
    step_corr(all_numeric()) |>
    step_nzv(all_numeric()) 

First, the variables name and ocean_proximity are converted to factors, since these characteristics affect a house's value, and median_house_value is log-transformed. We then split the dataset into training and testing sets (two-thirds for training) and define a recipe on the training set. This recipe is loosely based on the recipe from CS02.

prepped_rec <- prep(housing_rec, verbose = TRUE, retain = TRUE)
## oper 1 step dummy [training] 
## oper 2 step corr [training] 
## oper 3 step nzv [training] 
## The retained training set is ~ 1.77 Mb  in memory.
baked_train <- bake(prepped_rec, new_data = NULL)
baked_test_ca <- bake(prepped_rec, new_data = test_ca)

glimpse(baked_test_ca)
## Rows: 7,725
## Columns: 15
## $ total_population           <dbl> 1713, 1322, 1206, 1750, 2678, 748, 595, 150…
## $ total_households           <dbl> 644, 630, 423, 591, 1171, 241, 231, 758, 41…
## $ average_household_income   <dbl> 150520.19, 167222.22, 167677.30, 149141.29,…
## $ average_house_age          <dbl> 23.5, 16.7, 36.9, 37.4, 39.7, 40.1, 36.4, 2…
## $ total_rooms                <dbl> 5300, 4535, 2583, 3084, 6155, 1334, 1404, 3…
## $ median_house_value         <dbl> 14.30599, 13.93489, 14.18528, 14.03641, 13.…
## $ longitude                  <dbl> -122.2344, -122.2222, -122.2480, -122.2587,…
## $ name_Los.Angeles.County    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ name_Orange.County         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ name_Riverside.County      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ name_San.Diego.County      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ocean_proximity_X.1H.OCEAN <dbl> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0…
## $ ocean_proximity_INLAND     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ocean_proximity_NEAR.BAY   <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1…
## $ ocean_proximity_NEAR.OCEAN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Here, we prep the recipe and bake both the training and testing sets before fitting the linear model.

lm_CA_model <- parsnip::linear_reg() |>
  parsnip::set_engine("lm") |>
  set_mode("regression")

CA_wflow <- workflows::workflow() |>
    workflows::add_recipe(housing_rec) |>
    workflows::add_model(lm_CA_model)

CA_wflow_fit <- fit(CA_wflow, data = train_ca)

wflowoutput <- CA_wflow_fit |> 
  extract_fit_parsnip() |> 
  broom::tidy() 

wflowoutput
## # A tibble: 15 × 5
##    term                           estimate    std.error statistic    p.value
##    <chr>                             <dbl>        <dbl>     <dbl>      <dbl>
##  1 (Intercept)                 5.03         0.295         17.0     1.97e- 64
##  2 total_population           -0.000163     0.00000736   -22.1     2.12e-106
##  3 total_households            0.000602     0.0000299     20.1     1.13e- 88
##  4 average_household_income    0.0000132    0.000000117  112.      0        
##  5 average_house_age           0.0133       0.000494      27.0     5.87e-157
##  6 total_rooms                 0.000000190  0.00000467     0.0407  9.68e-  1
##  7 longitude                  -0.0534       0.00248      -21.5     3.02e-101
##  8 name_Los.Angeles.County     0.419        0.0101        41.6     0        
##  9 name_Orange.County          0.320        0.0142        22.5     1.25e-110
## 10 name_Riverside.County       0.151        0.0159         9.53    1.74e- 21
## 11 name_San.Diego.County       0.260        0.0143        18.2     6.04e- 73
## 12 ocean_proximity_X.1H.OCEAN -0.0264       0.0103        -2.56    1.04e-  2
## 13 ocean_proximity_INLAND     -0.148        0.00991      -15.0     2.50e- 50
## 14 ocean_proximity_NEAR.BAY    0.0988       0.0150         6.56    5.44e- 11
## 15 ocean_proximity_NEAR.OCEAN NA           NA             NA      NA
CA_wflow_fit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

After preparing and baking the training set with the previous recipe, we specify a linear regression model and fit it within a workflow. We then plot the ten features with the highest variable importance. According to the output, the most important feature in determining house value is average_household_income, which makes sense: the more a household earns, the pricier its home tends to be.

wf_fit <- CA_wflow_fit |> 
  extract_fit_parsnip()

wf_fitted_values <- 
  broom::augment(wf_fit[["fit"]], data = baked_train) |> 
  select(median_house_value, .fitted:.std.resid)


yardstick::metrics(wf_fitted_values,
                   truth = median_house_value, estimate = .fitted)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.379
## 2 rsq     standard       0.655
## 3 mae     standard       0.275
ggplot(wf_fitted_values, aes(x = median_house_value, y = .fitted)) +
  geom_point(alpha = 0.5, color = "blue") +  # Scatter plot points
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +  # Perfect prediction line
  labs(
    title = "Predicted vs Actual Values",
    x = "Actual Median House Value (log-transformed)",
    y = "Predicted Median House Value"
  ) +
  theme_minimal()

Above, we look at the model's performance. The Root Mean Squared Error (RMSE), a metric that gauges the average difference between observed values and the model's predictions, comes to around 0.379; because median_house_value was log-transformed, this error is in log units rather than dollars. We suspect part of this error stems from the right-skewed distribution of median house values, with a cluster of values at the $2,000,001 cap. To alleviate this, we next restrict the model to prices at or below the 97th percentile (below $2,000,000). Additionally, the R-squared value for this model comes out to approximately \(0.655\), or about \(66\%\).
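
For reference, the RMSE reported above is computed as

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\]

where \(y_i\) is the observed (log-transformed) median house value and \(\hat{y}_i\) is the model's prediction, so the value of \(0.379\) is in log units rather than dollars.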

# Re-filtering the data to keep only values at or below the 97th percentile
price_threshold <- quantile(ca_housing$median_house_value, 0.97)
ca_housing <- ca_housing %>%
  filter(median_house_value <= price_threshold)
# median_house_value was already log-transformed above, so no additional
# transformation is applied here

# Re-splitting the data into training and testing sets
ca_split <- rsample::initial_split(data = ca_housing, prop = 2/3)
train_ca <- rsample::training(ca_split)
test_ca <- rsample::testing(ca_split)

housing_rec <- recipe(train_ca) |>
    update_role(everything(), new_role = "predictor") |>
    update_role(median_house_value, new_role = "outcome") |>
    step_dummy(name, ocean_proximity, one_hot = TRUE) |>
    step_corr(all_numeric()) |>
    step_nzv(all_numeric()) 

Here, we re-run the model on the re-filtered data, keeping only housing values at or below the 97th percentile, as explained above.

# Prepping and training the model
prepped_rec <- prep(housing_rec, verbose = TRUE, retain = TRUE)
## oper 1 step dummy [training] 
## oper 2 step corr [training] 
## oper 3 step nzv [training] 
## The retained training set is ~ 1.83 Mb  in memory.
baked_train <- bake(prepped_rec, new_data = NULL)
baked_test_ca <- bake(prepped_rec, new_data = test_ca)

lm_CA_model <- parsnip::linear_reg() |>
  parsnip::set_engine("lm") |>
  set_mode("regression")

CA_wflow <- workflows::workflow() |>
    workflows::add_recipe(housing_rec) |>
    workflows::add_model(lm_CA_model)

CA_wflow_fit <- fit(CA_wflow, data = train_ca)

wflowoutput <- CA_wflow_fit |> 
  extract_fit_parsnip() |> 
  broom::tidy() 

wflowoutput
## # A tibble: 16 × 5
##    term                          estimate    std.error statistic    p.value
##    <chr>                            <dbl>        <dbl>     <dbl>      <dbl>
##  1 (Intercept)                 4.85        0.326          14.9    1.08e- 49
##  2 total_population           -0.000137    0.00000743    -18.5    1.56e- 75
##  3 total_households            0.000592    0.0000305      19.4    1.42e- 82
##  4 average_household_income    0.0000127   0.000000120   106.     0        
##  5 average_house_age           0.0126      0.000498       25.3    7.90e-138
##  6 total_rooms                -0.00000883  0.00000480     -1.84   6.60e-  2
##  7 longitude                  -0.0551      0.00273       -20.2    1.41e- 89
##  8 name_Los.Angeles.County     0.412       0.0110         37.4    1.88e-292
##  9 name_Orange.County          0.321       0.0148         21.6    5.09e-102
## 10 name_Riverside.County       0.146       0.0168          8.69   4.11e- 18
## 11 name_San.Bernardino.County  0.00992     0.0161          0.614  5.39e-  1
## 12 name_San.Diego.County       0.275       0.0151         18.2    4.57e- 73
## 13 ocean_proximity_X.1H.OCEAN -0.0185      0.0104         -1.79   7.35e-  2
## 14 ocean_proximity_INLAND     -0.142       0.00999       -14.2    1.82e- 45
## 15 ocean_proximity_NEAR.BAY    0.0917      0.0156          5.87   4.36e-  9
## 16 ocean_proximity_NEAR.OCEAN NA          NA              NA     NA
CA_wflow_fit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

# Looking at the results from excluding anything above the 97th percentile
wf_fit <- CA_wflow_fit |> 
  extract_fit_parsnip()

wf_fitted_values <- 
  broom::augment(wf_fit[["fit"]], data = baked_train) |> 
  select(median_house_value, .fitted:.std.resid)


yardstick::metrics(wf_fitted_values, truth = median_house_value, estimate = .fitted)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.377
## 2 rsq     standard       0.621
## 3 mae     standard       0.269

Like the original model, we bake and train this one, now excluding everything above the 97th percentile, and examine the results. This model has a slightly lower RMSE than the previous one (0.377 versus 0.379) and a somewhat lower R-squared (0.621 versus 0.655). As with the original linear regression model, the most important feature remains average_household_income.

To provide a comparison with the linear regression model above, we fit the dataset with a Random Forest algorithm; here we re-import the full dataset (including values above the 97th percentile) to gauge the model's general accuracy. Note that for this model median_house_value is left on the dollar scale (no log transform), so its error metrics are reported in dollars.

ca_housing <- read_csv("data/california_housing_updated.csv")
ca_housing <- ca_housing[!apply(ca_housing == -666666666, 1, any), ]
# Refactoring county name and ocean_proximity 
ca_housing <- ca_housing |>
  mutate(across(c(name, ocean_proximity), as.factor))

# Splitting the data into training and testing sets
ca_split <- rsample::initial_split(data = ca_housing, prop = 2/3)
train_ca <- rsample::training(ca_split)
test_ca <- rsample::testing(ca_split)

RF_rec <- recipe(train_ca) |>
    update_role(everything(), new_role = "predictor") %>%
    update_role(median_house_value, new_role = "outcome") %>%
    step_string2factor("ocean_proximity") %>%
    step_dummy(all_nominal_predictors()) %>%
    # Handle correlation
    step_corr(all_numeric(), threshold = 0.8) %>%
    # Remove near-zero variance
    step_nzv(all_numeric()) %>%
    # Normalize numeric predictors
    step_normalize(all_numeric_predictors())

# install.packages("randomForest")
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("randomForest") |>
  set_mode("regression")
    
tune_RF_model
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: randomForest
# Using Cross-fold validation for better model performance.
vfold_ca <- rsample::vfold_cv(data = train_ca, v = 4)

RF_tune_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(tune_RF_model)

n_cores <- parallel::detectCores()
doParallel::registerDoParallel(cores = n_cores)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples=vfold_ca, grid = 20)
tune_RF_results
## # Tuning results
## # 4-fold cross-validation 
## # A tibble: 4 × 4
##   splits               id    .metrics          .notes          
##   <list>               <chr> <list>            <list>          
## 1 <split [11587/3863]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [11587/3863]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [11588/3862]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [11588/3862]> Fold4 <tibble [40 × 6]> <tibble [1 × 3]>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: the standard deviation is zero, The correlation matrix has missin...
## 
## Run `show_notes(.Last.tune.result)` for more information.

Here, we use a setup similar to the linear regression model, but swap in the Random Forest algorithm, adjust the recipe accordingly (adding a correlation filter, a near-zero-variance filter, and normalization of numeric predictors), and tune mtry and min_n over a grid using 4-fold cross-validation.

tune_RF_results |>
  collect_metrics()
## # A tibble: 40 × 8
##     mtry min_n .metric .estimator       mean     n    std_err .config           
##    <int> <int> <chr>   <chr>           <dbl> <int>      <dbl> <chr>             
##  1     4    39 rmse    standard   226228.        4 4928.      Preprocessor1_Mod…
##  2     4    39 rsq     standard        0.703     4    0.00593 Preprocessor1_Mod…
##  3     6    15 rmse    standard   222977.        4 5178.      Preprocessor1_Mod…
##  4     6    15 rsq     standard        0.709     4    0.00729 Preprocessor1_Mod…
##  5     8    31 rmse    standard   225946.        4 5446.      Preprocessor1_Mod…
##  6     8    31 rsq     standard        0.701     4    0.00806 Preprocessor1_Mod…
##  7     3    19 rmse    standard   229811.        4 5144.      Preprocessor1_Mod…
##  8     3    19 rsq     standard        0.699     4    0.00579 Preprocessor1_Mod…
##  9     7    28 rmse    standard   224745.        4 5230.      Preprocessor1_Mod…
## 10     7    28 rsq     standard        0.704     4    0.00763 Preprocessor1_Mod…
## # ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
##    mtry min_n .metric .estimator    mean     n std_err .config              
##   <int> <int> <chr>   <chr>        <dbl> <int>   <dbl> <chr>                
## 1     5     7 rmse    standard   222035.     4   5078. Preprocessor1_Model20

After tuning the Random Forest model, we find that the best-performing combination (mtry = 5, min_n = 7) has a cross-validated RMSE of approximately \(222,035\) dollars.

Final Model Evaluation

tuned_RF_values<- select_best(tune_RF_results)
tuned_RF_values
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     5     7 Preprocessor1_Model20
# specify best combination from tune in workflow
RF_tuned_wflow <-RF_tune_wflow |>
  tune::finalize_workflow(tuned_RF_values)

# fit model with those parameters on train AND test
overallfit <- RF_tuned_wflow |>
  tune::last_fit(ca_split)

collect_metrics(overallfit)
## # A tibble: 2 × 4
##   .metric .estimator  .estimate .config             
##   <chr>   <chr>           <dbl> <chr>               
## 1 rmse    standard   224931.    Preprocessor1_Model1
## 2 rsq     standard        0.692 Preprocessor1_Model1
test_predictions <- collect_predictions(overallfit)

The RMSE for this model on the test set is approximately \(224,931\) dollars, with an R-squared value of about \(0.69\), which is higher than the Linear Regression model's \(0.62\).

overallfit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_ca)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_ca)

# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_ca) |> 
  bind_cols(train_ca |> select(median_house_value, name, ocean_proximity)) 

# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_ca) |> 
  bind_cols(test_ca |> select(median_house_value, name, ocean_proximity)) 

After fitting the tuned Random Forest on both the training and testing data and collecting predictions alongside median_house_value, name, and ocean_proximity, we find that the most important feature remains average_household_income, followed by longitude and average_house_age.

Overall Results

To visualize our models' performance, we plotted actual median house values against each model's predictions: the training-set fitted values for the linear regression model and the test-set predictions for the Random Forest model.

Linear Regression Model Results

wf_fitted_values |> 
  ggplot(aes(x =  median_house_value, y = .fitted)) + 
  geom_point(alpha = 0.5, color = "#2C3E50") + 
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Predicted vs. Actual Median House Values",
    subtitle = "Perfect predictions would fall on the dashed red line",
    x = "Predicted Values ($)",
    y = "Actual Values ($)",
    caption = "Model used is linear regression"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40"),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_line(color = "gray95"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

yardstick::metrics(wf_fitted_values, truth = median_house_value, estimate = .fitted)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.377
## 2 rsq     standard       0.621
## 3 mae     standard       0.269

To reiterate, the RMSE of this model is approximately \(0.377\) on the log scale, whilst its R-squared value is about \(0.62\).

Random Forest Model Results

This is our graph of actual values of median house value versus the predicted values in our RF model.

ggplot(values_pred_test, aes(x=median_house_value, y=.pred))+
  geom_point(alpha = 0.5, color = "#2C3E50") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Predicted vs. Actual Median House Value for Random Forest Modelling",
    subtitle = "Perfect predictions would fall on the dashed red line",
    x = "Actual Values ($)",
    y = "Predicted Values ($)",
    caption = "Model used is Random Forest"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40"),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_line(color = "gray95"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

collect_metrics(overallfit)
## # A tibble: 2 × 4
##   .metric .estimator  .estimate .config             
##   <chr>   <chr>           <dbl> <chr>               
## 1 rmse    standard   224931.    Preprocessor1_Model1
## 2 rsq     standard        0.692 Preprocessor1_Model1

The RMSE for the Random Forest model is approximately \(224,931\) dollars, whilst its R-squared value is about \(0.69\).

Discussion

In comparing the two models, we find that the Random Forest explains more of the variance in house values, with an R-squared of roughly \(0.69\) compared to the Linear Regression's \(0.62\). When comparing predicted and actual house values visually, the Random Forest's points sit closer to the "perfect prediction" line, while the Linear Regression's cluster further along the x-axis; however, the Random Forest fails to properly predict values at the \(\$2,000,001\) cap, as those observations are outliers not well accounted for by the model. This limitation can be attributed to the nature of ensemble tree-based models like Random Forest, which struggle with extreme outliers: trees split based on the bulk of the data, making them less sensitive to values far from the central distribution. The linear model appears more uniform because it fits a single global linear relationship across all data points, whereas the Random Forest builds many decision trees that fragment the prediction space. The two models' RMSE values are not directly comparable, since the Linear Regression was fit to log-transformed house values (RMSE of about \(0.377\) in log units) while the Random Forest was fit to raw dollar values (RMSE of about \(\$224,931\)); comparing them on the same dollar scale would require back-transforming the linear model's predictions.
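
As a minimal sketch of that back-transformation (assuming wf_fitted_values from the linear fit is still in memory, and noting that it holds training-set fitted values rather than test-set predictions):

# Back-transform log-scale predictions with exp() and recompute metrics in dollars
wf_fitted_values |>
  mutate(actual_dollars = exp(median_house_value),
         predicted_dollars = exp(.fitted)) |>
  yardstick::metrics(truth = actual_dollars, estimate = predicted_dollars)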

Overall, the important features of each model remain relatively the same across the board, with average_household_income remaining the dominant feature for both Linear Regression and Random Forest. For both models, this dominant feature makes sense given that most individuals with a high household income would typically have pricier living conditions (and vice versa). This is visible in the EDA, where both median_housing_value and average_household_income are concentrated within certain areas such as the Bay Area and San Diego. More so, ocean_proximity does also play a significant part in both Linear Regression and Random Forest, as most “beachfront” and “bay-front” properties are often pricier than most houses, given their prime locations in the state.

In our analysis of the California Housing dataset, one limiting factor that arose was the presence of several outliers, specifically within median_house_value. As mentioned previously, the Random Forest (and initial Linear Regression) model failed to properly predict values at or above the \(\$2,000,001\) cap (beyond the 97th percentile). Since these values could not be normalized, their predictions become less reliable, potentially due to the models' inability to capture the extreme variation in high-value property markets, such as premium coastal areas or tech-hub regions like Silicon Valley.
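
One way to quantify this limitation (a hypothetical diagnostic, assuming ca_housing still holds the full cleaned data used for the Random Forest) is to count how many observations sit at or near the dataset's $2,000,001 ceiling:

# Count and share of observations at the top-coded maximum
ca_housing |>
  summarise(n_capped = sum(median_house_value >= 2000000),
            share_capped = mean(median_house_value >= 2000000))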

Conclusion

The study successfully developed two distinct predictive models, Linear Regression and Random Forest, to estimate median house values in California. The two models report RMSE on different scales (log units for the Linear Regression, dollars for the Random Forest), so their errors are not directly comparable, but the Random Forest model distinguished itself with a notably higher R-squared value of roughly \(0.69\), compared to the Linear Regression model's \(0.62\). This higher R-squared indicates that the Random Forest model explains about 69% of the variance in median house values, suggesting a more nuanced and comprehensive capture of the complex relationships within the California housing market.

The model performances highlight the dataset’s complexity, particularly in capturing high-value property variations (as shown with the several outliers). Although the Random Forest model showed superior explanatory power, both models consistently identified key predictors such as average_household_income and ocean_proximity as critical factors influencing house values. These findings not only provide insights into California’s housing market dynamics, but also demonstrate the potential of machine learning techniques in real estate price prediction.

Future research could focus on addressing the models’ limitations, particularly their challenges with high-value property outliers, potentially through advanced techniques like XGBoost or GMMs.
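
As an illustration of that direction (a sketch only, assuming the xgboost package is available; the tuning setup mirrors the Random Forest workflow above rather than reporting any new results):

# Hypothetical boosted-tree specification that could slot into the existing workflow
xgb_model <- parsnip::boost_tree(trees = 1000, tree_depth = tune(),
                                 learn_rate = tune()) |>
  parsnip::set_engine("xgboost") |>
  parsnip::set_mode("regression")

xgb_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(xgb_model)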

References


  1. Tikkanen, A. (2024, January 17). List of U.S. states by population. Encyclopedia Britannica. https://www.britannica.com/topic/largest-U-S-state-by-population

  2. California's Population is Increasing. (2024, April 30). California Governor; CAWeb Publishing Service. https://www.gov.ca.gov/2024/04/30/californias-population-is-increasing/

  3. Zillow. (2024). San Francisco, CA Housing Market. https://www.zillow.com/home-values/20330/san-francisco-ca/

  4. Bohn, S., & Duan, J. (2024, October 14). California's Economy. Public Policy Institute of California. https://www.ppic.org/publication/californias-economy/

  5. Nicolaides, B. M. (2023). The Historic Suburban Landscapes of Los Angeles. Oxford University Press eBooks, 19–68. https://doi.org/10.1093/oso/9780197578308.003.0002