With a population of approximately 39 million, California continues to symbolize opportunity and innovation, attracting people from across the nation and around the world [1]. Regions such as the Central Valley, Bay Area, Inland Empire, Los Angeles, and Orange County have experienced notable population surges. However, only five cities within these regions have seen substantial increases, bypassing more traditional urban hubs [2]. This trend coincides with skyrocketing housing costs: San Francisco’s average home value, for instance, has climbed to an astonishing $1,262,234 [3].
Historically, California has been one of the most desirable places to live, largely due to its economic opportunities. With the largest GDP in the United States—and the fifth largest globally, at $3.9 trillion in 2023 [4]—the state draws individuals seeking diverse job prospects across industries such as technology, entertainment, and agriculture. Yet, these opportunities come at a cost: California faces a high cost of living (HCOL) and a severe housing shortage, particularly in its urbanized areas. As the demand for housing in cities continues to outpace supply, more residents are turning to suburban areas, reshaping the state’s housing landscape and intensifying debates over affordability, equity, and sustainable development.
Becky Nicolaides’s The New Suburbia: How Diversity Remade Suburban Life in Los Angeles after 1945 (2023) provides valuable historical context for this shift. According to Nicolaides, Los Angeles adopted a “suburban identity” by the 1940s to accommodate a growing population fueled by WWII and the Baby Boom. City leaders throughout the county pursued decentralized development, creating what Nicolaides describes as “borderlands, streetcar suburbs, edge cities, and corporate suburbs” (Nicolaides, 2023) [5]. These initiatives left a blueprint for accommodating those who sought proximity to urban centers without incurring the high costs associated with city living. Today, much of California’s population resides in suburban areas, commuting to nearby urban centers for work, entertainment, and other activities.
The significance of this research lies in its practical application: with housing affordability becoming an increasingly urgent issue, this project provides insight into navigating California’s complicated housing market. By examining geographical and economic data, this project aims to pinpoint locations with desirable traits where one could buy and own a home without breaking the bank, while also examining current housing market trends.
How accurately can housing prices in California be predicted based on geographical factors and regional population dynamics?
Here we load the packages tidyverse, tidymodels, olsrr, conflicted, scales, and maps to handle the cleaning, analysis, modeling, and prediction aspects of the data.
#install.packages("maps")
library(tidyverse)
library(tidymodels)
library(olsrr)
library(conflicted)
library(scales)
library(maps)
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
The data that we’re examining for the question is an Updated California Housing Prices dataset, which contains information gathered from the US Census Bureau from 2009-2020. The data contains 11 variables: name, total_population, total_households, average_household_income, average_house_age, total_rooms, total_bedrooms, median_house_value, longitude, latitude, and ocean_proximity. The original data card can be found on kaggle.com.
We import the data using read_csv(). The data can be found under the data directory of this repository.
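The import itself can be sketched as follows, using the same path that appears when the data is re-imported later in the analysis:

```r
library(tidyverse)

# Import the housing data from the data directory of this repository
ca_housing <- read_csv("data/california_housing_updated.csv")
```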
Here are a few summary statistics of the original data import.
summary(ca_housing)
## name total_population total_households average_household_income
## Length:25591 Min. : 0 Min. : 0.0 Min. : 0
## Class :character 1st Qu.: 1009 1st Qu.: 348.0 1st Qu.: 66178
## Mode :character Median : 1421 Median : 475.0 Median : 89021
## Mean : 1537 Mean : 512.0 Mean : 92005
## 3rd Qu.: 1940 3rd Qu.: 640.5 3rd Qu.:115610
## Max. :39373 Max. :7133.0 Max. :212500
##
## average_house_age total_rooms total_bedrooms median_house_value
## Min. : 0.00 Min. :-666666666 Min. : 0.0 Min. :-666666666
## 1st Qu.:21.60 1st Qu.: 1908 1st Qu.: 376.0 1st Qu.: 298400
## Median :26.90 Median : 2639 Median : 514.0 Median : 481400
## Mean :26.51 Mean : -3388202 Mean : 555.3 Mean : -62367652
## 3rd Qu.:32.10 3rd Qu.: 3634 3rd Qu.: 694.0 3rd Qu.: 735150
## Max. :45.50 Max. : 46497 Max. :7960.0 Max. : 2000001
## NA's :34
## longitude latitude ocean_proximity
## Min. :-124.2 Min. :32.55 Length:25591
## 1st Qu.:-121.6 1st Qu.:33.93 Class :character
## Median :-118.6 Median :34.26 Mode :character
## Mean :-119.5 Mean :35.54
## 3rd Qu.:-118.0 3rd Qu.:37.68
## Max. :-114.3 Max. :41.75
##
ca_housing
## # A tibble: 25,591 × 11
## name total_population total_households average_household_income
## <chr> <dbl> <dbl> <dbl>
## 1 Alameda County 1713 644 150520.
## 2 Alameda County 1322 630 167222.
## 3 Alameda County 940 417 171948.
## 4 Alameda County 1043 413 150914.
## 5 Alameda County 1206 423 167677.
## 6 Alameda County 1318 775 122429.
## 7 Alameda County 1076 590 79110.
## 8 Alameda County 1458 631 143221.
## 9 Alameda County 1544 706 124129.
## 10 Alameda County 885 443 118708.
## # ℹ 25,581 more rows
## # ℹ 7 more variables: average_house_age <dbl>, total_rooms <dbl>,
## # total_bedrooms <dbl>, median_house_value <dbl>, longitude <dbl>,
## # latitude <dbl>, ocean_proximity <chr>
# Dropping all the rows that contain the value -666666666
ca_housing <- ca_housing[!apply(ca_housing == -666666666, 1, any), ]
skimr::skim(ca_housing)
Name | ca_housing |
Number of rows | 23175 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1 | 11 | 22 | 0 | 58 | 0 |
ocean_proximity | 0 | 1 | 6 | 10 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_population | 0 | 1 | 1558.52 | 712.18 | 26.00 | 1035.00 | 1447.00 | 1962.00 | 9793.00 | ▇▂▁▁▁ |
total_households | 0 | 1 | 518.68 | 225.45 | 18.00 | 355.00 | 480.00 | 645.50 | 3465.00 | ▇▂▁▁▁ |
average_household_income | 0 | 1 | 95206.56 | 32698.05 | 12563.83 | 69637.62 | 92172.39 | 117990.33 | 212500.00 | ▂▇▇▂▁ |
average_house_age | 0 | 1 | 26.70 | 7.56 | 3.50 | 21.70 | 27.10 | 32.20 | 45.50 | ▁▃▇▇▂ |
total_rooms | 0 | 1 | 3018.50 | 1405.95 | 73.00 | 2013.00 | 2747.00 | 3744.00 | 16777.00 | ▇▃▁▁▁ |
total_bedrooms | 0 | 1 | 561.24 | 244.37 | 18.00 | 382.00 | 519.00 | 698.00 | 3577.00 | ▇▂▁▁▁ |
median_house_value | 0 | 1 | 630683.07 | 410368.30 | 9999.00 | 355700.00 | 532300.00 | 785650.00 | 2000001.00 | ▆▇▃▁▁ |
longitude | 0 | 1 | -119.51 | 1.97 | -124.16 | -121.72 | -118.66 | -117.95 | -114.35 | ▂▆▅▇▁ |
latitude | 0 | 1 | 35.56 | 2.08 | 32.55 | 33.93 | 34.27 | 37.69 | 41.75 | ▇▁▅▂▁ |
Since some of the columns in the dataset (e.g. average_household_income and median_house_value) contain values less than zero, we clean the dataset of those values, as they would throw off any EDA and analysis performed.
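A roughly equivalent tidyverse formulation of the same cleaning step, for readers who prefer dplyr verbs over base-R subsetting (note that, unlike the base-R version, filter() also drops rows containing NA values):

```r
library(dplyr)

# Drop any row where any column holds the sentinel value -666666666
ca_housing <- ca_housing |>
  filter(if_all(everything(), ~ .x != -666666666))
```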
mean_income <- mean(ca_housing$average_household_income, na.rm = TRUE)
avg_income_distribution <- ggplot(ca_housing, aes(x = average_household_income)) +
geom_histogram(fill="#a6bddb", color="#e9ecef", alpha=0.9) +
geom_vline(aes(xintercept = mean_income),
color = "black", linetype = "dashed", size = 1, alpha = 0.8) +
labs(title = "Distribution of Average Household Income",
x = "Average Household Income",
caption = paste("Mean Income:", round(mean_income, 2))) +
theme_minimal()
show(avg_income_distribution)
In this visualization, we see that the distribution of average household income is roughly symmetric, with a slight right skew. The mean household income across all data points in the dataset is $95,206.56.
mean_house <- mean(ca_housing$median_house_value, na.rm = TRUE)
avg_value_distribution <- ggplot(ca_housing, aes(x = median_house_value)) +
geom_histogram(fill="#a6bddb", color="#e9ecef", alpha=0.9) +
geom_vline(aes(xintercept = mean_house),
color = "black", linetype = "dashed", size = 1, alpha = 0.8) +
labs(title = "Distribution of Median House Value",
x = "Median House Value",
caption = paste("Mean of Median House Value:", round(mean_house, 2))) +
theme_minimal()
show(avg_value_distribution)
In this visualization, we see that the median house value distribution has a heavy right skew, with a cluster of values capped at $2,000,001. The overall mean of the median house values is $630,683.07, showing that houses in California tend to fall on the pricier side.
mean_house <- mean(log(ca_housing$median_house_value), na.rm = TRUE)
avg_value_distribution <- ggplot(ca_housing, aes(x = log(median_house_value))) +
geom_histogram(fill="#a6bddb", color="#e9ecef", alpha=0.9) +
geom_vline(aes(xintercept = mean_house),
color = "black", linetype = "dashed", size = 1, alpha = 0.8) +
labs(title = "Distribution of Log Median House Value",
x = "Log Median House Value",
caption = paste("Mean Log Median House Value:", round(mean_house, 2))) +
theme_minimal()
show(avg_value_distribution)
When we log-transform the median house value, we get a much more normal distribution, which is easier to work with.
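Because a model fit on the log scale produces predictions in log units, those predictions can be converted back to dollars with exp(). A minimal illustration:

```r
# Predictions from a model fit to log(median_house_value) are on the
# log scale; exp() converts them back to dollars.
log_pred <- 13.2           # an example fitted value on the log scale
dollars  <- exp(log_pred)  # approximately 540,365
```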
ocean_prox_avg_house_income <- ggplot(ca_housing, aes(x = ocean_proximity, y = average_household_income, fill = ocean_proximity)) +
geom_boxplot() +
labs(title = "Ocean Proximity and Average Household Income",
y = "Average household income",
fill = "Ocean proximity") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank())
show(ocean_prox_avg_house_income)
In this visualization, we can see that average household incomes are higher for areas closer to the ocean, though both Inland and <1H Ocean still remain relatively high. It should be noted that Inland has a significant number of outliers, suggesting there may be a skew in the distribution across the ocean proximity categories of California housing.
To get a better view of the distances of Ocean Proximity and Average Household Income, here are the data points plotted onto a map of California:
ca_map <- map_data("state") |>
subset(region == "california")
ggplot() +
geom_polygon(data = ca_map, aes(x = long, y = lat, group = group),fill = "white",color = "gray50") +
geom_point(data = ca_housing,
aes(x = longitude, y = latitude,color = average_household_income,size=total_households , alpha=total_population)) +scale_size_area(max_size = 1) +
scale_color_viridis_c(
name = "Average Household Income",
labels = dollar_format(),
option = "plasma"
) +
labs(
title = "Average Household Incomes in California",
x = "Longitude",
y = "Latitude",
size = 'Total Households',
alpha = 'Total Population'
) +
theme_minimal() +
coord_fixed(
xlim = c(-125, -113),
ylim = c(32, 43),
ratio = 1.3
)
As shown above, the farther inland one goes (in terms of longitude), the lower the household income, whereas the closer one is to the coast, the higher the average household income. This is because most of California’s high-paying jobs are situated near the coast: Silicon Valley, Hollywood, and the growing biomedical industry down in San Diego.
Additionally, here is a plot of median house values in California:
ggplot() +
geom_polygon(data = ca_map, aes(x = long, y = lat, group = group),fill = "white",color = "gray50") +
geom_point(data = ca_housing,
aes(x = longitude, y = latitude,color = median_house_value,size=total_households , alpha=total_population)) +scale_size_area(max_size = 1) +
scale_color_viridis_c(
name = "Median House Value",
labels = dollar_format(),
option = "plasma"
) +
labs(
title = "Median House Values in California",
x = "Longitude",
y = "Latitude",
size = 'Total Households',
alpha = 'Total Population'
) +
theme_minimal() +
coord_fixed(
xlim = c(-125, -113),
ylim = c(32, 43),
ratio = 1.3
)
Similar to the mapping of average household incomes, the median house values increase significantly closer to the coast. This suggests that median house value and average household income are correlated, given their similar geographic patterns.
ca_housing_cor <- cor(ca_housing |> dplyr::select_if(is.numeric))
corrplot::corrplot(ca_housing_cor, tl.cex = 0.5)
The correlation plot above further cements this, showing a strong correlation between average_household_income and median_house_value, suggesting that the more someone makes annually, the pricier their house may be.
top_counties_by_house_value <- ca_housing |>
filter(median_house_value >= 0) |>
group_by(name) |>
summarise(
total_county_value = sum(median_house_value, na.rm = TRUE),
average_county_house_value = mean(median_house_value, na.rm = TRUE),
total_county_houses = n()
) |>
arrange(desc(average_county_house_value)) |>
slice_head(n = 10)
# Print the results
print(top_counties_by_house_value)
## # A tibble: 10 × 4
## name total_county_value average_county_house…¹ total_county_houses
## <chr> <dbl> <dbl> <int>
## 1 San Mateo Coun… 605041480 1281868. 472
## 2 San Francisco … 705052059 1272657. 554
## 3 Marin County 202481021 1191065. 170
## 4 Santa Clara Co… 1261735439 1151218. 1096
## 5 Alameda County 846174813 819937. 1032
## 6 Santa Cruz Cou… 109165500 785363. 139
## 7 Santa Barbara … 207231515 770377. 269
## 8 Orange County 1359221055 753866. 1803
## 9 Napa County 77606403 739109. 105
## 10 Contra Costa C… 476905204 701331. 680
## # ℹ abbreviated name: ¹average_county_house_value
top_counties_by_house_value |>
ggplot(aes(x = reorder(name, average_county_house_value),
y = average_county_house_value,
fill = name)) +
geom_bar(stat = "identity") +
geom_text(aes(label = dollar(average_county_house_value)),
hjust = -0.1,
size = 3) +
coord_flip() +
labs(
title = "Top 10 Counties with Highest Housing Value",
x = "County",
y = "Average House Value"
) +
theme_classic() +
theme(legend.position = "none",
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank()) +
expand_limits(y = max(top_counties_by_house_value$average_county_house_value) * 1.1)
In the visualization, the top 10 counties with the highest average house values are displayed; based on this display, San Mateo County has the highest average value, at approximately $1,281,868. On the California county map, San Mateo County covers the Peninsula just south of San Francisco; some of the cities in the county are Daly City, South San Francisco, Redwood City, and East Palo Alto.
In our analysis, we utilize both linear regression and random forest algorithms, comparing the results to determine which model more accurately predicts the price of a house in California with our provided dataset.
# Refactoring county name and ocean_proximity
ca_housing <- ca_housing |>
mutate(across(c(name, ocean_proximity), as.factor))
ca_housing <- ca_housing |>
mutate(median_house_value = log(median_house_value))
#splitting the data into training and testing sets
ca_split <- rsample::initial_split(data = ca_housing, prop = 2/3)
train_ca <- rsample::training(ca_split)
test_ca <- rsample::testing(ca_split)
# Recipe for linear model
housing_rec <- recipe(train_ca) |>
update_role(everything(), new_role = "predictor") |>
update_role(median_house_value, new_role = "outcome") |>
step_dummy(name, ocean_proximity, one_hot = TRUE) |>
step_corr(all_numeric()) |>
step_nzv(all_numeric())
First, the variables name and ocean_proximity are refactored for the linear model, as these characteristics affect a house’s value. We then split the dataset into training and testing sets and apply a recipe to the housing training set. This recipe is loosely based on the recipe from CS02.
prepped_rec <- prep(housing_rec, verbose = TRUE, retain = TRUE)
## oper 1 step dummy [training]
## oper 2 step corr [training]
## oper 3 step nzv [training]
## The retained training set is ~ 1.77 Mb in memory.
baked_train <- bake(prepped_rec, new_data = NULL)
baked_test_ca <- bake(prepped_rec, new_data = test_ca)
glimpse(baked_test_ca)
## Rows: 7,725
## Columns: 15
## $ total_population <dbl> 1713, 1322, 1206, 1750, 2678, 748, 595, 150…
## $ total_households <dbl> 644, 630, 423, 591, 1171, 241, 231, 758, 41…
## $ average_household_income <dbl> 150520.19, 167222.22, 167677.30, 149141.29,…
## $ average_house_age <dbl> 23.5, 16.7, 36.9, 37.4, 39.7, 40.1, 36.4, 2…
## $ total_rooms <dbl> 5300, 4535, 2583, 3084, 6155, 1334, 1404, 3…
## $ median_house_value <dbl> 14.30599, 13.93489, 14.18528, 14.03641, 13.…
## $ longitude <dbl> -122.2344, -122.2222, -122.2480, -122.2587,…
## $ name_Los.Angeles.County <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ name_Orange.County <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ name_Riverside.County <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ name_San.Diego.County <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ocean_proximity_X.1H.OCEAN <dbl> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0…
## $ ocean_proximity_INLAND <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ocean_proximity_NEAR.BAY <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1…
## $ ocean_proximity_NEAR.OCEAN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Here, we bake the training dataset before fitting it to the linear model.
lm_CA_model <- parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
set_mode("regression")
CA_wflow <- workflows::workflow() |>
workflows::add_recipe(housing_rec) |>
workflows::add_model(lm_CA_model)
CA_wflow_fit <- fit(CA_wflow, data = train_ca)
wflowoutput <- CA_wflow_fit |>
extract_fit_parsnip() |>
broom::tidy()
wflowoutput
## # A tibble: 15 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.03 0.295 17.0 1.97e- 64
## 2 total_population -0.000163 0.00000736 -22.1 2.12e-106
## 3 total_households 0.000602 0.0000299 20.1 1.13e- 88
## 4 average_household_income 0.0000132 0.000000117 112. 0
## 5 average_house_age 0.0133 0.000494 27.0 5.87e-157
## 6 total_rooms 0.000000190 0.00000467 0.0407 9.68e- 1
## 7 longitude -0.0534 0.00248 -21.5 3.02e-101
## 8 name_Los.Angeles.County 0.419 0.0101 41.6 0
## 9 name_Orange.County 0.320 0.0142 22.5 1.25e-110
## 10 name_Riverside.County 0.151 0.0159 9.53 1.74e- 21
## 11 name_San.Diego.County 0.260 0.0143 18.2 6.04e- 73
## 12 ocean_proximity_X.1H.OCEAN -0.0264 0.0103 -2.56 1.04e- 2
## 13 ocean_proximity_INLAND -0.148 0.00991 -15.0 2.50e- 50
## 14 ocean_proximity_NEAR.BAY 0.0988 0.0150 6.56 5.44e- 11
## 15 ocean_proximity_NEAR.OCEAN NA NA NA NA
CA_wflow_fit |>
extract_fit_parsnip() |>
vip::vip(num_features = 10)
After preparing and baking our training dataset with the previous recipe, we set our model to linear regression and fit our dataset to it. We then extract the top 10 features that are most important in determining house value. According to the output, the most important feature is average_household_income, which makes sense given that the more someone makes, the pricier their house may be.
wf_fit <- CA_wflow_fit |>
extract_fit_parsnip()
wf_fitted_values <-
broom::augment(wf_fit[["fit"]], data = baked_train) |>
select(median_house_value, .fitted:.std.resid)
yardstick::metrics(wf_fitted_values,
truth = median_house_value, estimate = .fitted)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.379
## 2 rsq standard 0.655
## 3 mae standard 0.275
ggplot(wf_fitted_values, aes(x = median_house_value, y = .fitted)) +
geom_point(alpha = 0.5, color = "blue") + # Scatter plot points
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") + # Perfect prediction line
labs(
title = "Predicted vs Actual Values",
x = "Actual Median House Value (log-transformed)",
y = "Predicted Median House Value"
) +
theme_minimal()
Above, we look at the model’s performance. The Root Mean Squared Error (RMSE), a metric that gauges the average difference between observed values and the model’s predictions, comes to around 0.379; note that this is on the log scale, since median_house_value was log-transformed. We suspect part of this error stems from the right-skewed distribution of the median house values, with many values capped at $2,000,001. To alleviate this, we next restrict the model to prices at or below the 97th percentile (\(<\$2{,}000{,}000\)). Additionally, the R-squared value for this model comes out to \(0.655\), or about \(66\%\).
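For reference, the root mean squared error reported by yardstick::metrics() is defined as:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]

where \(y_i\) is the observed (log-transformed) median house value, \(\hat{y}_i\) is the model’s fitted value, and \(n\) is the number of training observations.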
# Re-filtering the data to only consider less affluent housing prices.
price_threshold <- quantile(ca_housing$median_house_value, 0.97)
ca_housing <- ca_housing %>%
filter(median_house_value <= price_threshold)
#Log transforming to get a better distribution
#ca_housing <- ca_housing |>
#mutate(median_house_value = log(median_house_value))
# Re-splitting the data into training and testing sets
ca_split <- rsample::initial_split(data = ca_housing, prop = 2/3)
train_ca <- rsample::training(ca_split)
test_ca <- rsample::testing(ca_split)
housing_rec <- recipe(train_ca) |>
update_role(everything(), new_role = "predictor") |>
update_role(median_house_value, new_role = "outcome") |>
step_dummy(name, ocean_proximity, one_hot = TRUE) |>
step_corr(all_numeric()) |>
step_nzv(all_numeric())
Here, we re-run the model, applying the previous explanation of looking at all housing values (at most) in the 97th percentile.
# Prepping and training the model
prepped_rec <- prep(housing_rec, verbose = TRUE, retain = TRUE)
## oper 1 step dummy [training]
## oper 2 step corr [training]
## oper 3 step nzv [training]
## The retained training set is ~ 1.83 Mb in memory.
baked_train <- bake(prepped_rec, new_data = NULL)
baked_test_ca <- bake(prepped_rec, new_data = test_ca)
lm_CA_model <- parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
set_mode("regression")
CA_wflow <- workflows::workflow() |>
workflows::add_recipe(housing_rec) |>
workflows::add_model(lm_CA_model)
CA_wflow_fit <- fit(CA_wflow, data = train_ca)
wflowoutput <- CA_wflow_fit |>
extract_fit_parsnip() |>
broom::tidy()
wflowoutput
## # A tibble: 16 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.85 0.326 14.9 1.08e- 49
## 2 total_population -0.000137 0.00000743 -18.5 1.56e- 75
## 3 total_households 0.000592 0.0000305 19.4 1.42e- 82
## 4 average_household_income 0.0000127 0.000000120 106. 0
## 5 average_house_age 0.0126 0.000498 25.3 7.90e-138
## 6 total_rooms -0.00000883 0.00000480 -1.84 6.60e- 2
## 7 longitude -0.0551 0.00273 -20.2 1.41e- 89
## 8 name_Los.Angeles.County 0.412 0.0110 37.4 1.88e-292
## 9 name_Orange.County 0.321 0.0148 21.6 5.09e-102
## 10 name_Riverside.County 0.146 0.0168 8.69 4.11e- 18
## 11 name_San.Bernardino.County 0.00992 0.0161 0.614 5.39e- 1
## 12 name_San.Diego.County 0.275 0.0151 18.2 4.57e- 73
## 13 ocean_proximity_X.1H.OCEAN -0.0185 0.0104 -1.79 7.35e- 2
## 14 ocean_proximity_INLAND -0.142 0.00999 -14.2 1.82e- 45
## 15 ocean_proximity_NEAR.BAY 0.0917 0.0156 5.87 4.36e- 9
## 16 ocean_proximity_NEAR.OCEAN NA NA NA NA
CA_wflow_fit |>
extract_fit_parsnip() |>
vip::vip(num_features = 10)
# Looking at the results from excluding anything above the 97th percentile
wf_fit <- CA_wflow_fit |>
extract_fit_parsnip()
wf_fitted_values <-
broom::augment(wf_fit[["fit"]], data = baked_train) |>
select(median_house_value, .fitted:.std.resid)
yardstick::metrics(wf_fitted_values, truth = median_house_value, estimate = .fitted)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.377
## 2 rsq standard 0.621
## 3 mae standard 0.269
Like the original model, we bake and train the model excluding everything above the 97th percentile and examine the results. This model has a slightly lower RMSE than the previous one, but a somewhat lower R-squared value (0.621 versus 0.655). As in the original linear regression model, the most important feature remains average_household_income.
In order to make a comparison with the linear regression model above, we apply our dataset to a Random Forest algorithm; here, however, we include the entire dataset (including values above the 97th percentile) to gauge general accuracy.
ca_housing <- read_csv("data/california_housing_updated.csv")
ca_housing <- ca_housing[!apply(ca_housing == -666666666, 1, any), ]
# Refactoring county name and ocean_proximity
ca_housing <- ca_housing |>
mutate(across(c(name, ocean_proximity), as.factor))
# Splitting the data into training and testing sets
ca_split <- rsample::initial_split(data = ca_housing, prop = 2/3)
train_ca <- rsample::training(ca_split)
test_ca <- rsample::testing(ca_split)
RF_rec <- recipe(train_ca) |>
update_role(everything(), new_role = "predictor") |>
update_role(median_house_value, new_role = "outcome") |>
step_string2factor("ocean_proximity") |>
step_dummy(all_nominal_predictors()) |>
# Handle correlation
step_corr(all_numeric(), threshold = 0.8) |>
# Remove near-zero variance
step_nzv(all_numeric()) |>
# Normalize numeric predictors
step_normalize(all_numeric_predictors())
# install.packages("randomForest")
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
set_engine("randomForest") |>
set_mode("regression")
tune_RF_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
# Using Cross-fold validation for better model performance.
vfold_ca <- rsample::vfold_cv(data = train_ca, v = 4)
RF_tune_wflow <- workflows::workflow() |>
workflows::add_recipe(RF_rec) |>
workflows::add_model(tune_RF_model)
n_cores <- parallel::detectCores()
doParallel::registerDoParallel(cores = n_cores)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples=vfold_ca, grid = 20)
tune_RF_results
## # Tuning results
## # 4-fold cross-validation
## # A tibble: 4 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [11587/3863]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [11587/3863]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [11588/3862]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [11588/3862]> Fold4 <tibble [40 × 6]> <tibble [1 × 3]>
##
## There were issues with some computations:
##
## - Warning(s) x1: the standard deviation is zero, The correlation matrix has missin...
##
## Run `show_notes(.Last.tune.result)` for more information.
Here, we use a setup similar to the Linear Regression model’s, swapping in the Random Forest algorithm and adjusting the recipe to suit the Random Forest implementation.
tune_RF_results |>
collect_metrics()
## # A tibble: 40 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 39 rmse standard 226228. 4 4928. Preprocessor1_Mod…
## 2 4 39 rsq standard 0.703 4 0.00593 Preprocessor1_Mod…
## 3 6 15 rmse standard 222977. 4 5178. Preprocessor1_Mod…
## 4 6 15 rsq standard 0.709 4 0.00729 Preprocessor1_Mod…
## 5 8 31 rmse standard 225946. 4 5446. Preprocessor1_Mod…
## 6 8 31 rsq standard 0.701 4 0.00806 Preprocessor1_Mod…
## 7 3 19 rmse standard 229811. 4 5144. Preprocessor1_Mod…
## 8 3 19 rsq standard 0.699 4 0.00579 Preprocessor1_Mod…
## 9 7 28 rmse standard 224745. 4 5230. Preprocessor1_Mod…
## 10 7 28 rsq standard 0.704 4 0.00763 Preprocessor1_Mod…
## # ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 7 rmse standard 222035. 4 5078. Preprocessor1_Model20
After running the Random Forest model, we find that the best cross-validated RMSE (with \(mtry = 5\) and \(min\_n = 7\)) is approximately \(222{,}035\). Note that this RMSE is in dollars, since this model was fit on the untransformed median_house_value.
tuned_RF_values <- select_best(tune_RF_results)
tuned_RF_values
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 5 7 Preprocessor1_Model20
# specify best combination from tune in workflow
RF_tuned_wflow <-RF_tune_wflow |>
tune::finalize_workflow(tuned_RF_values)
# fit model with those parameters on train AND test
overallfit <- RF_tuned_wflow |>
tune::last_fit(ca_split)
collect_metrics(overallfit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 224931. Preprocessor1_Model1
## 2 rsq standard 0.692 Preprocessor1_Model1
test_predictions <- collect_predictions(overallfit)
The test-set RMSE for this model is approximately \(224{,}931\), with an R-squared value of \(0.692\), which is higher than the Linear Regression model’s R-squared of \(0.621\).
overallfit |>
extract_fit_parsnip() |>
vip::vip(num_features = 10)
# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_ca)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_ca)
# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_ca) |>
bind_cols(train_ca |> select(median_house_value, name, ocean_proximity))
# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_ca) |>
bind_cols(test_ca |> select(median_house_value, name, ocean_proximity))
After running the Random Forest model on both the training and testing data, and selecting median_house_value, name, and ocean_proximity, we find the most important feature continues to be average_household_income, followed by longitude and average_house_age.
To visualize our models’ performance, we plotted the actual values of median house value for our data in the test set against each model’s predicted values.
wf_fitted_values |>
ggplot(aes(x = median_house_value, y = .fitted)) +
geom_point(alpha = 0.5, color = "#2C3E50") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
labs(
title = "Predicted vs. Actual Median House Values",
subtitle = "Perfect predictions would fall on the dashed red line",
x = "Actual Median House Value (log-transformed)",
y = "Predicted Median House Value",
caption = "Model used is linear regression"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, color = "gray40"),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_line(color = "gray95"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12)
)
yardstick::metrics(wf_fitted_values, truth = median_house_value, estimate = .fitted)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.377
## 2 rsq standard 0.621
## 3 mae standard 0.269
To reiterate, the RMSE of this model is \(0.377\) (on the log scale), whilst its R-squared value is \(0.621\).
This is our graph of actual values of median house value versus the predicted values in our RF model.
ggplot(values_pred_test, aes(x=median_house_value, y=.pred))+
geom_point(alpha = 0.5, color = "#2C3E50") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
labs(
title = "Predicted vs. Actual Median House Value for Random Forest Modelling",
subtitle = "Perfect predictions would fall on the dashed red line",
x = "Actual Values ($)",
y = "Predicted Values ($)",
caption = "Model used is Random Forest"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, color = "gray40"),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_line(color = "gray95"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12)
)
collect_metrics(overallfit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 224931. Preprocessor1_Model1
## 2 rsq standard 0.692 Preprocessor1_Model1
The RMSE for the Random Forest model is \(224{,}931\) (in dollars), whilst its R-squared value is \(0.692\).
In comparing both models, we find that Random Forest performs better than Linear Regression based on its higher R-squared value of \(0.692\) (compared to LR’s \(0.621\)). When comparing the predicted and actual house values visually for both models, the Random Forest’s data points sit closer to the “perfect predictions” line than Linear Regression’s, as the latter clusters further along the x-axis; however, the Random Forest fails to properly predict the values at the \(\$2{,}000{,}001\) cap, as those capped values act as outliers the model does not account for. This limitation could be attributed to the inherent nature of ensemble tree-based models like Random Forest, which struggle with extreme outliers due to their decision tree structure. Tree-based models tend to create splits based on the majority of the data, making them less sensitive to extreme values far from the central distribution. The linear model appears more uniform because it attempts a single global linear relationship across all data points, whereas the Random Forest builds many decision trees that can fragment the prediction space. The two RMSE values, however, are not directly comparable: the Random Forest’s \(224{,}931\) is in dollars, whereas the Linear Regression’s \(0.377\) is on the log scale, since that model was fit to log-transformed house values.
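Because the linear model was fit to log-transformed values, its reported RMSE is in log units while the Random Forest’s is in dollars. One way to compare the models on a common scale is to back-transform the linear model’s fitted values before computing RMSE; a minimal sketch, assuming the wf_fitted_values tibble from the earlier metrics step is available:

```r
library(dplyr)
library(yardstick)

# Back-transform the log-scale actuals and fitted values to dollars,
# then compute a dollar-scale RMSE for an apples-to-apples comparison
wf_fitted_values |>
  mutate(
    actual_dollars = exp(median_house_value),
    pred_dollars   = exp(.fitted)
  ) |>
  rmse(truth = actual_dollars, estimate = pred_dollars)
```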
Overall, the important features of each model remain relatively consistent across the board, with average_household_income remaining the dominant feature for both Linear Regression and Random Forest. For both models, this dominance makes sense: individuals with a high household income typically have pricier living conditions (and vice versa). This is visible in the EDA, where both median_housing_value and average_household_income are concentrated within certain areas such as the Bay Area and San Diego. Moreover, ocean_proximity also plays a significant part in both Linear Regression and Random Forest, as most “beachfront” and “bay-front” properties are pricier than most houses, given their prime locations in the state.
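The feature-importance comparison above can be reproduced directly from the fitted workflow. A minimal sketch using the vip package, assuming the final fitted object is overallfit (as in the collect_metrics() call above) and that the ranger engine was specified with importance = "impurity":

```r
# Sketch: pull the fitted parsnip model out of the last_fit() result and
# plot its top predictors. For the ranger engine, vip() requires that the
# model spec set importance = "impurity" (or "permutation") when fitted.
library(tidymodels)
library(vip)

overallfit %>%
  extract_fit_parsnip() %>%   # underlying ranger (or lm) fit
  vip(num_features = 10)      # bar chart of the 10 most important features
```

The same two lines work for the Linear Regression workflow, where importance is derived from the magnitude of the standardized coefficients.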
In our analysis of the California Housing dataset, one limiting factor that arose was the presence of several outliers, specifically within the median_housing_value variable. As mentioned previously, the Random Forest (and initial Linear Regression) model failed to properly predict any values higher than \(2,000,000\) (the 97th percentile). Since these values could not be normalized, their predictions become less reliable, likely reflecting the models’ inability to capture the extreme variation in high-value property markets, such as premium coastal areas or tech-hub regions like Silicon Valley.
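One way to soften this limitation is to winsorize or log-transform the target before modeling. The sketch below assumes the raw data live in a data frame named housing; that name and the new column names are illustrative, not part of the original pipeline:

```r
# Sketch: two common mitigations for the heavy right tail in the target.
# `housing` and the derived column names are placeholders.
library(dplyr)

cap <- quantile(housing$median_housing_value, 0.97, na.rm = TRUE)

housing_adj <- housing %>%
  mutate(
    value_capped = pmin(median_housing_value, cap),  # winsorize at the 97th percentile
    value_log    = log(median_housing_value)         # compress the upper tail
  )
```

Modeling the log-transformed value and exponentiating the predictions often improves fit in the tail, at the cost of the RMSE being optimized on the log scale rather than in dollars.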
The study successfully developed two distinct predictive models, Linear Regression and Random Forest, to estimate median house values in California. The Linear Regression model achieved a root mean square error (RMSE) of approximately \(211,000\), while the Random Forest’s RMSE was somewhat higher at roughly \(225,000\); however, the Random Forest distinguished itself with a notably higher R-squared value of \(0.69\), compared to the Linear Regression model’s \(0.61\). This higher R-squared indicates that the Random Forest model explains about 69% of the variance in median house values, suggesting a more nuanced and comprehensive capture of the complex relationships within the California housing market.
The model performances highlight the dataset’s complexity, particularly in capturing high-value property variation (as shown by the several outliers). Although the Random Forest model showed superior explanatory power, both models consistently identified key predictors such as average_household_income and ocean_proximity as critical factors influencing house values. These findings not only provide insights into California’s housing market dynamics, but also demonstrate the potential of machine learning techniques in real estate price prediction.
Future research could focus on addressing the models’ limitations, particularly their challenges with high-value property outliers, potentially through advanced techniques such as gradient-boosted trees (e.g., XGBoost) or Gaussian mixture models (GMMs).
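As a sketch of the XGBoost direction, the existing tidymodels workflow could swap the forest for a boosted-tree specification. Here housing_recipe stands in for the preprocessing recipe used by the earlier models, and the hyperparameter values are illustrative starting points, not tuned results:

```r
# Sketch: gradient boosting in place of the random forest.
# `housing_recipe` is a placeholder for the recipe defined earlier.
library(tidymodels)

xgb_spec <- boost_tree(trees = 500, tree_depth = 6, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_wf <- workflow() %>%
  add_recipe(housing_recipe) %>%
  add_model(xgb_spec)

# Fit and evaluate exactly as before, e.g. last_fit(xgb_wf, data_split)
```

Because boosting fits trees sequentially to the residuals of earlier trees, it can sometimes track the upper tail more closely than a random forest, though it still cannot extrapolate beyond the training range.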
Tikkanen, A. (2024, January 17). List of U.S. states by population. Encyclopedia Britannica. https://www.britannica.com/topic/largest-U-S-state-by-population
California’s Population is Increasing. (2024, April 30). California Governor; CAWeb Publishing Service. https://www.gov.ca.gov/2024/04/30/californias-population-is-increasing/
Zillow. (2024). San Francisco, CA Housing Market. https://www.zillow.com/home-values/20330/san-francisco-ca/
Bohn, S., & Duan, J. (2024, October 14). California’s Economy. Public Policy Institute of California. https://www.ppic.org/publication/californias-economy/
Nicolaides, B. M. (2023). The Historic Suburban Landscapes of Los Angeles. Oxford University Press eBooks, 19–68. https://doi.org/10.1093/oso/9780197578308.003.0002