Air pollution is a major global health concern, causing 1 in 9 deaths worldwide in 20181. Across the various types of pollution, particulate matter is the most lethal, ranking as the sixth leading cause of death worldwide2. Given its significant impact on human health, determining the amount of particulate matter pollution across the United States is crucial for identifying high-risk areas and learning more about its effects on human health. Factors like poverty and education levels are commonly associated with unequal pollution exposure, but their influence may vary within different geographical contexts. Urban and rural areas differ in a several aspects, such as population density, land use, and pollution sources which could impact the relationship of these factors on exposure levels. In this case study, we investigate to what accuracy are we able to predict annual air pollution concentrations within the US using machine learning methods. Additionally, we investigate whether poverty and education impact pollution exposure differently in urban and rural areas.
With what accuracy can we predict US annual average air pollution concentrations?
Does the impact of poverty and education levels on pollution exposure differ between urban and rural areas?
Here, we import packages tidyverse
,
tidymodels
, and conflicted
to clean, model,
and predict the data.
library(tidyverse)
library(tidymodels)
library(conflicted)
The data that we’re examining consists of several monitors across the
U.S. that examine air pollution within the area. This is denoted by
variable pm
. A detailed list of the columns within
pm
are listed below.
id
: Monitor numberfips
: Federal information processing standard number
for the county where the monitor is located. Alaska and Hawaii are
excluded.Lat
: Latitude of monitorLong
: Longitude of monitorstate
: State where monitor is locatedcounty
: County where monitor is locatedcity
: City where monitor is locatedCMAQ
: Estimated values of air pollution from a
computational model called Community Multiscale Air Quality (CMAQ)zcta
: Zip Code Tabulation Area where the monitor is
locatedzcta_area
: Land area of the zip code area in meters
squaredzcta_pop
: Population of ZCTA where monitor is
locatedimp_a500
: Impervious surface measure within a circle
with a radius of 500 meters around the monitorimp_a1000
: Impervious surface measure within a circle
with a radius of 1000 meters around the monitorimp_a10000
: Impervious surface measure within a circle
with a radius of 10000 meters around the monitorimp_a15000
: Impervious surface measure within a circle
with a radius of 15000 meters around the monitorcounty_area
: Land area of the county of the monitor in
meters squaredcounty_pop
: Population of the county of the
monitorLog_dist_to_prisec
: Natural log distance to a primary
or secondary road from the monitor highway or major roadlog_pri_length_5000
: Count of primary road length in
meters in a circle with a radius of 5000 meters around the monitor
(highways only)log_pri_length_10000
: Count of primary road length in
meters in a circle with a radius of 10000 meters around the monitor
(highways only)log_pri_length_15000
: Count of primary road length in
meters in a circle with a radius of 15000 meters around the monitor
(highways only)log_pri_length_25000
: Count of primary road length in
meters in a circle with a radius of 25000 meters around the monitor
(highways only)log_prisec_length_500
: Count of primary and secondary
road length in meters in a circle with a radius of 500 meters around the
monitor (highways + secondary roads)log_prisec_length_1000
: Count of primary and secondary
road length in meters in a circle with a radius of 1000 meters around
the monitor (highways + secondary roads)log_prisec_length_5000
: Count of primary and secondary
road length in meters in a circle with a radius of 5000 meters around
the monitor (highways + secondary roads)log_prisec_length_10000
: Count of primary and secondary
road length in meters in a circle with a radius of 10000 meters around
the monitor (highways + secondary roads)log_prisec_length_15000
: Count of primary and secondary
road length in meters in a circle with a radius of 15000 meters around
the monitor (highways + secondary roads)log_prisec_length_25000
: Count of primary and secondary
road length in meters in a circle with a radius of 25000 meters around
the monitor (highways + secondary roads)log_nei_2008_pm25_sum_10000
: Tons of emissions from
major sources data base (annual data) sum of all sources within a circle
with a radius of 10000 meters of distance around the monitor (Natural
log)log_nei_2008_pm25_sum_15000
: Tons of emissions from
major sources data base (annual data) sum of all sources within a circle
with a radius of 15000 meters of distance around the monitor (Natural
log)log_nei_2008_pm25_sum_25000
: Tons of emissions from
major sources data base (annual data) sum of all sources within a circle
with a radius of 25000 meters of distance around the monitor (Natural
log)log_nei_2008_pm10_sum_10000
: Tons of emissions from
major sources data base (annual data) sum of all sources within a circle
with a radius of 10000 meters of distance around the monitor (Natural
log)log_nei_2008_pm10_sum_15000
: Tons of emissions from
major sources data base (annual data) sum of all sources within a circle
with a radius of 15000 meters of distance around the monitor (Natural
log)log_nei_2008_pm10_sum_25000
: Tons of emissions from
major sources data base (annual data) sum of all sources within a circle
with a radius of 25000 meters of distance around the monitor (Natural
log)popdens_county
: Population density (number of people
per kilometer squared area of the county)popdens_zcta
: Population density (number of people per
kilometer squared area of zcta)nohs
: Percentage of people in zcta area where the
monitor is that do not have a high school degreesomehs
: Percentage of people in zcta area where the
monitor whose highest formal educational attainment was some high school
educationhs
: Percentage of people in zcta area where the monitor
whose highest formal educational attainment was completing a high school
degreesomecollege
: Percentage of people in zcta area where
the monitor whose highest formal educational attainment was completing
some college educationassociate
: Percentage of people in zcta area where the
monitor whose highest formal educational attainment was completing an
associate degreebachelor
: Percentage of people in zcta area where the
monitor whose highest formal educational attainment was a bachelor’s
degreegrad
: Percentage of people in zcta area where the
monitor whose highest formal educational attainment was a graduate
degreepov
: Percentage of people in zcta area where the
monitor is that lived in poverty in 2008 - or would it have been 2007 sourcehs_orless
: Percentage of people in zcta area where the
monitor whose highest formal educational attainment was a high school
degree or less (sum of nohs, somehs, and hs)urc2013
: 2013 Urban-rural classification of the county
where the monitor is located, where 1 is totally urban 6 is completely
rural (categories)urc2006
: 2006 Urban-rural classification of the county
where the monitor is located; works similarly to
urc2013
aod
: Aerosol Optical Depth measurement from a NASA
satellite; unit-less and acquired from NASApm <- read_csv("data/pm25_data.csv")
In general, the data contains various metrics of pollution such as
value
, CMAQ
, aod
, etc. There are
other pollution-related variables that provide more context to these
metrics, such as impervious surface measurements, education
demographics, and urban classification of areas surveyed by monitors.
The group constructed a correlation plot to explore variables that may
have correlations to each other.
# This is the correlation plot of all the variables in the pm dataset.
PM_cor <- cor(pm |> dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)
The group is also exploring the interaction effects of urban-rural classification of areas monitored against other pollution metrics. This is a graph of Aerosol Optical Depth Measurement against various classifications of urbanity.
#AOD plotting aod against urban-rural classification.
urban_plot <- ggplot(pm, aes(x = as.factor(urc2013), y = aod)) +
geom_boxplot(aes(fill = as.factor(urc2013))) +
scale_fill_viridis_d() +
labs(
title = "Pollution Levels by Urban-Rural Classification",
x = "Urban-Rural Code (2013)",
y = "Aerosol Optical Depth Measurement",
fill = "URC"
)
urban_plot
We analyze the education percentages of the population (categorized
by degrees) and the average detected emissions around each monitor. In
this visualization, we see that there is some negative correlation
between “higher” degrees (e.g. ≥ hs
) and pollution, and a
positive correlation between “lower” degrees and pollution (e.g. ≤
somehs
); however, since population is not included in this
correlation plot, we can’t be too sure on the actual value of just
pollution and education levels per area.
long_data <- pm |>
select(value, nohs:grad) |>
pivot_longer(cols = -value,
names_to = "education",
values_to = "percentage")
ggplot(long_data, aes(x = percentage, y = value)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", color = "#2c7fb8") +
facet_wrap(~education, scales = "free_x") +
labs(title = "Average Emissions vs. Education Levels",
x = "Education percentage",
y = "Average detected emissions") +
theme_minimal()
Furthermore, it also seems that the data are clustered heavily to the
left side of each plot, so we’ll check the distribution of another
feature, pov
, as poverty levels can be related to the
levels of education per area.
ggplot(pm, aes(pov)) +
geom_histogram() +
labs(title = "Examining the distribution of POV") +
theme_minimal()
Looking at the pov
distribution from the data, there’s a
right-skew in the data, signfiying that poorer populations are
overrepresented in the dataset.
pm <- mutate(pm, pov=log(pov))
ggplot(pm, aes(pov)) +
geom_histogram() +
labs(title = "Examining the distribution of POV",
subtitle = "Log-transformed POV with log_pov") +
theme_minimal()
We normalize the pov
feature in the dataset by
log-transforming the values. We then check if the distribution is
normalized so we can use it in further EDA and analysis later on.
ggplot(pm, aes(x=pov, y=value)) +
geom_point() + geom_smooth(method= 'lm', color = "#2c7fb8") +
facet_wrap(~as.factor(urc2013)) +
labs(title = "Relationship between poverty and pollution levels",
x = "Grouped percentages of people in poverty",
y = "Average detected pollution") +
theme_minimal()
Utilizing the normalized pov
on examining the average
detected pollution, we see that the data are more “evenly” distributed
amongst grouped poverty levels; however, the correlations for the levels
1 (urbanized) to 6 (rural) are fairly neutral, with level 3 having a
slight positive correlation to average detected pollution levels.
ggplot(pm, aes(popdens_zcta)) +
geom_histogram() +
labs(title = "Examining the distribution of popdens_zcta") +
theme_minimal()
Additionally, the group would also like to see the distribution of the log-transformed population density against each urban-rural classification.
pm <- mutate(pm, popdens_zcta=log(popdens_zcta))
ggplot(pm, aes(x=popdens_zcta)) +
geom_histogram() +
facet_wrap(~as.factor(urc2013)) + labs(title='Log-transformed popdens_zcta for each urc2013 value')
With these transformations, the group is also exploring the relationships of other pollution metrics (CMAQ and AOD) against population density. There appears to be a positive relationship between both variables, suggesting that population density may be a good performer in predictive analysis models.
#Mapping the relationship between the log-transformed pollution measurements and log-transformed population density.
ggplot(pm, aes(x = popdens_zcta)) +
geom_point(aes(y = log(aod))) +
geom_smooth(aes(y = log(aod)), method = "loess") +
labs(
title = "AOD vs Population Density",
x = "Log Population Density",
y = "Pollution Measures",
) +
theme_minimal()
ggplot(pm, aes(x = popdens_zcta)) +
geom_point(aes(y = log(CMAQ))) +
geom_smooth(aes(y = log(CMAQ)), method = "loess") +
labs(
title = "CMAQ vs Population Density",
x = "Log Population Density",
y = "Pollution Measures",
) +
theme_minimal()
Finally, the group is also mapping correlations of other population metrics and pollution metrics.
pm |>
mutate(popdens_county=log(popdens_county),
county_pop = log(county_pop)) |>
select(county_pop, popdens_county, pov, aod, CMAQ)|>
GGally::ggpairs()+
ggtitle("Correlations between County Population, Poverty, County Population Density, AOD, and CMAQ")
Final recipe for the linear regression model. This model sets value
as the outcome variable, with every variable set as the predictor.
Afterwards, the model sets variables id
and
fips
as id variables in order for them to not be used in
model prediction. The model also one-hot encodes categorical values such
as state
, county
, city
, and
zcta
in order for them to be properly used in a linear
regression model.
#Re-reading the data. Just to see if starting with a clean dataset would make the model run correctly
pm_original <- read_csv("data/pm25_data.csv")
# Refactoring ID, fips, and zcta as factors.
pm_original <- pm_original |>
mutate(across(c(id, fips, zcta), as.factor)) |>
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
# Splitting the data into training/testing sets.
set.seed(42)
pm_split <- rsample::initial_split(data = pm_original, prop = 2/3)
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
# Final recipe for the linear regression model. This one-hot encodes categorical values such as state, county, city, and zcta in order
novel_rec <- recipe(train_pm) |>
update_role(everything(), new_role = "predictor") |>
update_role(value, new_role = "outcome") |>
update_role(id, new_role = "id variable") |>
update_role("fips", new_role = "county id") |>
step_dummy(state, county, city, zcta, one_hot = TRUE) |>
step_corr(all_numeric()) |>
step_nzv(all_numeric())
Here, the novel_rec
created in the previous cell is then
prepped using the tidymodels
package, and then baked into
both the baked_train
and baked_test_pm
splits
of the pm
dataset.
prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
## oper 1 step dummy [training]
## oper 2 step corr [training]
## oper 3 step nzv [training]
## The retained training set is ~ 0.27 Mb in memory.
baked_train <- bake(prepped_rec, new_data = NULL)
baked_test_pm <- bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
## Rows: 292
## Columns: 39
## $ id <fct> 1003.001, 1049.1003, 1055.001, 1069.0003, …
## $ value <dbl> 9.597647, 11.659091, 12.375394, 10.508850,…
## $ fips <fct> 1003, 1049, 1055, 1069, 1073, 1089, 1097, …
## $ lat <dbl> 30.49800, 34.28763, 33.99375, 31.22636, 33…
## $ lon <dbl> -87.88141, -85.96830, -85.99107, -85.39077…
## $ CMAQ <dbl> 8.098836, 8.534772, 9.241744, 9.121892, 9.…
## $ zcta_area <dbl> 190980522, 203836235, 154069359, 162685124…
## $ zcta_pop <dbl> 27829, 8300, 20045, 30217, 21725, 21297, 2…
## $ imp_a500 <dbl> 0.01730104, 5.78200692, 16.49307958, 19.13…
## $ imp_a15000 <dbl> 1.4386207, 0.9730444, 5.1612102, 4.7401296…
## $ county_area <dbl> 4117521611, 2012662359, 1385618994, 150173…
## $ county_pop <dbl> 182265, 71109, 104430, 101547, 658466, 334…
## $ log_dist_to_prisec <dbl> 4.648181, 3.721489, 5.261457, 7.112373, 4.…
## $ log_pri_length_5000 <dbl> 8.517193, 8.517193, 9.066563, 8.517193, 8.…
## $ log_pri_length_25000 <dbl> 11.32735, 11.90959, 12.01356, 10.12663, 12…
## $ log_prisec_length_500 <dbl> 7.295356, 7.310155, 8.740680, 6.214608, 8.…
## $ log_prisec_length_1000 <dbl> 8.195119, 8.585843, 9.627898, 7.600902, 9.…
## $ log_prisec_length_5000 <dbl> 10.815042, 10.214200, 11.728889, 12.298627…
## $ log_prisec_length_10000 <dbl> 11.88680, 11.50894, 12.76828, 12.99414, 11…
## $ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 0.00000000, 4.43719884, 0.9288…
## $ log_nei_2008_pm10_sum_15000 <dbl> 2.267834, 3.350044, 4.462679, 3.674739, 5.…
## $ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 5.171920, 4.678311, 3.744629, 8.…
## $ popdens_county <dbl> 44.265706, 35.330814, 75.367038, 67.619664…
## $ popdens_zcta <dbl> 145.71643, 40.71896, 130.10374, 185.73917,…
## $ nohs <dbl> 3.3, 14.3, 4.3, 5.8, 2.8, 1.2, 5.3, 23.9, …
## $ somehs <dbl> 4.9, 16.7, 13.3, 11.6, 8.2, 3.1, 14.2, 17.…
## $ hs <dbl> 25.1, 35.0, 27.8, 29.8, 32.0, 15.1, 35.7, …
## $ somecollege <dbl> 19.7, 14.9, 29.2, 21.4, 28.9, 20.5, 23.8, …
## $ associate <dbl> 8.2, 5.5, 10.1, 7.9, 7.4, 6.5, 8.3, 4.3, 5…
## $ bachelor <dbl> 25.3, 7.9, 10.0, 13.7, 13.5, 30.4, 8.3, 4.…
## $ grad <dbl> 13.5, 5.8, 5.4, 9.8, 7.1, 23.3, 4.5, 2.0, …
## $ pov <dbl> 6.1, 13.8, 8.8, 15.6, 5.1, 5.2, 18.4, 29.5…
## $ hs_orless <dbl> 33.3, 66.0, 45.4, 47.2, 43.0, 19.4, 55.2, …
## $ urc2013 <dbl> 4, 6, 4, 4, 1, 3, 3, 1, 1, 3, 2, 2, 6, 2, …
## $ aod <dbl> 37.36364, 33.08333, 43.41667, 33.00000, 36…
## $ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ state_Illinois <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ state_Ohio <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ city_Not.in.a.city <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Below, a linear model is set up to examine the effects of
pm
’s data based on the examined factors of
state
, county
, and zcta
, along
with other variables id
and fips
.
lm_PM_model <- parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
set_mode("regression")
PM_wflow <- workflows::workflow() |>
workflows::add_recipe(novel_rec) |>
workflows::add_model(lm_PM_model)
PM_wflow_fit <- fit(PM_wflow, data = train_pm)
wflowoutput <- PM_wflow_fit |>
extract_fit_parsnip() |>
broom::tidy()
wflowoutput
## # A tibble: 37 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.82e+ 2 1.16e+ 2 1.58 0.116
## 2 lat 3.65e- 2 2.34e- 2 1.55 0.121
## 3 lon 3.18e- 2 9.33e- 3 3.41 0.000698
## 4 CMAQ 2.57e- 1 4.23e- 2 6.08 0.00000000225
## 5 zcta_area -2.96e-10 2.24e-10 -1.32 0.186
## 6 zcta_pop 8.92e- 6 5.22e- 6 1.71 0.0876
## 7 imp_a500 8.17e- 3 7.06e- 3 1.16 0.247
## 8 imp_a15000 -3.39e- 3 1.17e- 2 -0.289 0.773
## 9 county_area -1.33e-11 1.59e-11 -0.831 0.406
## 10 county_pop -6.70e- 8 8.59e- 8 -0.780 0.436
## # ℹ 27 more rows
PM_wflow_fit |>
extract_fit_parsnip() |>
vip::vip(num_features = 10)
Based on the generated visualization, the variable state
with the category of California is the most important feature found in
the linear model created, following that of CMAQ
and
aod
.
The following code block assesses our model performance. The Root Mean Squared Error (RMSE), is a metric that determines a model’s overall performance by gauging the average difference between a value and a model’s predicted output comes to around 1.89. This means that on average, the linear regression model will predict a value that is around 1.89 ug/m3 off the true value of pollution for that area.
wf_fit <- PM_wflow_fit |>
extract_fit_parsnip()
wf_fitted_values <-
broom::augment(wf_fit[["fit"]], data = baked_train) |>
select(value, .fitted:.std.resid)
yardstick::metrics(wf_fitted_values,
truth = value, estimate = .fitted)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.89
## 2 rsq standard 0.456
## 3 mae standard 1.41
We then employ cross-fold validation below, a technique that makes multiple variations of data partitioning of training and test sets for the models.
# Using Cross-fold validation for better model performance.
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
resample_fit <- tune::fit_resamples(PM_wflow, vfold_pm)
tune::show_best(resample_fit, metric = "rmse")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 2.02 4 0.0645 Preprocessor1_Model1
With the cross-fold validation, we find that the RMSE value is approximately 2.02, showing that the linear model’s data definitely shows some bias, even with the normalization of some values.
Additionally, we also have a RandomForest model. RandomForest models are useful for handling categorical variables and preventing the algorithm from overfitting the data. The following code chunks are for the RandomForest models, which uses decision trees according to the data to make predictions rather than linear regression.
RF_rec <- recipe(train_pm) |>
update_role(everything(), new_role = "predictor")|>
update_role(value, new_role = "outcome")|>
update_role(id, new_role = "id variable") |>
update_role("fips", new_role = "county id") |>
step_novel("state") |>
step_string2factor("state", "county", "city") |>
step_rm("county") |>
step_rm("zcta") |>
step_corr(all_numeric())|>
step_nzv(all_numeric())
# install.packages("randomForest")
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
set_engine("randomForest") |>
set_mode("regression")
tune_RF_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
We first initialize and tune the RandomForest recipe, examining
similar factors to that of our linear model
lm_PM_model
.
RF_tune_wflow <- workflows::workflow() |>
workflows::add_recipe(RF_rec) |>
workflows::add_model(tune_RF_model)
n_cores <- parallel::detectCores()
doParallel::registerDoParallel(cores = n_cores)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
## # Tuning results
## # 4-fold cross-validation
## # A tibble: 4 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [438/146]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [438/146]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [438/146]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [438/146]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>
The RandomForest recipe is then incorporated into the workflow, and
then the results are examined using a tune_grid
function.
tune_RF_results |>
collect_metrics()
## # A tibble: 40 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 23 7 rmse standard 1.59 4 0.116 Preprocessor1_Model01
## 2 23 7 rsq standard 0.619 4 0.0617 Preprocessor1_Model01
## 3 17 30 rmse standard 1.64 4 0.111 Preprocessor1_Model02
## 4 17 30 rsq standard 0.600 4 0.0632 Preprocessor1_Model02
## 5 15 19 rmse standard 1.62 4 0.102 Preprocessor1_Model03
## 6 15 19 rsq standard 0.618 4 0.0611 Preprocessor1_Model03
## 7 26 40 rmse standard 1.65 4 0.108 Preprocessor1_Model04
## 8 26 40 rsq standard 0.592 4 0.0571 Preprocessor1_Model04
## 9 34 20 rmse standard 1.62 4 0.124 Preprocessor1_Model05
## 10 34 20 rsq standard 0.605 4 0.0627 Preprocessor1_Model05
## # ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 23 7 rmse standard 1.59 4 0.116 Preprocessor1_Model01
We then look at the RMSE value of the RandomForest model with
n=7
, which is approximately 1.59; it signifies that even
with a bias in data, it still has more precision than that of the linear
model.
tuned_RF_values<- select_best(tune_RF_results)
tuned_RF_values
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 23 7 Preprocessor1_Model01
# specify best combination from tune in workflow
RF_tuned_wflow <-RF_tune_wflow |>
tune::finalize_workflow(tuned_RF_values)
# fit model with those parameters on train AND test
overallfit <- RF_tuned_wflow |>
tune::last_fit(pm_split)
collect_metrics(overallfit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.81 Preprocessor1_Model1
## 2 rsq standard 0.526 Preprocessor1_Model1
test_predictions <- collect_predictions(overallfit)
With this model, we have an overall RMSE of 1.80. This performs better than our linear regression model that has a RMSE of 2.04. Assessing model performance, and extracting the best features that are most predictive of the value variable.
overallfit |>
extract_fit_parsnip() |>
vip::vip(num_features = 10)
# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)
# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_pm) |>
bind_cols(train_pm |> select(value, fips, county, id))
# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_pm) |>
bind_cols(test_pm |> select(value, fips, county, id))
Based on our visualizations and relationships between variables found
in the EDA, we decided to conduct further analysis by investigate the
interaction between poverty, education level, and pollution level, to
see if they have different effects in urban and rural areas. For the
purposes of this analysis, the group partitioned the
urc2013
variable into 3 groups, with the range 1-2
representing urban areas, 3-4 suburban areas, and 5-6 as rural
areas.
# Mutating the dataset in order to create 3 different types of urbanity
pm <- read_csv("data/pm25_data.csv")
pm <- pm %>%
mutate(urban_category = case_when(
urc2013 <= 2 ~ "Urban",
urc2013 <= 4 ~ "Suburban",
TRUE ~ "Rural"
),
urban_category = factor(urban_category))
# 1. Fit separate models for each area type
urban_model <- lm(value ~ pov + nohs + somehs + hs + somecollege +
associate + bachelor + grad,
data = dplyr::filter(pm, urban_category == "Urban"))
suburban_model <- lm(value ~ pov + nohs + somehs + hs + somecollege +
associate + bachelor + grad,
data = dplyr::filter(pm, urban_category == "Suburban"))
rural_model <- lm(value ~ pov + nohs + somehs + hs + somecollege +
associate + bachelor + grad,
data = dplyr::filter(pm, urban_category == "Rural"))
# Get summaries of area-specific models
urban_summary <- summary(urban_model)
suburban_summary <- summary(suburban_model)
rural_summary <- summary(rural_model)
print(urban_summary)
##
## Call:
## lm(formula = value ~ pov + nohs + somehs + hs + somecollege +
## associate + bachelor + grad, data = dplyr::filter(pm, urban_category ==
## "Urban"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.215 -1.249 0.357 1.221 8.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.16702 151.01249 1.200 0.2311
## pov 0.02412 0.01403 1.719 0.0865 .
## nohs -1.72062 1.51069 -1.139 0.2555
## somehs -1.68736 1.51023 -1.117 0.2646
## hs -1.67042 1.51050 -1.106 0.2695
## somecollege -1.73768 1.50959 -1.151 0.2505
## associate -1.71928 1.51030 -1.138 0.2557
## bachelor -1.70795 1.50948 -1.131 0.2586
## grad -1.69418 1.51155 -1.121 0.2631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.19 on 357 degrees of freedom
## Multiple R-squared: 0.06117, Adjusted R-squared: 0.04013
## F-statistic: 2.907 on 8 and 357 DF, p-value: 0.003742
print(suburban_summary)
##
## Call:
## lm(formula = value ~ pov + nohs + somehs + hs + somecollege +
## associate + bachelor + grad, data = dplyr::filter(pm, urban_category ==
## "Suburban"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8042 -1.5168 0.1097 1.4564 11.7876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240.003534 201.902447 1.189 0.235
## pov -0.006719 0.016145 -0.416 0.678
## nohs -2.269808 2.018715 -1.124 0.262
## somehs -2.211967 2.023193 -1.093 0.275
## hs -2.296811 2.019562 -1.137 0.256
## somecollege -2.291402 2.019005 -1.135 0.257
## associate -2.372739 2.015763 -1.177 0.240
## bachelor -2.300595 2.019021 -1.139 0.255
## grad -2.302140 2.020248 -1.140 0.255
##
## Residual standard error: 2.675 on 342 degrees of freedom
## Multiple R-squared: 0.07448, Adjusted R-squared: 0.05283
## F-statistic: 3.44 on 8 and 342 DF, p-value: 0.0007964
print(rural_summary)
##
## Call:
## lm(formula = value ~ pov + nohs + somehs + hs + somecollege +
## associate + bachelor + grad, data = dplyr::filter(pm, urban_category ==
## "Rural"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6241 -1.3608 0.4603 1.6860 5.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -112.96009 291.33194 -0.388 0.699
## pov -0.01331 0.02677 -0.497 0.620
## nohs 1.25747 2.91330 0.432 0.667
## somehs 1.32151 2.91554 0.453 0.651
## hs 1.22013 2.91315 0.419 0.676
## somecollege 1.21174 2.91223 0.416 0.678
## associate 1.25122 2.91775 0.429 0.669
## bachelor 1.19029 2.91253 0.409 0.683
## grad 1.19536 2.91169 0.411 0.682
##
## Residual standard error: 2.481 on 150 degrees of freedom
## Multiple R-squared: 0.09711, Adjusted R-squared: 0.04895
## F-statistic: 2.017 on 8 and 150 DF, p-value: 0.04808
These models are highly ineffective, with R^2 values below 0.10. Instead, we will use the full model given in Lesson 12 and enhance it by including interactions between poverty + education variables and the urc2013 variable. We also subtract five of the variables from the full model, as they were found to decrease accuracy.
pov_urc = lm(value ~ . + (pov + nohs + somehs + hs + somecollege +
associate + bachelor + grad) * urc2013 - zcta_area - imp_a10000 - imp_a15000 - popdens_zcta - lon, data=pm)
glance(pov_urc)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.991 0.918 0.740 13.5 4.19e-34 784 12.5 1547. 5300.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The adjusted R-squared value of this model is 0.005 better than the full model, making it a slightly better model.
While our models that utilized data segmented into urban, suburban, and rural did not perform well, our enhanced full model delivered promising results. It is well known that higher education levels is correlated with lower poverty. Knowing this, we can look at how that interaction interacts with area type. We theorize that urban areas with high poverty and low education levels are usually industrial, as more rent prices are lower in these areas and the workers are generally less educated. Industrial areas are known for production, which usually causes pollution and poorer air quality. Meanwhile, rural areas as a whole are generally less developed. This means less pollution and better air quality. However, urban areas with low poverty and high education levels are generally “gentrified” and prioritize clean air, which would mean better air quality. Therefore, the interaction between these three factors (poverty, education, area type) is more complex than a simple linear interaction. This is why we chose to add these interaction effects to the full model.
To visualize our model’s performance the group plotted the actual values of pollution for our data in the test set against the model’s predicted values.
wf_fitted_values |>
ggplot(aes(x = value, y = .fitted)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
labs(
title = "Predicted vs. Actual Pollution Values for Linear Regression Modelling",
subtitle = "Perfect predictions would fall on the dashed red line",
x = "Predicted Values (µg/m³)",
y = "Actual Values (µg/m³)"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, color = "gray40"),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_line(color = "gray95"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12)
)
This is our graph of actual values of pollution versus the predicted values in our RF model.
ggplot(values_pred_test, aes(x=value,y=.pred))+
geom_point(alpha = 0.5, color = "#2C3E50") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
labs(
title = "Predicted vs. Actual Pollution Values for Random Forest Modelling",
subtitle = "Perfect predictions would fall on the dashed red line",
x = "Actual Values (µg/m³)",
y = "Predicted Values (µg/m³)"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, color = "gray40"),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_line(color = "gray95"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12)
)
Results of analysis and modeling. Primarily discussing the linear
model’s RMSE (2.04) and the RandomForest’s RMSE (1.58), as well as the
extension analysis (I think the best model has an adjusted R^2 value of
0.9145), which is higher than the full model shown in class (according
to the Discord gc). Discuss how our limits include wrangling the dataset
given its skew (mainly in the pov
and
popdens_zcta
).
Our analysis reveals important insight into the relationship shared by poverty, density, and pollution levels.
In our extended analysis, we observed the dynamics between air pollution, poverty, and education levels given different geographical contexts. The low adjusted R^2 values across all the linear regression models within the URC subsets (urban, suburban, rural) suggest that poverty and education levels may not have a strong influence on pollution levels, regardless of area type.
Limitations within our work include working with variables with
skewed distribution (mainly pov
and
popdens_zcta
), which required log transformations to
normalize data for meaningful analysis. This excluded zero values from
our analysis and may have masked non-linear relationships, potentially
limiting application beyond this study.
Within this study, we used machine learning models, specifically linear regression and random forest, to determine how accurately we can predict particulate matter pollution within the United states. Of the models tested, the Random Forest model performed the best, with a Root Mean Squared Error of 1.58, compared to the linear regression model with a RMSE of 2.04. Having a lower RMSE value indicates that the model’s predictions are closer to the actual values. With the Random Forest model, using the adjusted R^2, we determined that 91.45% of the variance within the data could be explained with our model. Within our extended analysis, we determined that education levels and poverty do not have different effects on particulate matter pollution levels within different URC classifications. Furthermore, given the low adjusted R^2 values across the prediction models, education levels and poverty in an area don’t seem to be indicative of pollution at all. Further research can include investigating interactions between other variables within the dataset, to improve prediction accuracy. Overall, we use machine learning methods to effectively predict particulate matter pollution levels within the US, which is a crucial step in identifying high-risk areas and gaining deeper insights into its effects on human health.
For our extended analysis, we first attempted to segment the data into three sets, each corresponding to an area type (urban, suburban, and rural), then building area-specific models. We tried to only use poverty level and education levels as features, but these models performed very poorly, with very low R^2 values. We decided to pivot to enhancing the full model introduced in class. We added interaction effects between urc2013 and both pov and the education variables, as well as removing the five variables that were found to be unhelpful in Lesson 12. With that, we were able to develop a model that was slightly more accurate than the full model.
To further understand if the dynamic between poverty level, education level, and area type can illuminate how to predict pollution, it would be easiest to use a dataset that contains information detailing the kinds of buildings in each area. This way, we could avoid making inferences about whether or not the area consists of high-polluting production plants or cleaner buildings. Further research could then look more closely at what buildings contribute to lower air quality. This would inform future urban development to prevent air quality issues and fight climate change.
Cohen et al. (2017). Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015. Lancet (London, England), 389(10082), 1907–1918. https://doi.org/10.1016/S0140-6736(17)30505-6↩︎
Health Effects Institute. 2018. State of Global Air 2018. Special Report. Boston, MA:Health Effects Institute.↩︎