Introduction

Air pollution is a major global health concern, causing 1 in 9 deaths worldwide in 20181. Across the various types of pollution, particulate matter is the most lethal, ranking as the sixth leading cause of death worldwide2. Given its significant impact on human health, determining the amount of particulate matter pollution across the United States is crucial for identifying high-risk areas and learning more about its effects on human health. Factors like poverty and education levels are commonly associated with unequal pollution exposure, but their influence may vary within different geographical contexts. Urban and rural areas differ in a several aspects, such as population density, land use, and pollution sources which could impact the relationship of these factors on exposure levels. In this case study, we investigate to what accuracy are we able to predict annual air pollution concentrations within the US using machine learning methods. Additionally, we investigate whether poverty and education impact pollution exposure differently in urban and rural areas.

Questions

With what accuracy can we predict US annual average air pollution concentrations?

Does the impact of poverty and education levels on pollution exposure differ between urban and rural areas?

Load packages

Here, we import packages tidyverse, tidymodels, and conflicted to clean, model, and predict the data.

library(tidyverse)
library(tidymodels)
library(conflicted)

The Data

The data that we’re examining consists of several monitors across the U.S. that examine air pollution within the area. This is denoted by variable pm. A detailed list of the columns within pm are listed below.

  • id: Monitor number
  • fips: Federal information processing standard number for the county where the monitor is located. Alaska and Hawaii are excluded.
  • Lat: Latitude of monitor
  • Long: Longitude of monitor
  • state: State where monitor is located
  • county: County where monitor is located
  • city: City where monitor is located
  • CMAQ: Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ)
  • zcta: Zip Code Tabulation Area where the monitor is located
  • zcta_area: Land area of the zip code area in meters squared
  • zcta_pop: Population of ZCTA where monitor is located
  • imp_a500: Impervious surface measure within a circle with a radius of 500 meters around the monitor
  • imp_a1000: Impervious surface measure within a circle with a radius of 1000 meters around the monitor
  • imp_a10000: Impervious surface measure within a circle with a radius of 10000 meters around the monitor
  • imp_a15000: Impervious surface measure within a circle with a radius of 15000 meters around the monitor
  • county_area: Land area of the county of the monitor in meters squared
  • county_pop: Population of the county of the monitor
  • Log_dist_to_prisec: Natural log distance to a primary or secondary road from the monitor highway or major road
  • log_pri_length_5000: Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (highways only)
  • log_pri_length_10000: Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (highways only)
  • log_pri_length_15000: Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (highways only)
  • log_pri_length_25000: Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (highways only)
  • log_prisec_length_500: Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (highways + secondary roads)
  • log_prisec_length_1000: Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (highways + secondary roads)
  • log_prisec_length_5000: Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (highways + secondary roads)
  • log_prisec_length_10000: Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (highways + secondary roads)
  • log_prisec_length_15000: Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (highways + secondary roads)
  • log_prisec_length_25000: Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (highways + secondary roads)
  • log_nei_2008_pm25_sum_10000: Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
  • log_nei_2008_pm25_sum_15000: Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
  • log_nei_2008_pm25_sum_25000: Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
  • log_nei_2008_pm10_sum_10000: Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
  • log_nei_2008_pm10_sum_15000: Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
  • log_nei_2008_pm10_sum_25000: Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
  • popdens_county: Population density (number of people per kilometer squared area of the county)
  • popdens_zcta: Population density (number of people per kilometer squared area of zcta)
  • nohs: Percentage of people in zcta area where the monitor is that do not have a high school degree
  • somehs: Percentage of people in zcta area where the monitor whose highest formal educational attainment was some high school education
  • hs: Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing a high school degree
  • somecollege: Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing some college education
  • associate: Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing an associate degree
  • bachelor: Percentage of people in zcta area where the monitor whose highest formal educational attainment was a bachelor’s degree
  • grad: Percentage of people in zcta area where the monitor whose highest formal educational attainment was a graduate degree
  • pov: Percentage of people in zcta area where the monitor is that lived in poverty in 2008 - or would it have been 2007 source
  • hs_orless: Percentage of people in zcta area where the monitor whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
  • urc2013: 2013 Urban-rural classification of the county where the monitor is located, where 1 is totally urban 6 is completely rural (categories)
  • urc2006: 2006 Urban-rural classification of the county where the monitor is located; works similarly to urc2013
  • aod: Aerosol Optical Depth measurement from a NASA satellite; unit-less and acquired from NASA

Data Import

pm <- read_csv("data/pm25_data.csv")

In general, the data contains various metrics of pollution such as value, CMAQ, aod, etc. There are other pollution-related variables that provide more context to these metrics, such as impervious surface measurements, education demographics, and urban classification of areas surveyed by monitors. The group constructed a correlation plot to explore variables that may have correlations to each other.

Data Wrangling

# This is the correlation plot of all the variables in the pm dataset.
PM_cor <- cor(pm |> dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)

The group is also exploring the interaction effects of urban-rural classification of areas monitored against other pollution metrics. This is a graph of Aerosol Optical Depth Measurement against various classifications of urbanity.

#AOD plotting aod against urban-rural classification.
urban_plot <- ggplot(pm, aes(x = as.factor(urc2013), y = aod)) +
    geom_boxplot(aes(fill = as.factor(urc2013))) +
    scale_fill_viridis_d() +
    labs(
        title = "Pollution Levels by Urban-Rural Classification",
        x = "Urban-Rural Code (2013)",
        y = "Aerosol Optical Depth Measurement",
        fill = "URC"
    )

urban_plot

We analyze the education percentages of the population (categorized by degrees) and the average detected emissions around each monitor. In this visualization, we see that there is some negative correlation between “higher” degrees (e.g. ≥ hs) and pollution, and a positive correlation between “lower” degrees and pollution (e.g. ≤ somehs); however, since population is not included in this correlation plot, we can’t be too sure on the actual value of just pollution and education levels per area.

long_data <- pm |>
  select(value, nohs:grad) |>
  pivot_longer(cols = -value,
               names_to = "education",
               values_to = "percentage")

ggplot(long_data, aes(x = percentage, y = value)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "#2c7fb8") +
  facet_wrap(~education, scales = "free_x") +
  labs(title = "Average Emissions vs. Education Levels",
       x = "Education percentage",
       y = "Average detected emissions") +
  theme_minimal()

Furthermore, it also seems that the data are clustered heavily to the left side of each plot, so we’ll check the distribution of another feature, pov, as poverty levels can be related to the levels of education per area.

ggplot(pm, aes(pov)) + 
  geom_histogram() +
  labs(title = "Examining the distribution of POV") +
  theme_minimal()

Looking at the pov distribution from the data, there’s a right-skew in the data, signfiying that poorer populations are overrepresented in the dataset.

pm <- mutate(pm, pov=log(pov))

ggplot(pm, aes(pov)) + 
  geom_histogram() + 
  labs(title = "Examining the distribution of POV",
       subtitle = "Log-transformed POV with log_pov") +
  theme_minimal()

We normalize the pov feature in the dataset by log-transforming the values. We then check if the distribution is normalized so we can use it in further EDA and analysis later on.

ggplot(pm, aes(x=pov, y=value)) +
  geom_point() + geom_smooth(method= 'lm', color = "#2c7fb8") +
  facet_wrap(~as.factor(urc2013)) +
  labs(title = "Relationship between poverty and pollution levels",
       x = "Grouped percentages of people in poverty",
       y = "Average detected pollution") + 
  theme_minimal()

Utilizing the normalized pov on examining the average detected pollution, we see that the data are more “evenly” distributed amongst grouped poverty levels; however, the correlations for the levels 1 (urbanized) to 6 (rural) are fairly neutral, with level 3 having a slight positive correlation to average detected pollution levels.

ggplot(pm, aes(popdens_zcta)) + 
  geom_histogram() +
  labs(title = "Examining the distribution of popdens_zcta") +
  theme_minimal()

Additionally, the group would also like to see the distribution of the log-transformed population density against each urban-rural classification.

pm <- mutate(pm, popdens_zcta=log(popdens_zcta))

ggplot(pm, aes(x=popdens_zcta)) +
  geom_histogram() +
  facet_wrap(~as.factor(urc2013)) + labs(title='Log-transformed popdens_zcta for each urc2013 value')

With these transformations, the group is also exploring the relationships of other pollution metrics (CMAQ and AOD) against population density. There appears to be a positive relationship between both variables, suggesting that population density may be a good performer in predictive analysis models.

#Mapping the relationship between the log-transformed pollution measurements and log-transformed population density.

ggplot(pm, aes(x = popdens_zcta)) +
    geom_point(aes(y = log(aod))) +
    geom_smooth(aes(y = log(aod)), method = "loess") +
    labs(
        title = "AOD vs Population Density",
        x = "Log Population Density",
        y = "Pollution Measures",
    ) +
    theme_minimal()

ggplot(pm, aes(x = popdens_zcta)) +
    geom_point(aes(y = log(CMAQ))) +
    geom_smooth(aes(y = log(CMAQ)), method = "loess") +
    labs(
        title = "CMAQ vs Population Density",
        x = "Log Population Density",
        y = "Pollution Measures",
    ) +
    theme_minimal()

Finally, the group is also mapping correlations of other population metrics and pollution metrics.

pm |>
  mutate(popdens_county=log(popdens_county),
         county_pop = log(county_pop)) |>
  select(county_pop, popdens_county, pov, aod, CMAQ)|>
    GGally::ggpairs()+
    ggtitle("Correlations between County Population, Poverty, County Population Density, AOD, and CMAQ")

Analysis

Final recipe for the linear regression model. This model sets value as the outcome variable, with every variable set as the predictor. Afterwards, the model sets variables id and fips as id variables in order for them to not be used in model prediction. The model also one-hot encodes categorical values such as state, county, city, and zcta in order for them to be properly used in a linear regression model.

#Re-reading the data. Just to see if starting with a clean dataset would make the model run correctly
pm_original <- read_csv("data/pm25_data.csv")

# Refactoring ID, fips, and zcta as factors.
pm_original <- pm_original |>
  mutate(across(c(id, fips, zcta), as.factor)) |>
  mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                          city != "Not in a city" ~ "In a city"))
  
# Splitting the data into training/testing sets.
set.seed(42)
pm_split <- rsample::initial_split(data = pm_original, prop = 2/3)
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)

# Final recipe for the linear regression model. This one-hot encodes categorical values such as state, county, city, and zcta in order 
novel_rec <- recipe(train_pm) |>
    update_role(everything(), new_role = "predictor") |>
    update_role(value, new_role = "outcome") |>
    update_role(id, new_role = "id variable") |>
    update_role("fips", new_role = "county id") |>
    step_dummy(state, county, city, zcta, one_hot = TRUE) |>
    step_corr(all_numeric()) |>
    step_nzv(all_numeric()) 

Here, the novel_rec created in the previous cell is then prepped using the tidymodels package, and then baked into both the baked_train and baked_test_pm splits of the pm dataset.

prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
## oper 1 step dummy [training] 
## oper 2 step corr [training] 
## oper 3 step nzv [training] 
## The retained training set is ~ 0.27 Mb  in memory.
baked_train <- bake(prepped_rec, new_data = NULL)
baked_test_pm <- bake(prepped_rec, new_data = test_pm)

glimpse(baked_test_pm)
## Rows: 292
## Columns: 39
## $ id                          <fct> 1003.001, 1049.1003, 1055.001, 1069.0003, …
## $ value                       <dbl> 9.597647, 11.659091, 12.375394, 10.508850,…
## $ fips                        <fct> 1003, 1049, 1055, 1069, 1073, 1089, 1097, …
## $ lat                         <dbl> 30.49800, 34.28763, 33.99375, 31.22636, 33…
## $ lon                         <dbl> -87.88141, -85.96830, -85.99107, -85.39077…
## $ CMAQ                        <dbl> 8.098836, 8.534772, 9.241744, 9.121892, 9.…
## $ zcta_area                   <dbl> 190980522, 203836235, 154069359, 162685124…
## $ zcta_pop                    <dbl> 27829, 8300, 20045, 30217, 21725, 21297, 2…
## $ imp_a500                    <dbl> 0.01730104, 5.78200692, 16.49307958, 19.13…
## $ imp_a15000                  <dbl> 1.4386207, 0.9730444, 5.1612102, 4.7401296…
## $ county_area                 <dbl> 4117521611, 2012662359, 1385618994, 150173…
## $ county_pop                  <dbl> 182265, 71109, 104430, 101547, 658466, 334…
## $ log_dist_to_prisec          <dbl> 4.648181, 3.721489, 5.261457, 7.112373, 4.…
## $ log_pri_length_5000         <dbl> 8.517193, 8.517193, 9.066563, 8.517193, 8.…
## $ log_pri_length_25000        <dbl> 11.32735, 11.90959, 12.01356, 10.12663, 12…
## $ log_prisec_length_500       <dbl> 7.295356, 7.310155, 8.740680, 6.214608, 8.…
## $ log_prisec_length_1000      <dbl> 8.195119, 8.585843, 9.627898, 7.600902, 9.…
## $ log_prisec_length_5000      <dbl> 10.815042, 10.214200, 11.728889, 12.298627…
## $ log_prisec_length_10000     <dbl> 11.88680, 11.50894, 12.76828, 12.99414, 11…
## $ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 0.00000000, 4.43719884, 0.9288…
## $ log_nei_2008_pm10_sum_15000 <dbl> 2.267834, 3.350044, 4.462679, 3.674739, 5.…
## $ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 5.171920, 4.678311, 3.744629, 8.…
## $ popdens_county              <dbl> 44.265706, 35.330814, 75.367038, 67.619664…
## $ popdens_zcta                <dbl> 145.71643, 40.71896, 130.10374, 185.73917,…
## $ nohs                        <dbl> 3.3, 14.3, 4.3, 5.8, 2.8, 1.2, 5.3, 23.9, …
## $ somehs                      <dbl> 4.9, 16.7, 13.3, 11.6, 8.2, 3.1, 14.2, 17.…
## $ hs                          <dbl> 25.1, 35.0, 27.8, 29.8, 32.0, 15.1, 35.7, …
## $ somecollege                 <dbl> 19.7, 14.9, 29.2, 21.4, 28.9, 20.5, 23.8, …
## $ associate                   <dbl> 8.2, 5.5, 10.1, 7.9, 7.4, 6.5, 8.3, 4.3, 5…
## $ bachelor                    <dbl> 25.3, 7.9, 10.0, 13.7, 13.5, 30.4, 8.3, 4.…
## $ grad                        <dbl> 13.5, 5.8, 5.4, 9.8, 7.1, 23.3, 4.5, 2.0, …
## $ pov                         <dbl> 6.1, 13.8, 8.8, 15.6, 5.1, 5.2, 18.4, 29.5…
## $ hs_orless                   <dbl> 33.3, 66.0, 45.4, 47.2, 43.0, 19.4, 55.2, …
## $ urc2013                     <dbl> 4, 6, 4, 4, 1, 3, 3, 1, 1, 3, 2, 2, 6, 2, …
## $ aod                         <dbl> 37.36364, 33.08333, 43.41667, 33.00000, 36…
## $ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ state_Illinois              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ state_Ohio                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Below, a linear model is set up to examine the effects of pm’s data based on the examined factors of state, county, and zcta, along with other variables id and fips.

lm_PM_model <- parsnip::linear_reg() |>
  parsnip::set_engine("lm") |>
  set_mode("regression")

PM_wflow <- workflows::workflow() |>
    workflows::add_recipe(novel_rec) |>
    workflows::add_model(lm_PM_model)

PM_wflow_fit <- fit(PM_wflow, data = train_pm)

wflowoutput <- PM_wflow_fit |> 
  extract_fit_parsnip() |> 
  broom::tidy() 

wflowoutput
## # A tibble: 37 × 5
##    term         estimate std.error statistic       p.value
##    <chr>           <dbl>     <dbl>     <dbl>         <dbl>
##  1 (Intercept)  1.82e+ 2  1.16e+ 2     1.58  0.116        
##  2 lat          3.65e- 2  2.34e- 2     1.55  0.121        
##  3 lon          3.18e- 2  9.33e- 3     3.41  0.000698     
##  4 CMAQ         2.57e- 1  4.23e- 2     6.08  0.00000000225
##  5 zcta_area   -2.96e-10  2.24e-10    -1.32  0.186        
##  6 zcta_pop     8.92e- 6  5.22e- 6     1.71  0.0876       
##  7 imp_a500     8.17e- 3  7.06e- 3     1.16  0.247        
##  8 imp_a15000  -3.39e- 3  1.17e- 2    -0.289 0.773        
##  9 county_area -1.33e-11  1.59e-11    -0.831 0.406        
## 10 county_pop  -6.70e- 8  8.59e- 8    -0.780 0.436        
## # ℹ 27 more rows
PM_wflow_fit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

Based on the generated visualization, the variable state with the category of California is the most important feature found in the linear model created, following that of CMAQ and aod.

The following code block assesses our model performance. The Root Mean Squared Error (RMSE), is a metric that determines a model’s overall performance by gauging the average difference between a value and a model’s predicted output comes to around 1.89. This means that on average, the linear regression model will predict a value that is around 1.89 ug/m3 off the true value of pollution for that area.

wf_fit <- PM_wflow_fit |> 
  extract_fit_parsnip()

wf_fitted_values <- 
  broom::augment(wf_fit[["fit"]], data = baked_train) |> 
  select(value, .fitted:.std.resid)


yardstick::metrics(wf_fitted_values,
                   truth = value, estimate = .fitted)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1.89 
## 2 rsq     standard       0.456
## 3 mae     standard       1.41

We then employ cross-fold validation below, a technique that makes multiple variations of data partitioning of training and test sets for the models.

# Using Cross-fold validation for better model performance.
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
resample_fit <- tune::fit_resamples(PM_wflow, vfold_pm)
tune::show_best(resample_fit, metric = "rmse")
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard    2.02     4  0.0645 Preprocessor1_Model1

With the cross-fold validation, we find that the RMSE value is approximately 2.02, showing that the linear model’s data definitely shows some bias, even with the normalization of some values.

Additionally, we also have a RandomForest model. RandomForest models are useful for handling categorical variables and preventing the algorithm from overfitting the data. The following code chunks are for the RandomForest models, which uses decision trees according to the data to make predictions rather than linear regression.

RF_rec <- recipe(train_pm) |>
    update_role(everything(), new_role = "predictor")|>
    update_role(value, new_role = "outcome")|>
    update_role(id, new_role = "id variable") |>
    update_role("fips", new_role = "county id") |>
    step_novel("state") |>
    step_string2factor("state", "county", "city") |>
    step_rm("county") |>
    step_rm("zcta") |>
    step_corr(all_numeric())|>
    step_nzv(all_numeric())

# install.packages("randomForest")
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("randomForest") |>
  set_mode("regression")
    
tune_RF_model
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: randomForest

We first initialize and tune the RandomForest recipe, examining similar factors to that of our linear model lm_PM_model.

RF_tune_wflow <- workflows::workflow() |>
  workflows::add_recipe(RF_rec) |>
  workflows::add_model(tune_RF_model)

n_cores <- parallel::detectCores()
doParallel::registerDoParallel(cores = n_cores)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results
## # Tuning results
## # 4-fold cross-validation 
## # A tibble: 4 × 4
##   splits            id    .metrics          .notes          
##   <list>            <chr> <list>            <list>          
## 1 <split [438/146]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
## 2 <split [438/146]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
## 3 <split [438/146]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
## 4 <split [438/146]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>

The RandomForest recipe is then incorporated into the workflow, and then the results are examined using a tune_grid function.

tune_RF_results |>
  collect_metrics()
## # A tibble: 40 × 8
##     mtry min_n .metric .estimator  mean     n std_err .config              
##    <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1    23     7 rmse    standard   1.59      4  0.116  Preprocessor1_Model01
##  2    23     7 rsq     standard   0.619     4  0.0617 Preprocessor1_Model01
##  3    17    30 rmse    standard   1.64      4  0.111  Preprocessor1_Model02
##  4    17    30 rsq     standard   0.600     4  0.0632 Preprocessor1_Model02
##  5    15    19 rmse    standard   1.62      4  0.102  Preprocessor1_Model03
##  6    15    19 rsq     standard   0.618     4  0.0611 Preprocessor1_Model03
##  7    26    40 rmse    standard   1.65      4  0.108  Preprocessor1_Model04
##  8    26    40 rsq     standard   0.592     4  0.0571 Preprocessor1_Model04
##  9    34    20 rmse    standard   1.62      4  0.124  Preprocessor1_Model05
## 10    34    20 rsq     standard   0.605     4  0.0627 Preprocessor1_Model05
## # ℹ 30 more rows
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1    23     7 rmse    standard    1.59     4   0.116 Preprocessor1_Model01

We then look at the RMSE value of the RandomForest model with n=7, which is approximately 1.59; it signifies that even with a bias in data, it still has more precision than that of the linear model.

tuned_RF_values<- select_best(tune_RF_results)
tuned_RF_values
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1    23     7 Preprocessor1_Model01
# specify best combination from tune in workflow
RF_tuned_wflow <-RF_tune_wflow |>
  tune::finalize_workflow(tuned_RF_values)

# fit model with those parameters on train AND test
overallfit <- RF_tuned_wflow |>
  tune::last_fit(pm_split)

collect_metrics(overallfit)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       1.81  Preprocessor1_Model1
## 2 rsq     standard       0.526 Preprocessor1_Model1
test_predictions <- collect_predictions(overallfit)

With this model, we have an overall RMSE of 1.80. This performs better than our linear regression model that has a RMSE of 2.04. Assessing model performance, and extracting the best features that are most predictive of the value variable.

overallfit |> 
  extract_fit_parsnip() |> 
  vip::vip(num_features = 10)

# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)

# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_pm) |> 
  bind_cols(train_pm |> select(value, fips, county, id)) 

# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_pm) |> 
  bind_cols(test_pm |> select(value, fips, county, id)) 

Extension Analysis

Based on our visualizations and relationships between variables found in the EDA, we decided to conduct further analysis by investigate the interaction between poverty, education level, and pollution level, to see if they have different effects in urban and rural areas. For the purposes of this analysis, the group partitioned the urc2013 variable into 3 groups, with the range 1-2 representing urban areas, 3-4 suburban areas, and 5-6 as rural areas.

# Mutating the dataset in order to create 3 different types of urbanity
pm <- read_csv("data/pm25_data.csv")
pm <- pm %>%
  mutate(urban_category = case_when(
    urc2013 <= 2 ~ "Urban",
    urc2013 <= 4 ~ "Suburban",
    TRUE ~ "Rural"
  ),
  urban_category = factor(urban_category))

# 1. Fit separate models for each area type
urban_model <- lm(value ~ pov + nohs + somehs + hs + somecollege + 
                  associate + bachelor + grad, 
                data = dplyr::filter(pm, urban_category == "Urban"))

suburban_model <- lm(value ~ pov + nohs + somehs + hs + somecollege + 
                    associate + bachelor + grad, 
                    data = dplyr::filter(pm, urban_category == "Suburban"))

rural_model <- lm(value ~ pov + nohs + somehs + hs + somecollege + 
                 associate + bachelor + grad, 
                 data = dplyr::filter(pm, urban_category == "Rural"))

# Get summaries of area-specific models
urban_summary <- summary(urban_model)
suburban_summary <- summary(suburban_model)
rural_summary <- summary(rural_model)
print(urban_summary)
## 
## Call:
## lm(formula = value ~ pov + nohs + somehs + hs + somecollege + 
##     associate + bachelor + grad, data = dplyr::filter(pm, urban_category == 
##     "Urban"))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.215 -1.249  0.357  1.221  8.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 181.16702  151.01249   1.200   0.2311  
## pov           0.02412    0.01403   1.719   0.0865 .
## nohs         -1.72062    1.51069  -1.139   0.2555  
## somehs       -1.68736    1.51023  -1.117   0.2646  
## hs           -1.67042    1.51050  -1.106   0.2695  
## somecollege  -1.73768    1.50959  -1.151   0.2505  
## associate    -1.71928    1.51030  -1.138   0.2557  
## bachelor     -1.70795    1.50948  -1.131   0.2586  
## grad         -1.69418    1.51155  -1.121   0.2631  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.19 on 357 degrees of freedom
## Multiple R-squared:  0.06117,    Adjusted R-squared:  0.04013 
## F-statistic: 2.907 on 8 and 357 DF,  p-value: 0.003742
print(suburban_summary)
## 
## Call:
## lm(formula = value ~ pov + nohs + somehs + hs + somecollege + 
##     associate + bachelor + grad, data = dplyr::filter(pm, urban_category == 
##     "Suburban"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8042 -1.5168  0.1097  1.4564 11.7876 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240.003534 201.902447   1.189    0.235
## pov          -0.006719   0.016145  -0.416    0.678
## nohs         -2.269808   2.018715  -1.124    0.262
## somehs       -2.211967   2.023193  -1.093    0.275
## hs           -2.296811   2.019562  -1.137    0.256
## somecollege  -2.291402   2.019005  -1.135    0.257
## associate    -2.372739   2.015763  -1.177    0.240
## bachelor     -2.300595   2.019021  -1.139    0.255
## grad         -2.302140   2.020248  -1.140    0.255
## 
## Residual standard error: 2.675 on 342 degrees of freedom
## Multiple R-squared:  0.07448,    Adjusted R-squared:  0.05283 
## F-statistic:  3.44 on 8 and 342 DF,  p-value: 0.0007964
print(rural_summary)
## 
## Call:
## lm(formula = value ~ pov + nohs + somehs + hs + somecollege + 
##     associate + bachelor + grad, data = dplyr::filter(pm, urban_category == 
##     "Rural"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6241 -1.3608  0.4603  1.6860  5.9887 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -112.96009  291.33194  -0.388    0.699
## pov           -0.01331    0.02677  -0.497    0.620
## nohs           1.25747    2.91330   0.432    0.667
## somehs         1.32151    2.91554   0.453    0.651
## hs             1.22013    2.91315   0.419    0.676
## somecollege    1.21174    2.91223   0.416    0.678
## associate      1.25122    2.91775   0.429    0.669
## bachelor       1.19029    2.91253   0.409    0.683
## grad           1.19536    2.91169   0.411    0.682
## 
## Residual standard error: 2.481 on 150 degrees of freedom
## Multiple R-squared:  0.09711,    Adjusted R-squared:  0.04895 
## F-statistic: 2.017 on 8 and 150 DF,  p-value: 0.04808

These models are highly ineffective, with R^2 values below 0.10. Instead, we will use the full model given in Lesson 12 and enhance it by including interactions between poverty + education variables and the urc2013 variable. We also subtract five of the variables from the full model, as they were found to decrease accuracy.

pov_urc = lm(value ~ . + (pov + nohs + somehs + hs + somecollege + 
    associate + bachelor + grad) * urc2013 - zcta_area - imp_a10000 - imp_a15000 - popdens_zcta - lon, data=pm)
glance(pov_urc)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.991         0.918 0.740      13.5 4.19e-34   784   12.5 1547. 5300.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The adjusted R-squared value of this model is 0.005 better than the full model, making it a slightly better model.

While our models that utilized data segmented into urban, suburban, and rural did not perform well, our enhanced full model delivered promising results. It is well known that higher education levels is correlated with lower poverty. Knowing this, we can look at how that interaction interacts with area type. We theorize that urban areas with high poverty and low education levels are usually industrial, as more rent prices are lower in these areas and the workers are generally less educated. Industrial areas are known for production, which usually causes pollution and poorer air quality. Meanwhile, rural areas as a whole are generally less developed. This means less pollution and better air quality. However, urban areas with low poverty and high education levels are generally “gentrified” and prioritize clean air, which would mean better air quality. Therefore, the interaction between these three factors (poverty, education, area type) is more complex than a simple linear interaction. This is why we chose to add these interaction effects to the full model.

Results & Discussion

To visualize our model’s performance the group plotted the actual values of pollution for our data in the test set against the model’s predicted values.

wf_fitted_values |> 
  ggplot(aes(x =  value, y = .fitted)) + 
  geom_point() + 
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Predicted vs. Actual Pollution Values for Linear Regression Modelling",
    subtitle = "Perfect predictions would fall on the dashed red line",
    x = "Predicted Values (µg/m³)",
    y = "Actual Values (µg/m³)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40"),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_line(color = "gray95"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

This is our graph of actual values of pollution versus the predicted values in our RF model.

ggplot(values_pred_test, aes(x=value,y=.pred))+
  geom_point(alpha = 0.5, color = "#2C3E50") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Predicted vs. Actual Pollution Values for Random Forest Modelling",
    subtitle = "Perfect predictions would fall on the dashed red line",
    x = "Actual Values (µg/m³)",
    y = "Predicted Values (µg/m³)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40"),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_line(color = "gray95"),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

Results of analysis and modeling. Primarily discussing the linear model’s RMSE (2.04) and the RandomForest’s RMSE (1.58), as well as the extension analysis (I think the best model has an adjusted R^2 value of 0.9145), which is higher than the full model shown in class (according to the Discord gc). Discuss how our limits include wrangling the dataset given its skew (mainly in the pov and popdens_zcta).

Our analysis reveals important insight into the relationship shared by poverty, density, and pollution levels.

In our extended analysis, we observed the dynamics between air pollution, poverty, and education levels given different geographical contexts. The low adjusted R^2 values across all the linear regression models within the URC subsets (urban, suburban, rural) suggest that poverty and education levels may not have a strong influence on pollution levels, regardless of area type.

Limitations within our work include working with variables with skewed distribution (mainly pov and popdens_zcta), which required log transformations to normalize data for meaningful analysis. This excluded zero values from our analysis and may have masked non-linear relationships, potentially limiting application beyond this study.

Conclusion

Within this study, we used machine learning models, specifically linear regression and random forest, to determine how accurately we can predict particulate matter pollution within the United states. Of the models tested, the Random Forest model performed the best, with a Root Mean Squared Error of 1.58, compared to the linear regression model with a RMSE of 2.04. Having a lower RMSE value indicates that the model’s predictions are closer to the actual values. With the Random Forest model, using the adjusted R^2, we determined that 91.45% of the variance within the data could be explained with our model. Within our extended analysis, we determined that education levels and poverty do not have different effects on particulate matter pollution levels within different URC classifications. Furthermore, given the low adjusted R^2 values across the prediction models, education levels and poverty in an area don’t seem to be indicative of pollution at all. Further research can include investigating interactions between other variables within the dataset, to improve prediction accuracy. Overall, we use machine learning methods to effectively predict particulate matter pollution levels within the US, which is a crucial step in identifying high-risk areas and gaining deeper insights into its effects on human health.

For our extended analysis, we first attempted to segment the data into three sets, each corresponding to an area type (urban, suburban, and rural), then building area-specific models. We tried to only use poverty level and education levels as features, but these models performed very poorly, with very low R^2 values. We decided to pivot to enhancing the full model introduced in class. We added interaction effects between urc2013 and both pov and the education variables, as well as removing the five variables that were found to be unhelpful in Lesson 12. With that, we were able to develop a model that was slightly more accurate than the full model.

To further understand if the dynamic between poverty level, education level, and area type can illuminate how to predict pollution, it would be easiest to use a dataset that contains information detailing the kinds of buildings in each area. This way, we could avoid making inferences about whether or not the area consists of high-polluting production plants or cleaner buildings. Further research could then look more closely at what buildings contribute to lower air quality. This would inform future urban development to prevent air quality issues and fight climate change.

References


  1. Cohen et al. (2017). Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015. Lancet (London, England), 389(10082), 1907–1918. https://doi.org/10.1016/S0140-6736(17)30505-6↩︎

  2. Health Effects Institute. 2018. State of Global Air 2018. Special Report. Boston, MA:Health Effects Institute.↩︎