EDS 222: Statistics for Environmental Data Science - Final Project
From 2013 to 2021, does air quality (as measured by annual mean PM2.5 concentrations per census tract) vary with poverty rates (as measured by the percent of the population living below two times the federal poverty level per census tract) in California?
In California, events like wildfires can greatly reduce air quality by releasing fine particles called particulate matter, or PM2.5 (Shi et al., 2019). PM2.5 refers to particles with diameters ≤ 2.5 µm, which are known to be hazardous for human health. PM2.5 is especially detrimental for human respiratory and cardiovascular health (Cleland et al., 2021). As California’s wildfires continue to worsen over time, it is becoming increasingly important to monitor air quality, PM2.5 concentrations, and their impacts on populations (Gupta et al., 2018).
However, the environmental burden of poor air quality is not shared equally. For example, the San Joaquin Valley’s economically disadvantaged and ethnically diverse communities breathe some of the most polluted air in the nation (Cisneros et al., 2017). As a result, vulnerable communities such as Mexican American immigrant farm workers and their families experience disproportionately high rates of asthma attacks, hospital admissions, and other medical issues (Schwartz & Pepper, 2009). This inequitable pattern repeats itself in other states (Qian & Wu, 2019), the United States overall (Tessum et al., 2021), and even other countries (Li et al., 2018).
While there are many possible ways to explore the inequity of air pollution in California, I specifically use annual mean PM2.5 as a measure of air quality and poverty rate to quantify socioeconomic disparities. I expect to find a significant relationship between these two variables.
My null hypothesis (\(H_0\)) is that, in California, there is no relationship between annual mean PM2.5 concentrations per census tract and percent of the population living below twice the federal poverty line per census tract.
My alternative hypothesis (\(H_A\)) is that, in California, there is a relationship between annual mean PM2.5 concentrations per census tract and percent of the population living below twice the federal poverty line per census tract.
knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      include = TRUE)
#turn off scientific notation and select how many digits to round outputs to
options("scipen" = 999, "digits" = 4)
#import necessary libraries
library(tidyverse)
library(here)
library(gt)
library(xtable)
library(kableExtra)
#read in data
c1 <- read_csv(file = here("_posts",
                           "2021-11-18-calenviroscreen",
                           "CES",
                           "CES_data",
                           "ces1_2013.csv"))
c2 <- read_csv(file = here("_posts",
                           "2021-11-18-calenviroscreen",
                           "CES",
                           "CES_data",
                           "ces2_2014.csv"))
c3 <- read_csv(file = here("_posts",
                           "2021-11-18-calenviroscreen",
                           "CES",
                           "CES_data",
                           "ces3_2018.csv"))
c4 <- read_csv(file = here("_posts",
                           "2021-11-18-calenviroscreen",
                           "CES",
                           "CES_data",
                           "ces4_2021.csv"))
#clean data
##select and rename necessary columns
##add Year column
c1_clean <- c1 %>%
  select(c("ZIP Code", "Poverty", "PM2.5")) %>%
  mutate(Year = "2013") %>%
  dplyr::rename(ZIP = "ZIP Code") %>%
  mutate(ZIP = as.numeric(ZIP))
c2_clean <- c2 %>%
  select(c("Census Tract", "California County", "ZIP", "Longitude", "Latitude", "Poverty", "PM2.5")) %>%
  mutate(Year = "2014",
         ZIP = as.numeric(ZIP))
c3_clean <- c3 %>%
  select(c("Census Tract", "California County", "ZIP", "Longitude", "Latitude", "Poverty", "PM2.5")) %>%
  mutate(Year = "2018",
         ZIP = as.numeric(ZIP))
c4_clean <- c4 %>%
  select(c("Census Tract", "California County", "ZIP", "Longitude", "Latitude", "Poverty", "PM2.5")) %>%
  mutate(Year = "2021",
         ZIP = as.numeric(ZIP))
#fill in missing column data for c1 dataset so all datasets have the same columns
##the 2013 data are keyed by ZIP code, so tract-level columns are borrowed from the 2014 data
c1_fill <- full_join(x = c2_clean, y = c1_clean,
                     by = "ZIP",
                     suffix = c(".c2", ".c1")) %>%
  select(c("Census Tract",
           "California County",
           "ZIP",
           "Longitude",
           "Latitude",
           "Poverty.c1",
           "PM2.5.c1",
           "Year.c1")) %>%
  dplyr::rename(Poverty = "Poverty.c1",
                PM2.5 = "PM2.5.c1",
                Year = "Year.c1")
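The plots and models later in this post use a data frame called `joined`, whose construction is not shown in the code above. One plausible way to build it, sketched here as an assumption rather than the original code, is to stack the four yearly tables into one long table with `bind_rows()`. The toy tables and their values are invented for illustration; in the real pipeline the inputs would presumably be `c1_fill`, `c2_clean`, `c3_clean`, and `c4_clean`.

```r
library(dplyr)

# Toy stand-ins for the four cleaned tables; all values are made up.
toy_2013 <- data.frame(ZIP = c(90001, 90002), Poverty = c(40, 55),
                       PM2.5 = c(12.1, 13.4), Year = "2013")
toy_2014 <- data.frame(ZIP = c(90001, 90002), Poverty = c(42, 53),
                       PM2.5 = c(11.8, 12.9), Year = "2014")
toy_2018 <- data.frame(ZIP = c(90001, 90002), Poverty = c(39, 50),
                       PM2.5 = c(11.2, 12.5), Year = "2018")
toy_2021 <- data.frame(ZIP = c(90001, 90002), Poverty = c(37, 49),
                       PM2.5 = c(11.0, 12.2), Year = "2021")

# Stack the years into one long table with a shared set of columns
# (hypothetical reconstruction of how `joined` might be assembled)
joined_toy <- bind_rows(toy_2013, toy_2014, toy_2018, toy_2021)

# One row per tract-year, so each year can be faceted or modeled separately
count(joined_toy, Year)
```

Keeping `Year` as a character column, as the cleaning code does, lets ggplot2 treat it as a discrete variable for faceting and fill colors.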
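I downloaded 2013 - 2021 CalEnviroScreen (CES) data from the California Office of Environmental Health Hazard Assessment (OEHHA) and the California Open Data Portal. The four releases used here are CES 1.1 (2013), CES 2.0 (2014), CES 3.0 (2018), and CES 4.0 (2021).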
Each data set contains columns of environmental pollution burden indicators, including PM2.5, and population characteristics, including rates of poverty. Each census tract in California is represented as a row and assigned a value per environmental indicator and population characteristic. One thing to note is that due to developing technology over time, earlier data have different sample sizes and sampling techniques compared to newer data, which can complicate how we compare the data across time.
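Out of the myriad components of the CES data, I am interested in two: the annual mean concentration of PM2.5 and the percent of the population living below two times the federal poverty level.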
The annual mean concentration of PM2.5 is a weighted average of measured monitor concentrations and satellite observations (µg/m3) over 3 years to account for uneven sampling frequency. For example, the CES 1.1 report used data from 2007 - 2009 while the CES 4.0 report used data from 2015 - 2017. All reports used data from the California Air Resources Board’s Air Monitoring Network (AMN), while CES 3.0 and CES 4.0 also incorporated Satellite Remote Sensing Data.
Data were more likely to be high resolution around certain cities or localized areas, and not all cities had air monitoring stations. Locales with little to no data were either omitted or estimated using nearby locations’ data. For example, in CES 1.1, census tracts with centers > 50 km away from the nearest air monitor were omitted. In CES 4.0, missing data were estimated using regression relationships with nearby sites.
For CES 1.1 - 3.0, the quarterly mean PM2.5 concentrations were estimated using ordinary kriging. For CES 4.0, overall PM2.5 annual mean concentrations were estimated for the center of each 1km x 1km grid cell using both the monitoring and satellite data in a weighted average. An inverse-distance weighting method was used, so grid cells close to monitors relied more heavily on monitor estimates while grid cells further from monitors relied more heavily on satellite data. Grid cells with monitors > 50km away relied solely on satellite data.
The quarterly estimates were then averaged to calculate annual means (Figure 1).
ggplot(data = joined, aes(x = PM2.5)) +
  geom_histogram(aes(fill = Year), binwidth = 1) +
  theme_classic() +
  facet_wrap(~Year, ncol = 2) +
  labs(x = expression(paste("Mean PM2.5 per census tract (µg/m"^3~")")),
       y = "Frequency") +
  theme(legend.position = "none")
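To make the CES 4.0 blending idea more concrete, here is a minimal sketch of how a grid cell's estimate could lean on monitor data near a monitor and on satellite data farther away. The function `blend_pm25()` and its linear distance ramp are assumptions invented for illustration; this is not the OEHHA formula and not true inverse-distance weighting, only a simple distance-based weight with the same 50 km cutoff described above. The last lines show quarterly estimates being averaged into an annual mean, again with made-up numbers.

```r
# Hypothetical blend of monitor-based and satellite-based PM2.5 for grid cells.
# The linear weight is an illustrative assumption, not the CES method.
blend_pm25 <- function(monitor_est, satellite_est, dist_km, cutoff_km = 50) {
  # Monitor weight falls from 1 (at the monitor) to 0 (at or beyond the cutoff)
  w_monitor <- pmax(0, 1 - dist_km / cutoff_km)
  w_monitor * monitor_est + (1 - w_monitor) * satellite_est
}

# Three grid cells at increasing distance from the nearest monitor
blended <- blend_pm25(monitor_est   = c(12, 12, 12),
                      satellite_est = c(8, 8, 8),
                      dist_km       = c(0, 25, 60))
blended

# Average four made-up quarterly estimates into an annual mean
quarterly <- c(Q1 = 10.2, Q2 = 8.7, Q3 = 11.5, Q4 = 12.0)
annual_mean <- mean(quarterly)
annual_mean
```

The cell at the monitor takes the monitor value, the cell 25 km away takes an even mix, and the cell beyond 50 km relies solely on the satellite value.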
The percent of the population living below two times the federal poverty level was calculated using a 5-year estimate to produce more reliable results for geographic areas with small populations. For example, the CES 1.1 report used a 5-year estimate from 2007 - 2011 data while the CES 4.0 report used a 5-year estimate from 2015 - 2019 data. Poverty data came from the American Community Survey.
CES defined poverty as living below twice the federal poverty line to account for California’s high cost of living relative to other states and because the federal poverty threshold has not changed since the 1980s despite the cost of living increasing over time. The percent per census tract was calculated as the number of individuals living below 200% of the poverty level in the census tract divided by the total population of that census tract (Figure 2). Standard error was calculated to determine the reliability of the calculated poverty rate. Census tracts with unreliable estimates were assigned no value for poverty rate.
ggplot(data = joined, aes(x = Poverty)) +
  geom_histogram(aes(fill = Year), binwidth = 5) +
  theme_classic() +
  facet_wrap(~Year, ncol = 2) +
  labs(x = "Poverty rate per census tract (%)",
       y = "Frequency")
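As a small worked example of the poverty rate calculation, the sketch below computes the percent of each tract's population living below 200% of the poverty level and blanks out unreliable estimates. It assumes the denominator is the tract's total population with known poverty status. The column names, the counts, and the `reliable` flag are all hypothetical; the actual CES reliability rule based on standard error is not reproduced here.

```r
# Toy tract-level counts; all numbers are made up.
tracts <- data.frame(
  tract      = c("A", "B", "C"),
  below_200  = c(1200, 300, 45),      # individuals below 200% of the poverty level
  population = c(4000, 2500, 90),     # individuals with known poverty status
  reliable   = c(TRUE, TRUE, FALSE)   # hypothetical flag derived from the standard error
)

# Percent of each tract's population below twice the federal poverty level
tracts$Poverty <- 100 * tracts$below_200 / tracts$population

# Unreliable estimates are assigned no value, mirroring the CES treatment
tracts$Poverty[!tracts$reliable] <- NA

tracts
```

Tracts with `NA` poverty rates would then be dropped before modeling, which is what the `drop_na(Poverty, PM2.5)` call later in the post does.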
To assess if, in California from 2013 - 2021, air quality varied with poverty rates, I ran a linear regression of PM2.5 ~ Poverty for each year of data (2013, 2014, 2018, and 2021). This analysis is appropriate to describe how air quality might be changing with respect to poverty rates. Running a separate regression for each year can help determine how this relationship could be changing over time.
This method is limited by the fact that I am only including one independent variable (Poverty) in the model. In other words, this analysis is vulnerable to omitted variables bias because it is likely that there are many different factors in addition to poverty that influence air quality. Nevertheless, this is a solid starting point for unraveling those complex relationships.
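The per-year regression can be sketched as follows. This is a minimal illustration on simulated data, not the original analysis code: the data frame `toy`, the seed, and the built-in slope of 0.03 are arbitrary assumptions chosen only so the example runs on its own. With the real data, `toy` would be replaced by the combined CES table.

```r
set.seed(222)

# Simulated stand-in for the combined CES table; the slope of 0.03 is arbitrary.
years <- c("2013", "2014", "2018", "2021")
toy <- data.frame(Year    = rep(years, each = 200),
                  Poverty = runif(800, min = 0, max = 100))
toy$PM2.5 <- 9 + 0.03 * toy$Poverty + rnorm(800, sd = 2)

# Fit PM2.5 ~ Poverty separately for each year
fits <- lapply(split(toy, toy$Year), function(d) lm(PM2.5 ~ Poverty, data = d))

# Collect the Poverty slope, its standard error, and its p-value per year
slopes <- t(sapply(fits, function(m) {
  coef(summary(m))["Poverty", c("Estimate", "Std. Error", "Pr(>|t|)")]
}))
slopes
```

Each row of `slopes` corresponds to one year, so the estimates can be compared side by side to judge whether the relationship shifts over time.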
For all time periods, annual mean PM2.5 concentrations were significantly associated with the poverty rate (Figure 3). In 2013, PM2.5 increased by 0.0352 µg/m3 for each 1 percentage point increase in the poverty rate (p-value < 0.001, sd = 0.0044). In 2014, PM2.5 increased by 0.0279 µg/m3 for each 1 percentage point increase in the poverty rate (p-value < 0.001, sd = 0.0014). In 2018, PM2.5 increased by 0.0299 µg/m3 for each 1 percentage point increase in the poverty rate (p-value < 0.001, sd = 0.0014). In 2021, PM2.5 increased by 0.0286 µg/m3 for each 1 percentage point increase in the poverty rate (p-value < 0.001, sd = 0.0013). These results support my alternative hypothesis that mean PM2.5 and poverty in California are related, and they show that the relationship is positive.
Over time, the relationship between mean PM2.5 and poverty rate has remained fairly stable with the slope only varying from 0.0279 to 0.0352.
joined <- joined %>%
  drop_na(Poverty, PM2.5)
ggplot(data = joined, aes(x = Poverty, y = PM2.5)) +
  geom_point(aes(color = Year), alpha = 0.05) +
  geom_smooth(method = "lm",
              formula = y ~ x,
              size = 1,
              color = "black") +
  theme_classic() +
  labs(x = "Poverty rate (%)",
       y = expression(paste("Mean PM2.5 (µg/m"^3~")"))) +
  facet_wrap(. ~ Year, ncol = 2) +
  theme(legend.position = "none")
As expected, I found a statistically significant relationship between air quality and poverty rates in California during 2013, 2014, 2018, and 2021. For all four years, annual mean concentrations of PM2.5 (µg/m3) increased as the percent of people living below twice the federal poverty level increased (Figure 3). In other words, air quality was on average lower in census tracts with higher poverty rates. These findings supported my hypothesis and corroborated prior research that has identified PM2.5 disparities based on socioeconomic factors in California (Mousavi et al., 2021). This analysis also emphasizes the importance of an environmental justice lens when investigating issues such as air quality.
While my analysis focused on four specific years of comprehensive CalEnviroScreen data, it would be interesting to expand the time frame to before 2013 because 2013 is when California’s cap-and-trade program was initiated. During this time, there is evidence that while overall greenhouse gases were reduced in California, socioeconomically disadvantaged communities actually experienced emission increases (Cushing et al., 2018).
The full code can be accessed here.
For attribution, please cite this work as
Forsline (2021, Dec. 2). Mia Forsline: From 2013 - 2021, California census tracts with higher poverty rates demonstrate worse air quality. Retrieved from miaforsline.github.io/
BibTeX citation
@misc{forsline_ces, author = {Forsline, Mia}, title = {Mia Forsline: From 2013 - 2021, California census tracts with higher poverty rates demonstrate worse air quality}, url = {miaforsline.github.io/}, year = {2021} }