Creating Survey Weights in R Using Census Data
In survey research, the composition of a sample may differ notably from the population being modeled across important characteristics (e.g., race, age, education, party identification). These sampling errors often reflect systematic bias, which can threaten accuracy and the researcher’s ability to draw inferences from the data, especially if the error is correlated with the variables of interest.
In response, researchers often construct “post-stratification weights” — values assigned to each observation which can be used to correct for sampling error and match the sample to the population on key characteristics. The idea is simple: by increasing the influence of under-represented units and likewise decreasing the influence of over-represented units, researchers can create a more representative sample. This is particularly important given the dramatic increase in the use of online opt-in (or “nonprobability”) samples.
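To make the mechanics concrete, here is a minimal sketch (with made-up numbers) of the core calculation: a group’s weight is its population share divided by its sample share.
# Toy example with hypothetical numbers: suppose college graduates make up
# 30% of the population but 45% of our sample. Each one is down-weighted:
pop_share    <- 0.30
sample_share <- 0.45
pop_share / sample_share # ~0.667: each over-represented case counts as about 2/3 of a person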
An important implication is that you can only construct post-stratification weights using characteristics that are both measured in your data and known for the population; put more simply, if we want to make our sample reflect the population, we have to know what the population looks like! Thus, the American Community Survey (ACS), a premier data source with detailed population information, is an incredible resource for constructing population targets for survey weighting.
Last year, I created survey weights for a project fielded with the Tufts Public Opinion Lab, but the task proved fairly involved. First, accessing ACS data requires navigating the Census Bureau’s fairly labyrinthine website. In addition, the national ACS files total over 10 gigabytes, far more than my computer’s memory could handle; RStudio crashed the first time I tried to read them in.
In this blog post, I create weights for a simplified version of the TPOL data. I walk through the process of accessing ACS data and using it to construct survey weights using R. I wrote this piece for myself, future lab students, and anyone interested in learning how to access ACS data and use it to create survey weights in R.
This post assumes familiarity with the basic concept of post-stratification weights and their uses; those in search of additional reading on weighting are directed to this handy piece on different weighting methods by Pew. While there are multiple weighting methods, the {anesrake} package allows for accessible and automated raking, so I use this method to create the weights. To follow along, you will need a basic working knowledge of R. You’ll also use the {data.table}, {survey}, and {tidyverse} packages, so be sure to install them by running install.packages() in the console if you haven’t already.
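If you need to install any of these, one line will do it:
# One-time installation of the packages used in this post
install.packages(c("data.table", "survey", "tidyverse", "anesrake"))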
Choosing your weighting variables
There’s no magic “one-size-fits-all” set of variables to use when creating weights. To decide which variables to weight on, researchers should be mindful of a few things. We care most about addressing imbalances in the data that bias the outcomes of interest, so “good” weighting variables are generally correlated with the attitudes and behaviors being studied. And, as previously mentioned, we need well-measured, reliable population benchmarks for each variable we decide to weight the data on. Public opinion researchers often create weights using a similar set of demographic variables, including but not limited to race and ethnicity, age, sex, education, and geographic region. Other weighting variables can include voter registration status, party identification, religion, recalled or validated voting history, and (more recently) even social trust or primary participation.
For this tutorial, I choose to weight the data on race and ethnicity, region, age, sex, educational attainment, the interaction of education and race, and 2020 presidential vote. Population benchmarks for 2020 presidential vote can be obtained from the official FEC 2020 Presidential Election Results and the United States Election Project. To create the population benchmarks for demographic data, we turn to the ACS.
Accessing and downloading ACS data
Personally, I find the ACS website somewhat difficult to navigate. Searching the site for data generally leads you to the Census Bureau’s pretabulated data products, but for our purposes we need raw data that we can read in and reshape directly.
For this, we can use the ACS Public Use Microdata Sample (PUMS). You can directly access PUMS files for most years between 1996 and 2019 (the most recent available data) through the Census Bureau’s File Transfer Protocol (FTP) site: click this link and select a year. I turn to the most recent data (2019).
A quick note on ACS time periods
Remember, the ACS provides estimates for a specific time period: one year, three years, or five years. It is critical to note that 1-year estimates are not calculated as an average of 12 monthly values (and similarly, the 5-year estimates are not calculated as the average of 60 monthly values, nor as the average of five individual 1-year estimates).
Instead, the ACS collects survey data continuously, nearly every day of the year and then aggregates the results over the specific time period, spread evenly to avoid placing uneven weight on any given month or year within the period. For ACS 1-year data, this time period is the calendar year, so the 2019 ACS 1-year estimate covers January 2019 to December 2019. For ACS 5-year data, this time period is five calendar years, so the 2019 ACS 5-year estimate covers January 2015 to December 2019.
This creates a tradeoff for researchers choosing between 1-year and 5-year estimates. 1-year data is the most current, but 5-year data can generally be more reliable due to the larger sample size. If both estimates are available for your year in question, which one should you use?
For rapidly changing geographic areas, 1-year data are best, as the most current data are more likely to capture year-to-year fluctuations; note, however, that 1-year estimates are only available for geographic areas with at least 65,000 people. If you are hoping to illustrate a smooth trend, 5-year data may be best since the 5-year periods overlap. Above all, you must be consistent: pick either 1-year estimates or 5-year estimates (or 3-year estimates, if applicable)[1] and stick with them.
For this post, I will be using the 5-year data. Regardless of which period you choose, you should be in a directory filled with .zip files, following the general path:
www2.census.gov/programs-surveys/acs/data/pums/{YEAR}/{1-Year/5-Year}
How do you interpret these zip files to pick the appropriate data? Their names are constructed from three important features of the data:
{file format}_{record type}{state}.zip
“File format” takes one of two values: ‘unix’, denoting SAS datasets, and ‘csv’, denoting comma-separated value files. “Record type” is either ‘h’, denoting housing files, or ‘p’, denoting person files. Finally, the file name includes the relevant two-letter state abbreviation, with “us” denoting nationwide data.
I prefer to work with the .csv files. Because I am constructing weighting targets for the population of all American adults, I need the person record type for the “us” geography. So, I find “csv_pus.zip” and download the file. Once you download the file, unzip it (on a Mac, you can right-click and open with Archive Utility). You should see a README PDF and your .csv data. If you are using the 5-year 2019 U.S. data, you will see four .csv files (psam_pusa.csv, psam_pusb.csv, psam_pusc.csv, and psam_pusd.csv). Move the folder to an appropriate working directory[2] where you can access the data when you start constructing weights in R.
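If you prefer to stay in R, you can sketch the download programmatically; the URL below follows the FTP path pattern shown above (adjust the year and period as needed, and expect a multi-gigabyte download):
# Download and unzip the 5-year 2019 person-level U.S. PUMS file from R
url <- "https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/csv_pus.zip"
download.file(url, "csv_pus.zip", mode = "wb") # large file; this can take a while
unzip("csv_pus.zip", exdir = "csv_pus")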
Reading ACS data into R using fread()
You may notice that the 2015-2019 person-level U.S. data are … quite large. The four .csv files combined surpass 10 gigabytes. To avoid dealing with this headache, we can selectively read in the variables that we need.
As previously stated, in addition to presidential vote I choose to weight the data on race and ethnicity, region, age, sex, educational attainment, and the interaction of education and race. Glancing through the 2019 PUMS data dictionary, I can see that the key variables here are HISP, AGEP, REGION, RAC1P, SCHL, and SEX. Additionally, to create population proportion estimates from the ACS, which is itself survey data, we will also need the ACS weights, in this case the variable PWGTP.
Having identified our variables of interest, we can limit the data to just these variables and read it into R using the fread() command from the {data.table} package. Here, I do this and then combine each .csv file into a single object called ‘acs’:
# Library setup
library(data.table) # For fread()
library(survey) # To create weighted proportion tables
library(tidyverse) # I rely on the %>% operator from {tidyverse}
library(anesrake) # We use {anesrake} later to create the weights
# Census data is too big, so selectively read in the variables you need.
a <- fread("../../../Archived/Tutorials/Survey-Weights/csv_pus/psam_pusa.csv",
select= c("HISP", "AGEP", "REGION",
"RAC1P", "SCHL", "SEX", "PWGTP"))
b <- fread("../../../Archived/Tutorials/Survey-Weights/csv_pus/psam_pusb.csv",
select= c("HISP", "AGEP", "REGION",
"RAC1P", "SCHL", "SEX", "PWGTP"))
c <- fread("../../../Archived/Tutorials/Survey-Weights/csv_pus/psam_pusc.csv",
select= c("HISP", "AGEP", "REGION",
"RAC1P", "SCHL", "SEX", "PWGTP"))
d <- fread("../../../Archived/Tutorials/Survey-Weights/csv_pus/psam_pusd.csv",
select= c("HISP", "AGEP", "REGION",
"RAC1P", "SCHL", "SEX", "PWGTP"))
# Combine the csvs.
acs <- rbind(a,b,c,d)
rm(a,b,c,d) # Clear the individual objects to save on memory
acs <- subset(acs, acs$AGEP>17) # restrict data to adults
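As an optional sanity check (not part of the original workflow), you can confirm that the selective read kept the object to a manageable size:
dim(acs) # rows and columns after combining and filtering
format(object.size(acs), units = "auto") # approximate memory footprint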
Important note on matching data and targets
There are two important things to remember when creating weights using {anesrake}. First, the variables you weight on should have the same name in both your data and the population benchmarks. Second, the variables must have the same number of levels (see the CRAN documentation for more). In this case, my data is an anonymized version of the TPOL data. The relevant variables in the data are hispanic, sex, region, age5, educ4, race5, weduc, and pres. With that in mind, I reshape and rename the target variables to match my data using the following code:
# Recode variables. ####
acs$hispanic[acs$HISP!=1] <- 1 #Hispanic
acs$hispanic[acs$HISP==1] <- 2 #Nonhispanic
acs$sex[acs$SEX==1] <- 1 #Male
acs$sex[acs$SEX==2] <- 2 #Female
acs$region[acs$REGION==1] <- 1 # Northeast
acs$region[acs$REGION==2] <- 2 # Midwest
acs$region[acs$REGION==3] <- 3 # South
acs$region[acs$REGION==4] <- 4 # West
acs$age5[acs$AGEP<30] <- 1 # 18-29
acs$age5[acs$AGEP>=30 & acs$AGEP<45] <- 2 # 30-44
acs$age5[acs$AGEP>=45 & acs$AGEP<55] <- 3 # 45-54
acs$age5[acs$AGEP>=55 & acs$AGEP<65] <- 4 # 55-64
acs$age5[acs$AGEP>=65] <- 5 # 65+
acs$educ4[acs$SCHL<18] <- 1 # HS or less (SCHL 1-17, incl. GED)
acs$educ4[acs$SCHL>=18 & acs$SCHL<21] <- 2 # Some college (incl. associate's degree)
acs$educ4[acs$SCHL==21] <- 3 # College degree (bachelor's)
acs$educ4[acs$SCHL>21] <- 4 # Post-graduate degree
acs$race[acs$RAC1P==1] <- 1 #White alone
acs$race[acs$RAC1P==2] <- 2 #Black alone
acs$race[acs$RAC1P>2 &
acs$RAC1P<6] <- 3 #American Indian or Alaskan Native
acs$race[acs$RAC1P==6] <- 4 #Asian alone
acs$race[acs$RAC1P==7] <- 5 #Native Hawaiian and Other Pacific Islander alone
acs$race[acs$RAC1P==8] <- 6 #Some other race alone
acs$race[acs$RAC1P==9] <- 7 #Two or more races
acs$race5 <- 5 # Other (default)
acs$race5[acs$RAC1P==1 & acs$hispanic==2] <- 1 # White nonhispanic
acs$race5[acs$RAC1P==2] <- 2 # Black
acs$race5[acs$RAC1P==6] <- 4 # Asian
acs$race5[acs$hispanic==1] <- 3 # Hispanic (assigned last so it takes precedence)
acs$weduc <- NA
acs$weduc[acs$race5==1 & acs$educ4==1] <- 1 # White HS less
acs$weduc[acs$race5==1 & acs$educ4==2] <- 2 # White Some college
acs$weduc[acs$race5==1 & acs$educ4==3] <- 3 # White College Grad
acs$weduc[acs$race5==1 & acs$educ4==4] <- 4 # White Postgrad
acs$weduc[acs$race5>1 & acs$educ4==1] <- 5 # NW HS less
acs$weduc[acs$race5>1 & acs$educ4==2] <- 6 # NW Some college
acs$weduc[acs$race5>1 & acs$educ4==3] <- 7 # NW College Grad
acs$weduc[acs$race5>1 & acs$educ4==4] <- 8 # NW Postgrad
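Before building targets, a quick optional check that every case was covered by the recodes; each count below should be zero:
# Count NAs in each recoded variable (acs is a data.table, hence the .. prefix)
recoded <- c("hispanic", "sex", "region", "age5", "educ4", "race5", "weduc")
colSums(is.na(acs[, ..recoded]))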
Creating population benchmarks with {survey}
Now, our benchmarks need to ultimately take the form of a list of all target values, where each list element is a vector corresponding to the weighting targets for a single variable.[3] To rephrase in slightly less technical terms, we want to create a list of the variables we are weighting on (in this case race and ethnicity, region, age, sex, educational attainment, the interaction of education and race, and 2020 presidential vote). In this list, each item represents one variable and is a vector of numerical values, where each value is the proportion of the population represented by the corresponding level of the variable.
This may make more sense in concrete terms, so hopefully creating the list will help illustrate the concept. Recall that we can use the {survey} package to calculate weighted proportions from survey data (see my earlier tutorial for an overview of how to use {survey}). We can use {survey} and the ACS to create numeric vectors containing the proportional breakdown of our targets with the following code:
# Create survey design object using ACS and weights
svy.acs <- svydesign(ids=~1, data=acs, weights=acs$PWGTP)
rm(acs) # remove the ACS to save memory
# Time to create some targets! ####
sex <- svytable(~sex, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric() # Female 51.3 Male 48.7
region <- svytable(~region, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric() # NE 17.6 MW 20.9 S 37.8 W 23.6
age5 <- svytable(~age5, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric() # 18-29: 21.4 30-44: 25.1 45-54: 16.7 55-64: 16.6 65+:20.2
educ4 <- svytable(~educ4, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric() # HS or less, Some college, College grad, Postgrad
race5 <- svytable(~race5, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric() # White: 63.6 Black: 12.0 Hisp: 15.9 Asian: 5.7 Other:2.8
hispanic <- svytable(~hispanic, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric() # Hisp 15.9 Nonhisp 84.1
weduc <- svytable(~weduc, design=svy.acs) %>%
prop.table() %>%
round(digits=3) %>%
as.numeric()
pres <- c(.315, .288, .011, .385) #Biden, Trump, Other, No Vote
So here we see that, for example, sex is the vector (.487, .513), indicating that the first level of sex (male) is estimated to be 48.7 percent of the population while the second level (female) is 51.3 percent. We can also see that it’s easy to manually specify the targets if you already know them: pres is the vector of presidential vote targets created by hand.
Finally, we combine the individual vectors into a list using the list() function, which I name targets, and assign names to the vectors using the following code:
targets <- list(sex, region, age5, educ4, race5, hispanic, weduc, pres)
# remember, these names will have to match
names(targets) <- c("sex", "region", "age5", "educ4",
"race5", "hispanic", "weduc", "pres")
Remember that the names for each target variable will have to match the name of the variable in your collected data. Finally, we read in our data (if you’re following along, you can access my data by downloading it at this link), and clean it by removing any NA values in the variables we weight on:
poll <- read.csv("sample_data.csv")
# Remove cases with NAs on weighting variables
poll <- subset(poll, !is.na(sex))
poll$sex <- as.numeric(poll$sex)
poll <- subset(poll, !is.na(region))
poll$region <- as.numeric(poll$region)
poll <- subset(poll, !is.na(age5))
poll <- subset(poll, !is.na(educ4))
poll <- subset(poll, !is.na(race5))
poll <- subset(poll, !is.na(hispanic))
poll <- subset(poll, !is.na(weduc))
poll <- subset(poll, !is.na(pres))
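The repeated subset() calls above work fine; if you prefer something more compact, an equivalent version using complete.cases() looks like this (a sketch assuming the same variable names):
# Drop any case with an NA on a weighting variable in one pass
wtvars <- c("sex", "region", "age5", "educ4", "race5", "hispanic", "weduc", "pres")
poll <- poll[complete.cases(poll[, wtvars]), ]
poll$sex <- as.numeric(poll$sex)
poll$region <- as.numeric(poll$region)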
Using {anesrake} to create weights
To construct the actual weights, we use an automated raking procedure from {anesrake}, which iteratively adjusts each case’s weight until the weighted sample distribution closely matches the population distribution on our variables of interest. In this tutorial I focus on implementation and leave further discussion of the underlying procedure to DeBell and Krosnick (2009).
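To give a flavor of what a single raking pass does, here is a simplified, illustrative sketch (not {anesrake}’s actual code): for one variable at a time, each case’s weight is scaled by the ratio of the population share to the current weighted sample share of its category, and the algorithm cycles through the variables until the margins stop moving.
# Illustrative one-variable raking step; assumes x is coded 1..K and
# target is the matching vector of population proportions
rake_step <- function(w, x, target) {
  current <- tapply(w, x, sum) / sum(w) # current weighted share of each category
  w * target[x] / current[x] # up- or down-weight each case toward its target
}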
The anesrake() command has a few key arguments. The first is your list of targets, which in this case I have conveniently named targets. The second is simply your data, here named poll. The third is a unique case identifier for individuals in your dataset, in this case the variable ResponseId. You can manually set an upper limit on the values of your weights with cap, which defaults to 5 as suggested by DeBell and Krosnick (here I allow a more permissive cap of 8). Finally, type = "nolim" specifies that we want to weight on all of the variables included in targets.
We can thus create the weights and store them in an object called myweights, then create a variable called nationalweight in the data that contains our newly created weights:
# Create the weights
myweights <- anesrake(targets, poll,
caseid = poll$ResponseId, cap = 8, type = "nolim")
# Store the weights as a variable in your data
poll$nationalweight <- myweights$weightvec
And just like that, you’ve done it! Here, you may want to run an additional line of code to save your data with the weights as a new .csv file; if so, just run write.csv(poll, "file-name.csv").
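Before using the weights, it’s worth glancing at the raking diagnostics; the object returned by anesrake() supports summary(), which reports convergence and weight statistics:
summary(myweights) # convergence and design-effect diagnostics
summary(poll$nationalweight) # distribution of the new weights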
How do these weights change our results? By comparing the unweighted and weighted proportion tables using the following code, we can see that the unweighted results overestimate the proportion of self-identified liberals and underestimate the proportion of self-identified conservatives.
# Create a svy object using our new weights
poll.weighted <- svydesign(ids = ~1, data = poll, weights = poll$nationalweight)
# Unweighted proportion table
prop.table(table(poll$ideo5))
##
## Conservative Liberal Moderate Not sure
## 0.18398876 0.17275281 0.33497191 0.05196629
## Very conservative Very liberal
## 0.13553371 0.12078652
# Weighted proportion table
prop.table(svytable(~ideo5, design=poll.weighted))
## ideo5
## Conservative Liberal Moderate Not sure
## 0.19572039 0.15251333 0.32417495 0.06874058
## Very conservative Very liberal
## 0.14534022 0.11351054
[1] While somewhat irrelevant to our purposes here, I wanted to note that ACS 3-year estimates have been discontinued and are only available for the 2005-2007, 2006-2008, 2007-2009, 2008-2010, 2009-2011, 2010-2012, and 2011-2013 periods.
[2] Beginners in R looking to review working directories can refer to the segment beginning at 04:49 in the R basics video I created for Tufts students.
[3] If this sounds like gibberish to you, no worries! Maybe check out this overview of lists and vectors.