JeremyBowyer.com

Jeremy Bowyer

Note: To see the full code used to process the data and generate the chart, visit my GitHub.

Note: For a brief explanation of how it was done, view the methodology below.

It's no secret to anybody who has been following the 2016 election postmortem that something went wrong in the polling world. Pundits and analysts have focused much of their attention on the Rust Belt, and for good reason. There were five states where Clinton led the polls in the days before the election but that ended up going for Trump, and three of them were in the Midwest... if you count Pennsylvania as being at least "culturally Midwestern."

State	Clinton Poll Lead	Trump Win Margin	Polling Error
Wisconsin	6.3%	0.8%	7.1%
Pennsylvania	6.1%	1.1%	7.2%
Michigan	5.7%	0.3%	6.0%
North Carolina	2.0%	3.7%	5.8%
Florida	1.5%	1.2%	2.7%

538's Nate Silver was vocal about the possibility of correlated state polling error, and on its face the evidence seems to support that claim. Not included in the table is Ohio, where Trump was already leading in the polls before the election but outperformed them by 8.5%, which is similar to the polling error in Ohio's Rust Belt neighbors.

It's clear that the polling miss in the Rust Belt was crucial, given that these states flipping from blue to red provided Trump with the necessary electoral votes, but how abnormal was it? The truth is that the polling error was far greater in less electorally contentious states. The chart below shows the spread between Trump's win (or loss) margin and Trump's lead (or deficit) in the polls. You can drill down by state to see the spread between the county results and the state poll numbers.

Methodology & Code

Polling

The weighted poll average is created by roughly following 538's methodology, and the exhaustive poll dataset itself was made available on Kaggle. All transformation and analysis were done in R. After reading in the data and removing unnecessary columns, the data takes this form:

					> str(polls)
					'data.frame':	10236 obs. of  9 variables:
					 $ type           : chr  "polls-plus" "polls-plus" "polls-plus" "polls-plus" ...
					 $ pollster       : chr  "Google Consumer Surveys" "ABC News/Washington Post" "ABC News/Washington Post" "SurveyUSA" ...
					 $ state          : chr  "U.S." "U.S." "Virginia" "Florida" ...
					 $ population     : chr  "lv" "lv" "lv" "lv" ...
					 $ enddate        : chr  "10/31/2016" "10/30/2016" "10/30/2016" "10/24/2016" ...
					 $ poll_wt        : num  6.14 4.2 3.88 3.4 3.39 ...
					 $ rawpoll_clinton: num  37.7 45 48 48 46 ...
					 $ rawpoll_trump  : num  35.1 46 42 45 40 ...
					 $ rawpoll_johnson: num  6.18 3 6 2 6 7 4.2 3 6.3 5 ...

The data is in a long-form data frame, where each row represents a specific poll in a specific state on a specific date, going back to the beginning of the campaign.

The poll_wt field is a numerical value assigned to that poll by the analysts at 538, constructed using three factors:

538's Pollster Ratings
Sample Size
Recency

This means that to create a weighted average poll, we multiply each poll result (the columns beginning with "rawpoll") by its corresponding poll_wt value, sum those products within each state, and divide by the sum of that state's poll_wt values. In R, we can do that using the following code:

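				# For each state, compute the poll_wt-weighted average of Trump's raw poll numbers.
				# by() splits polls on State_ID; unlist(as.list(...)) flattens the result
				# into a named numeric vector with one element per state.
				# Note: State_ID is not among the columns shown by str(polls) above; it is
				# presumably a state identifier added elsewhere in the full code.
				trump_wtd_538 <-
					unlist(as.list(by(polls, polls$State_ID, function(x) {
						sum(x$poll_wt * x$rawpoll_trump, na.rm = TRUE) /
						sum(x$poll_wt, na.rm = TRUE)
					})))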

What's happening here is that by() takes the data frame polls, splits it by each unique value in the State_ID column, and applies our custom function to each resulting subset. The custom function performs the calculation described in the previous paragraph. In other words, we compute a weighted average of Trump's polls (in this particular case) for each state and save the results to a named numeric vector. We then do the same for Clinton and calculate the spread between the two candidates' polling averages. This is the value listed as "Poll Spread" in the map's tooltip above.
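To make the Clinton step and the spread calculation concrete, here is a minimal sketch on a tiny made-up data frame. The data frame polls_demo, its numbers, and the helper wtd_avg_by_state() are hypothetical and exist only for illustration. The sketch also swaps in base R's weighted.mean() in place of the manual sum-and-divide, and it assumes the spread is taken as Trump minus Clinton, matching the "Result Spread" convention described later in this post.

```r
# Toy data for illustration only; these numbers are made up, not real polls.
polls_demo <- data.frame(
  State_ID        = c("WI", "WI", "OH", "OH"),
  poll_wt         = c(2, 1, 1, 3),
  rawpoll_clinton = c(46, 49, 44, 42),
  rawpoll_trump   = c(40, 43, 45, 47),
  stringsAsFactors = FALSE
)

# Hypothetical helper: poll_wt-weighted average of one rawpoll column per state.
# weighted.mean(na.rm = TRUE) drops a missing poll value together with its weight.
wtd_avg_by_state <- function(df, col) {
  sapply(split(df, df$State_ID), function(x) {
    weighted.mean(x[[col]], x$poll_wt, na.rm = TRUE)
  })
}

trump_wtd   <- wtd_avg_by_state(polls_demo, "rawpoll_trump")
clinton_wtd <- wtd_avg_by_state(polls_demo, "rawpoll_clinton")

# Assumed sign convention: Trump minus Clinton, so a negative value
# means Clinton led the weighted average in that state.
poll_spread <- trump_wtd - clinton_wtd
print(poll_spread)
```

With this toy data, Clinton leads by 6 points in "WI" and Trump leads by 4 in "OH". One design note: weighted.mean() removes a poll's weight whenever that poll's value is missing, whereas summing the weights separately with na.rm = TRUE would still count that weight in the denominator.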

County and State Results

Actual results by state and county are from Dave Leip's Atlas of U.S. Presidential Elections. They don't require any special explanation or code to transform or clean up. The "Result Spread" figure in the chart above is simply the difference between Trump's vote share in a state/county and Clinton's vote share in that same location. The "Relative Performance" figure is what the map is colored by, and that is the difference between "Result Spread" and "Poll Spread." In other words, the difference between what happened and what the polls said would happen.
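As a quick check on those definitions, the arithmetic can be sketched in R using three rows from the table near the top of this post. The variable names below are my own for this illustration, and the Trump-minus-Clinton sign convention for the poll spread is an assumption carried over from how "Result Spread" is defined.

```r
# Values taken from the table near the top of this post (percentage points).
# Assumed sign convention: Trump minus Clinton.
states            <- c("Wisconsin", "Pennsylvania", "Michigan")
poll_spread_tbl   <- c(-6.3, -6.1, -5.7)  # Clinton led the polls, so negative
result_spread_tbl <- c(0.8, 1.1, 0.3)     # Trump won, so positive

# Relative performance: what happened minus what the polls said would happen.
relative_performance <- setNames(result_spread_tbl - poll_spread_tbl, states)
print(round(relative_performance, 1))
```

The results, 7.1, 7.2 and 6.0, match the "Polling Error" column of the table for these three states.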