What Trump Can Teach Us About Predictive Modeling
Why evaluating and improving predictive modeling is much harder than it looks.
During the 2020 campaign, President Trump was fond of calling polls that showed him losing to Biden “suppression polls.” He explained it quite succinctly:
“When you’re down by 17 [in the polls], people say, ‘I just can’t waste my time. I’m not going to stand in line.’”
The point Trump is making is that polls don’t just reflect reality; they make reality. He is claiming a specific kind of effect: a self-fulfilling prophecy, where a prediction pushes behavior toward the predicted outcome. Of course, it’s easy to argue the opposite: polls can also be self-defeating prophecies. Imagine unenthusiastic supporters of a candidate becoming complacent and not wanting to stand in line to vote for a candidate they expect to win anyway. Perhaps you’re getting flashbacks to 2016.
This isn’t just a problem for FiveThirtyEight and other poll watchers; it’s an issue for any predictive model. Political polls are just a simple form of prediction. Complicated models using the latest and greatest machine learning techniques on massive data sets suffer from exactly the same issues. Knowing whether self-fulfilling effects, self-defeating effects, or neither dominate is incredibly important to evaluating the accuracy of a prediction.
From an evaluator’s perspective, the ideal case is for predictions not to affect outcomes. This rarely happens. For instance, political scientists used to think that close elections were basically random. Now there’s a very lively debate about whether politicians can influence close elections. As Trump often reminded us on Twitter, campaigns have access to polling data and can adjust strategies based on it to change outcomes. If we believe that their strategies are effective, there’s potential to sway voters and make polls less accurate.
A case study of this challenge is the Census Bureau’s Low Response Score (LRS). This model produced a score estimating the percentage of people in a geography who would not complete a Census form on their own. The higher the score, the lower the expected self-response rate: the metric the Census Bureau uses to measure success. The Bureau cares about this metric because when people don’t respond on their own, it has to spend money sending workers to visit their homes and, if no one answers the door, possibly make a statistical guess about who lives there, which reduces data quality.
Groups working on Census outreach widely used this score to decide where to spend their limited resources. Between the Census’s own marketing, foundations, state governments, and local governments, approximately $1B of outreach spending was allocated based on LRS. It’s important to ask whether it actually predicted what it was supposed to. On the surface, this should be a straightforward case: we have a widely used prediction score and clear, well-measured outcomes in the actual self-response rates.
The first step would be defining what success would look like. Measuring whether the score exactly predicted self-response rates is an unfair exercise. There were a lot of factors, such as a pandemic, that likely affected the overall response rates. More importantly, that’s not how decision-makers actually used the score. They treated it more like a classification exercise. This is a little reductionist, but most Census advocates used the score to segment geographies into two basic categories: areas to conduct outreach and areas to ignore. Most groups chose a cutoff of approximately the bottom 20th percentile of the Low Response Score, which they called “hard-to-count” (HTC) areas. These areas received the bulk of the investment from outreach groups.
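As a sketch of that segmentation step, here is how the HTC labeling might look in code. The tract IDs and scores below are made up for illustration (the real values live in the Census Planning Database), and I’m interpreting “bottom 20th percentile” as the 20% of areas with the worst expected self-response, i.e., the highest Low Response Scores.

```python
import random

random.seed(0)
# Hypothetical LRS values for 100 made-up tracts
# (higher score = less likely to self-respond).
tracts = {f"tract_{i:03d}": random.uniform(5, 45) for i in range(100)}

# The "bottom 20%" in the article's sense: the fifth of tracts with the
# worst expected self-response, i.e., the HIGHEST Low Response Scores.
cutoff = sorted(tracts.values(), reverse=True)[int(len(tracts) * 0.2) - 1]
htc = {t for t, score in tracts.items() if score >= cutoff}

print(len(htc))  # 20 tracts flagged as hard-to-count
```

Everything at or above the cutoff gets the outreach investment; everything below it is, in effect, ignored.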
It’s helpful to calibrate with a few counterfactuals. I chose two potential approaches for what you might do without a predictive score: a naive approach where you don’t differentiate geographies and a random approach where you randomly label geographies. If you took the naive approach and labeled every geography as not HTC, you’d be right about 80% of the time. If you took the random approach and picked 20% of geographies out of a hat and labeled them HTC, you’d be right about 68% of the time. For LRS to have been useful, it needed to be more accurate than 68%, and it should have done better than 80%. Otherwise, the Census groups might just have taken a peanut butter approach and spread their resources everywhere. I think either counterfactual is a defensible reference point, with the random approach being the more generous.
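Both baselines fall out of a few lines of arithmetic, with the 20% HTC base rate as the only input:

```python
# Back-of-the-envelope check on the two counterfactual baselines, assuming
# 20% of geographies truly end up hard-to-count (HTC).
p_htc = 0.20

# Naive approach: label everything "not HTC". You're correct exactly when
# the area is, in fact, not HTC.
naive_accuracy = 1 - p_htc

# Random approach: label a random 20% as HTC. You're correct when you label
# an HTC area HTC (0.2 * 0.2) or a non-HTC area non-HTC (0.8 * 0.8).
random_accuracy = p_htc * p_htc + (1 - p_htc) * (1 - p_htc)

print(f"naive:  {naive_accuracy:.0%}")   # 80%
print(f"random: {random_accuracy:.0%}")  # 68%
```

Swapping in a 10% base rate gives the 90% and 82% baselines used for the bottom-decile comparison below.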
For the country as a whole, the accuracy was around 81%. It just squeaked past the performance of the naive approach. When you dig into state-by-state numbers, there’s a lot of variation (see chart below). The numbers look even less rosy if you focus on the bottom 10% of geographies instead of the bottom 20%. There, LRS accuracy is only 87%: it barely beats random chance (82%) and actually loses to the naive approach (90%).
So how should we assess these data? Does it mean that nearly one billion dollars was misspent? I don’t have a great answer here, and I don’t think it’s possible to know. Imagine for a minute that LRS had perfectly predicted actual response rates. That would mean the outreach efforts didn’t really move the needle at all and probably weren’t necessary. And from my own analyses of the data, I do think the outreach investments made a difference. But the allocation clearly benefited some places and overlooked others.
Consider a place that overperformed self-response in the past precisely because there were considerable outreach investments. If resources are now shifted away from the place because it is no longer considered HTC, we would expect its performance to decline in the future. Over a longer period of time, predictive models combined with effective interventions can produce this type of yo-yo effect where places go in and out of being prioritized.
How would we design research to test the effectiveness of predictive modeling? You really need randomization, with some areas not using LRS for planning. Logistically, that would have been impossible since the Census Bureau did outreach with this score across the entire US. That said, this is something many organizations could do. For example, a company creating a predictive model of customer churn could randomly assign that information to some account managers and not others. It could then see whether the two groups had different churn rates going forward.
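A minimal sketch of that churn experiment, assuming hypothetical manager names and a placeholder for the churn data you’d collect after the trial period:

```python
import random

random.seed(42)
managers = [f"mgr_{i}" for i in range(50)]
random.shuffle(managers)

treatment = set(managers[:25])  # these managers see the churn scores
control = set(managers[25:])    # these managers do not

def estimate_effect(churn_by_manager):
    """Difference in average churn between the two groups after the trial.

    `churn_by_manager` maps each manager to the observed churn rate for
    their book of business. A negative result suggests the scores helped.
    """
    t = [churn_by_manager[m] for m in treatment]
    c = [churn_by_manager[m] for m in control]
    return sum(t) / len(t) - sum(c) / len(c)
```

In a real deployment you’d also want a significance test on the difference, but the core design is just this random split plus a post-period comparison.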
It may sound easy, but actually implementing this is really tough from a managerial perspective. Consider the account manager case. If account managers think this could affect their performance, they might not be too happy with the idea of random assignment. For experimentation, I’m a big fan of randomized rollouts to make change management a bit easier. With a randomized rollout, everyone gets the information eventually; some just get it earlier than others. You randomize people between these groups and avoid some of the awkwardness of withholding a potentially useful tool in the name of evaluation. This approach is especially effective if you have constraints, like scheduling trainings or data limitations, that prevent everyone from accessing the information at the same time. In those cases, randomization can be fairer than other options, like first come, first served.
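The rollout assignment itself is simple; here is a sketch where the wave count and names are purely illustrative. While only the first wave has access, the later waves serve as the comparison group.

```python
import random

def assign_waves(people, n_waves, seed=0):
    """Randomly split `people` into rollout waves of near-equal size."""
    rng = random.Random(seed)
    shuffled = people[:]
    rng.shuffle(shuffled)
    # Deal people into waves round-robin after shuffling.
    return {f"wave_{w + 1}": shuffled[w::n_waves] for w in range(n_waves)}

waves = assign_waves([f"mgr_{i}" for i in range(30)], n_waves=3)
# waves["wave_1"] gets the tool first; wave_2 and wave_3 are the
# not-yet-treated comparison groups until their turn comes.
```

Fixing the seed makes the assignment reproducible, which helps when you need to defend the fairness of who got the tool first.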
Fairness is important because people will game these systems. If one account manager has access to churn information for all clients, you can be sure that information will be flowing through Slack and email in short order to their friends. You can design in some barriers through access control, but it’s also crucial to have people aligned on the usefulness of the test. I wouldn’t trust myself for a second to outwit a group of ambitious account managers looking for an edge.
Communication about how predictive modeling works is really tough. As in the LRS case, if account managers do change their behavior, the scores may look inaccurate to them. It’s a really subtle point to convey that predictions that look inaccurate in hindsight could actually be helping by altering behavior. It’s equally important to be candid and describe how predictions may not help. Just because a score comes from a model trained on a huge dataset with a fancy methodology doesn’t mean success is guaranteed.
Finally, good predictive models are constantly evolving. A model that helped yesterday may not work today. Just ask the COVID modeling groups about that one. Or it may work better. Modelers and consumers of models have to celebrate victories but also be prepared to go back to the drawing board at a moment’s notice.
Misunderstandings About the Regression Discontinuity Design in the Study of Close Elections (https://imai.fas.harvard.edu/research/files/RD.pdf)
On The Validity Of The Regression Discontinuity Design For Estimating Electoral Effects: New Evidence From Over 40,000 Close Races (https://scholar.harvard.edu/files/jsnyder/files/ajps_2013_12_10.pdf)
For technical details, see Erdman and Bates (2017) https://www.census.gov/content/dam/Census/topics/research/erdman_bates_2017.pdf
For an example, see http://censushardtocountmaps2020.us/
I haven’t seen anyone publish a final tally on this but states spent approximately $350M (https://abcnews.go.com/US/wireStory/26-states-spending-350-million-2020-census-efforts-67251650).
The Census Bureau spent more than $415M https://www.wsj.com/articles/wpps-y-r-bid-far-lower-than-rivals-for-u-s-census-account-1473792087
All the data I use is from the Census Planning Database and response rate data hosted on data.world (https://data.world/uscensusbureau/2020-census-response-rate-data)
For math people out there, accuracy is true positives plus true negatives divided by the entire sample, so 80%*80% + 20%*20% = 68%
As a technical note, I used the bottom 20% to reflect how state/local groups tended to work