Is there a commonly used way to ask sensitive questions that isn't "push polling"? Some sort of proxy data, or side-channel like international money transfers? Or is there a mandate that questions be more direct for the census in particular?

For the 2020 census citizenship question, reading the press it certainly seemed like kneecapping the response was the point. So other than serving as an object lesson in what not to do, it doesn't really bear on someone making a good-faith effort to collect data. Is there an example of good data collection on a politically sensitive topic that you can point to as a contrast?

author

So there's a bunch of ways people do this. I'll probably write on this in more detail.

You can do A LOT with question design. So, for example, it might be better to ask someone if they have an SSN, ITIN, or neither instead of whether they're undocumented. That's not a perfect question, but it's way less sensitive than asking straight up if they're undocumented.

You can also do a lot with what I call process-of-elimination questions. You ask a series of Yes/No questions and leave out the final category, or you include some residual bucket. For example, in one project I was working on we wanted to know if people filed their taxes last year. So we asked them if they filed online, by mail, with a preparer, or if they didn't remember.

That last category provides people who didn't file an easy out without making them feel like they're tax dodgers. It's a little noisy, but not too bad.

-----

There's also a statistical approach of injecting noise into the answer so it's useful in aggregate, but not individually identifying. Here are a few variants:

Randomized Response: Ask someone to flip a coin before answering a yes/no question. If it lands heads, they answer truthfully. If it's tails, they answer "No". You can do this with other sources of randomness, like whether their birthday falls on an even or odd day.
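To see how the aggregate recovers despite the injected noise, here's a minimal simulation of the coin-flip variant described above (tails forces a "No"). The 30% true rate and sample size are made up for illustration; with this design the observed "Yes" rate is half the true rate, so the estimator just doubles it.

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """Coin-flip variant: heads -> answer truthfully, tails -> forced 'No'."""
    if random.random() < 0.5:  # heads
        return truthful_answer
    return False               # tails: forced "No"

random.seed(0)
n = 100_000
true_rate = 0.30  # hypothetical true prevalence of "Yes"

responses = [randomized_response(random.random() < true_rate) for _ in range(n)]

# P(observed "Yes") = 0.5 * true_rate, so double the observed rate.
observed_yes = sum(responses) / n
estimated_rate = 2 * observed_yes
print(round(estimated_rate, 3))  # close to 0.30
```

No individual response reveals much, but the population-level estimate comes out within sampling error of the truth.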

List Experiment: Start with a list of neutral policies. For a random half of respondents, add a controversial policy to the list. You then ask both groups to report only how many policies they agree with, never which ones. The difference in average counts between the two groups estimates support for the controversial policy.
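Here's a minimal simulation of the standard additive list-experiment design, where a treatment group sees one extra, sensitive item and everyone reports only a count. All the agreement rates below are made-up illustrations.

```python
import random

random.seed(1)

n_per_group = 50_000
p_neutral = [0.6, 0.4, 0.5]  # assumed agreement rates for three neutral items
p_sensitive = 0.25           # true (unobserved) support for the sensitive item

def count_agreed(include_sensitive: bool) -> int:
    """A respondent reports only how many listed items they agree with."""
    count = sum(random.random() < p for p in p_neutral)
    if include_sensitive and random.random() < p_sensitive:
        count += 1
    return count

control = [count_agreed(False) for _ in range(n_per_group)]
treatment = [count_agreed(True) for _ in range(n_per_group)]

# Difference in mean counts estimates support for the sensitive item.
estimate = sum(treatment) / n_per_group - sum(control) / n_per_group
print(round(estimate, 3))  # close to 0.25
```

No respondent ever says which items they agree with, yet the group-level difference in means recovers the sensitive item's support rate.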

For some examples, see https://dimewiki.worldbank.org/wiki/List_Experiments

I won't get into them here, but there are endorsement experiments as well, which randomize whether a controversial group/person endorses a policy. They're more limited in scope (see this article for more detail: https://onlinelibrary.wiley.com/doi/pdf/10.1111/ajps.12205?casa_token=7tcYpLuGX7IAAAAA:EA8pv_mk8AIqWrFqU3YuOdMvwav6j_rUMHuB0WD46jRkfdiyijvXrh9ndsDiE9MpwpSBnPD3ZiB2mlo).

List experiments, in particular, have fallen a bit out of fashion because they're difficult for a lot of respondents to understand. (See https://academic.oup.com/poq/article/83/S1/236/5525050 for more details.)

The bottom line is that if you have to ask a sensitive question directly, I really like randomized response. It's pretty straightforward to administer, and the respondent has some agency in the randomization, so they can feel like it's actually protecting their privacy.
