Selection into the sample on the variable of interest: A Rant

First, a disclaimer:  I dislike mike the mad biologist.  I blame men like him for Trump being president because of his relentless attacks on “this woman, I would support any woman for president, just not this one” and the fact that he’s one of those assholes who does absolutely zero work and, again, attacks people who are actually doing something as always doing the wrong thing.  Those men make doing the things that they claim to support more difficult and emotionally draining while not lifting a finger themselves.  The only reason I ever look at his blog is because he’s on Bardiac’s blogroll and occasionally he’ll have one of those post headlines that I have to click.  Previously I thought he was just an unselfaware misogynist blowhard.  Now, I’m realizing that he … also doesn’t have very good statistics training.  (Interestingly, I’ve seen a recent survey that finds that people who are explicitly racist also just tend to be wrong about other things unrelated to their racism.  Not saying that translates to implicit misogyny, but… )

Ok, so here’s the post in question.  In it, he claims that a survey of people who aren’t getting vaccinated proves that time pressure and inability to take a day off work are not reasons.  (Therefore, any policies aimed at making vaccination easier or at getting people time off work are wasted effort.)

The problem?  If you click on the survey, it says it has a 6.1% weighted response rate.  (So only about 6 out of every 100 people they sent the survey out to actually responded.)  Also, it is an online survey.

If that 6.1% were a randomly selected sample of a general population, there wouldn’t be a problem.   The problem is when the selection into the sample is based on the outcome that you’re measuring.  In this case, if you’re measuring people who don’t have time or ability to get vaccinated, well, likely they don’t have time or ability to take a survey either.
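To make that concrete, here is a minimal simulation sketch.  Every number in it is made up for illustration (the 30% share of time-pressed people and the 2% versus 8% response probabilities are my assumptions, not figures from the survey); they are just chosen so the overall response rate lands near 6%.

```python
import random


def simulate_survey(n_people, share_time_pressed, p_respond_pressed,
                    p_respond_other, seed=0):
    """Simulate a survey where responding depends on the outcome itself.

    Returns (true_share, observed_share, response_rate), where the shares
    are the fraction of people who are time-pressed in the full population
    and among the people who actually answered the survey.
    """
    rng = random.Random(seed)
    pressed_total = 0
    respondents = 0
    pressed_respondents = 0
    for _ in range(n_people):
        # Is this person too time-pressed to get vaccinated?
        pressed = rng.random() < share_time_pressed
        # Time-pressed people are assumed less likely to answer a survey.
        p_respond = p_respond_pressed if pressed else p_respond_other
        responded = rng.random() < p_respond
        pressed_total += pressed
        if responded:
            respondents += 1
            pressed_respondents += pressed
    return (pressed_total / n_people,
            pressed_respondents / respondents,
            respondents / n_people)


# Hypothetical inputs: 30% time-pressed, who respond at 2% versus 8% for others.
true_share, observed_share, response_rate = simulate_survey(
    200_000, 0.30, 0.02, 0.08, seed=1)
print(f"true share time-pressed:     {true_share:.3f}")
print(f"share among respondents:     {observed_share:.3f}")
print(f"overall response rate:       {response_rate:.3f}")
```

Under these assumed numbers the expected response rate is 0.30 × 0.02 + 0.70 × 0.08 = 0.062, and the expected share of time-pressed people among respondents is 0.006 / 0.062, or just under 10%, even though 30% of the population is time-pressed.  The survey would make time look like a minor issue purely because of who answered.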

The Census Bureau isn’t stupid– they know this is a problem and they have lengthy documentation about the non-response bias in the sample generally.  They make it clear what you can trust the results for and what you can’t, as well as the limits of their weighting schemes.  The survey isn’t completely useless, but it is only externally valid for the groups that were surveyed!
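Here is a rough sketch of why weighting on demographics can’t rescue this.  It computes expected values for a toy population with two income groups; the group sizes, the shares who are time-pressed, and the response probabilities are all hypothetical, and the weighting shown is plain post-stratification on income, which is my simplification and not the Census Bureau’s actual scheme.

```python
# Hypothetical groups: name -> (population share, share who are time-pressed).
GROUPS = {"lower_income": (0.4, 0.5), "higher_income": (0.6, 0.1)}
# Assumed response probabilities: time-pressed people respond less often.
P_RESPOND = {True: 0.02, False: 0.08}


def expected_estimates(groups, p_respond):
    """Return (true, unweighted, weighted) expected shares of time-pressed people."""
    true_share = sum(pop * pressed for pop, pressed in groups.values())

    resp_mass = {}          # expected respondents per group (as population fraction)
    resp_pressed_mass = {}  # expected time-pressed respondents per group
    for name, (pop, pressed) in groups.items():
        pressed_resp = pop * pressed * p_respond[True]
        other_resp = pop * (1 - pressed) * p_respond[False]
        resp_mass[name] = pressed_resp + other_resp
        resp_pressed_mass[name] = pressed_resp

    # Unweighted: just pool everyone who answered.
    unweighted = sum(resp_pressed_mass.values()) / sum(resp_mass.values())

    # Post-stratified: take each group's respondent average and weight it
    # back up to that group's known population share.
    weighted = sum(pop * resp_pressed_mass[name] / resp_mass[name]
                   for name, (pop, _) in groups.items())
    return true_share, unweighted, weighted


true_share, unweighted, weighted = expected_estimates(GROUPS, P_RESPOND)
print(f"true share:       {true_share:.3f}")
print(f"unweighted:       {unweighted:.3f}")
print(f"weighted:         {weighted:.3f}")
```

With these made-up inputs the weighted estimate moves a little toward the truth but stays far below it.  The weights fix the income mix of the sample, but within each income group the people who answered are still disproportionately the ones who had time.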

I had been planning to use this little example of “sample selection on the Y variable” in my stats class this fall, but now I can’t because his response was so ironically ignorant that I have to blog about it instead.

Here’s his response:

The low income people who are supposed to be burdened by the time constraints also don’t report access as an issue compared to other factors. Are there any data that could convince you, or will the answer always be the same?

So– I guess I was right about his complete lack of self-awareness.  Can you imagine being convinced to make a huge policy change based on one extremely selected survey?  The people who ran the survey would never ever want you to make a policy change based on this result!

The answer to the question of what would be convincing (taking it seriously rather than just an accusation of me being set in my ways):

  1.  A nationally representative sample that found the same result.  The US government has some of these, where you are required to take the survey and they have much better response rates (not perfect, but much better).  This survey is not one of them.
  2.  A sample representative of the population we are trying to target with our policies (e.g., a state going into a few factories with onsite vaccination clinics before expanding the program).
  3.  Multiple biased surveys that are biased in different ways (that is, are not biased on the outcome variable of not having enough time).

No good policy maker would make policy based on such flimsy evidence as the survey mike the mad biologist presents.  In fact, we rarely make big changes based on the result of any one study, unless it is the only study available or is the only well-designed large experiment available.  And even then, good policy makers keep their eyes out for new evidence and try not to do huge national things when the evidence is scant.  Ideally we’ll have a large-scale randomized controlled trial, but failing that we’ll take a series of mixed methods– qualitative information, event studies (these two are the easiest and cheapest to do but can be biased depending on how they’re done), natural experiments, and so on.  Ideally we’ll have information about heterogeneity– we think, for example, that the effects of the Affordable Care Act and the effects of universal health insurance were different for Oregon compared to Massachusetts compared to Wisconsin or Tennessee.  And that could be because they have different populations and different starting environments, or it could be that each of these states had a different methodology used to study it with different biases.

Unlike Mike the Mad Biologist, every single thing I do (in research and in teaching!) has the potential of helping or harming someone’s life.  I have to be extremely careful.  I don’t make policy recommendations until the bulk of evidence supports those recommendations.  Because, getting back to that first disclaimer– I’m actually out there doing stuff, not just complaining about the people who do things.

So yeah, I teach my students about how not to use samples that are selected on your variable of interest.  It’s a more challenging concept than, say, people lying about their weight or height, but it is an extremely important one.  I have a lot of students who go out and design/make/evaluate policy when they graduate.  Hopefully the lessons I give them remain with them.

17 Responses to “Selection into the sample on the variable of interest: A Rant”

  1. mnitabach Says:

    Very interesting post! The conclusion that this particular survey is subject to response selection on the variable of interest of “available time” makes perfect theoretical sense. Is there a way to analyze the survey results themselves for empirical indications that it’s in fact what occurred? Or does it require independent empirical testing (such as by additional new or other previous surveys of the same population)?

    • nicoleandmaggie Says:

      There’s really only one representative sample that does time analysis (the ATUS, which is hooked up with the CPS) and that is very expensive so it’s unlikely for any study to try to repeat those questions. What people usually do is check against observables like demographics—that link does that and says it does not match up to the population. So they weight on demographics, but that doesn’t fix the problem because you’re still missing the groups with those demographics who are time pressured and you’re just weighting up the people who had time for the survey.

  2. bogart Says:

    Argh, yes.

    So, lots of research literally selects on the variable of interest, e.g. studying [effect of smoking] on smokers (with no control) which … I mean if you want to know something about the lived experience of smokers, OK, but if you want to be able to make causal claims about smoking, very much not OK.

    Here I’d argue that the variable of interest (did you get the vaccine) is likely hugely correlated with a variable that shaped response (available time, probably geographic location also) but that those variables aren’t the variable of interest. Though maybe that’s splitting hairs, if it was the (causal) variable of interest that the blog in question was examining.

    But…yeah. And to dig in on another angle having now looked @ the documentation, apparently 88% of US households could be included as part of the sample in the first place (meaning the Census Bureau had what appear to be valid emails and/or phone #s for same). Which sounds like a lot! But it’s an entirely safe bet that those 12% who are missing are not like the 88% that are included.

    And if I’m reading the study methodology right, it relies entirely on an online questionnaire for data collection, the ability to complete which is obviously going to vary contingent on internet access, which is not randomly distributed, either.

  3. CG Says:

    Ugh. This is a great example that I can use when I teach about survey research in a few weeks.

  4. Debbie M Says:

    Yes, yes, yes!

    Except I fear lots of policy decisions are made either in spite of evidence or without evidence, just based on it seeming like a good idea so that it can look like they are doing something (examples – post 9/11 airport harassment, Trump’s lowering taxes on the rich even though Reagan demonstrated that it didn’t work). But I wish we used only good studies. In fact, I wish we used surveys instead of voting and representative-phone-call tallies to make more of our decisions. I wish our main way to introduce new policies would be to get scientists to try them out on small groups (like those studies on giving people a free minimum income). Reality is often surprising, and things that don’t seem to make sense often do. But then you also need to teach people those results.

  5. teresa Says:

    Probably what you meant by it’s an online survey but… Not to mention the likelihood of a close correlation between being in a job/logistic/financial situation that makes it difficult to get vaccinated and having limited wifi access or phone data or a pay-as-you-go plan where you’re not about to spend 20 minutes of data/usage answering a survey. IF you even trusted the census bureau enough to have provided a real phone number or email address in the first place.

    Also a good example of “do these data actually answer the question I’m asking (or the question they purport to answer)?”
    It doesn’t seem like the survey is intended to parse what “concerned about side effects” means. Is it not wanting a sore arm and muscle aches? Fear of the scary-sounding but super rare and also-caused-by-COVID things like myocarditis? Conspiracy level things like not wanting to become a magnetic 5G antenna? Or fear that having chills, aches, and fatigue will keep you home from your job resulting in lost wages, missed days, and a lost job that you can’t afford? I’ve read about (admittedly not actually read the primary sources) data supporting the last- and if you don’t ask the question right you’ll never get an accurate answer.

  6. Omdg Says:

    There are concepts in economics and epidemiology that are difficult to grasp (Simpson’s paradox! Will Rogers bias!). This is not one of them. I am in pain at his stupidity.

    There was a widely lauded paper in anesthesiology that came out recently on burnout that had a 9% response rate, and everyone treated it as wonderful. I mean, I liked that the findings showed that lack of support at work, rather than your family, was the real culprit for causing burnout too! Bottom line: The temptation to take findings at face value when they agree with one’s previous biases and not look under the hood is real.

    • nicoleandmaggie Says:

      At least there there are more representative surveys on burnout on other populations so it would be surprising if anesthesiologists react well to reported lack of support at work. (The problem with most of these surveys is the potential for reverse causality—do you report lack of support because you feel burned out or vice versa? How do you measure support?)

  7. revanche @ a gai shan life Says:

    I was dismal at statistics and really wish I’d had a teacher who made sense but I also really hope your students are paying close attention to your classes and take that knowledge with them into their working lives.
