## What does statistical significance mean?

One of my students sent me this article because we spend some time in class covering Type 1 and Type 2 errors.

All the .05 threshold means is that, when there is no true effect, you get a false positive 1 time in 20.  A .005 threshold would mean you get a false positive 1 time in 200.  So by moving to a .005 threshold, you’re less likely to get a false positive.  That’s good, right?  In common parlance, we’d be less likely to send an innocent person to jail.

Well, that depends.  At the .005 threshold, we’re more likely to get a false negative than we would be at the .05 level.  That means we’d be more likely to let a guilty person go free.  (Indeed, the only guaranteed way to send no innocent people to jail would be to send nobody to jail.  I, for one, am happy that folks like Charles Manson are behind bars.)

It isn’t as easy as saying, oh we should just switch to .005.  When you adjust the p-value threshold you’re making trade-offs between Type 1 and Type 2 errors.  With a lower p-value threshold you’re going to be getting a lot more false negatives even with fewer false positives.  What we always need to be cognizant of when we’re doing policy is that significance isn’t everything: we also have to think about what the damage is if this information turns out to be incorrect.  For example, doctors recommend that pregnant women should heat up cold cuts if they’re worried about listeria, which is a very low probability event but if it happens it’s horrible.  It’s pretty easy to avoid room temperature cold cuts for 9 months, so unless there are other difficulties attached to diet, women will probably follow this recommendation.  (And if one accidentally eats room temperature cold cuts while pregnant, one shouldn’t freak out because the probability of getting listeria is very low!)  But if we’re talking about something like doing chemotherapy or surgery, that’s a much more onerous action and we might want to be more sure we need it before going ahead with it.
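To make the trade-off concrete, here is a minimal sketch in Python. It assumes a one-sided z-test with known unit variance, and the effect size (0.3 standard deviations) and sample size (50) are made-up numbers chosen only for illustration; `power_one_sided_z` is a hypothetical helper, not something from the article.

```python
from statistics import NormalDist


def power_one_sided_z(effect_size: float, n: int, alpha: float) -> float:
    """Power of a one-sided z-test for a mean (assumes known unit variance)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)   # cutoff the test statistic must exceed
    shift = effect_size * n ** 0.5    # where the statistic centers if the effect is real
    return 1 - z.cdf(critical - shift)


# Illustrative (assumed) numbers: a true effect of 0.3 SD, 50 observations.
for alpha in (0.05, 0.005):
    power = power_one_sided_z(effect_size=0.3, n=50, alpha=alpha)
    print(f"alpha={alpha}: false negative rate = {1 - power:.2f}")
```

Under these assumptions the false negative rate goes from roughly a third at .05 to roughly two thirds at .005, with the sample and the true effect held fixed.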

Another thing to note is that the article talks about how physics and genetics have already made this switch, while most social sciences haven’t.  One big difference between the fields that have made the switch and the fields that have not is how easy it is to get large samples.  The larger your sample, the more it behaves like the population that you’re trying to study.  We can reduce both Type 1 and Type 2 errors simply by increasing the sample size.  So why don’t we do that?  Well, it turns out that increasing the sample size can be very expensive when you’re dealing with people and behavior.  Sometimes doing the study with a large enough sample to get 80% power and an alpha of .005 might be more expensive than just throwing that same money at the intervention you’re trying to decide about, whether or not it actually works.  There probably is some resistance because people in these fields want to be able to publish their 5% results, but that’s not the main or only reason we haven’t yet made the switch.  Research is complicated and expensive and we have to make trade-offs.
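As a rough sketch of what that costs in sample size, the snippet below uses the standard normal-approximation formula for a one-sided, one-sample z-test. The effect size of 0.2 standard deviations is an assumption for illustration, and `required_n` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def required_n(effect_size: float, alpha: float, power: float = 0.80) -> int:
    """Approximate n for a one-sided, one-sample z-test (assumes unit variance)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # cutoff implied by the significance threshold
    z_power = z.inv_cdf(power)      # extra distance needed to reach the target power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)


# Illustrative (assumed) effect size of 0.2 SD, 80% power.
for alpha in (0.05, 0.005):
    print(f"alpha={alpha}: need about {required_n(0.2, alpha)} observations")
```

In this simplified setting, holding power at 80% while moving from .05 to .005 takes nearly twice as many observations, whatever the effect size, which is where the expense comes in when each observation is a person.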

The context really does matter, and you shouldn’t necessarily put off making policy choices just because your sample size is too small to get significance (or make policy changes just because you have significance).  You always have to be aware of the costs and the benefits.

(Incidentally, in case he comes across this, Hi Dan!  I’m assuming that the reporter greatly simplified your arguments here because I know you must know this stuff.)

### 23 Responses to “What does statistical significance mean?”

1. Solitary Diner Says:

My medical school statistics teacher taught us that type 1 errors were the “American” errors, because Americans are more confident and therefore more likely to think that something is correct when it isn’t, while type 2 errors were the “Canadian” errors, because Canadians are more reserved and hesitant and therefore more likely to not think that something is correct even when it is. Totally stereotyping two different countries, but I still remember the difference 11 years later!

She also threw us candies if we got answers correct, which dramatically improved the lectures!

2. Shannon Says:

I think the other thing (which you get at here a bit) that students and researchers commonly miss is that statistical significance doesn’t necessarily mean substantive impact. For instance, I do have a very large data set I am working on (11,000+ bills in state legislatures). Lots of relationships are significant but really, really small – because I have a huge N. It seems that a lot of training these days focuses on significance, but not substance which is a big problem.

• nicoleandmaggie Says:

Or as Deirdre McCloskey famously says, “Significance does not equal oomph!”

I think we’re doing a better job on magnitudes these days than we were 15 years ago because of things like McCloskey’s speaking tour on oomph. My graduate students who have taken stats before are much better about looking at both magnitude and significance than they were when I started. But there’s still a lot of work to be done!

(Spoiler: one of my midterm questions always gets at significance vs. oomph in first semester stats.)
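Here is a small sketch of significance vs. oomph with a huge N. It assumes a correlation of 0.03 (a made-up, very weak relationship) and uses the Fisher z approximation for the test; `correlation_p_value` is a hypothetical helper, and the 11,000 is borrowed from the data set size mentioned above.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: correlation = 0 (Fisher z approximation)."""
    z_stat = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


r = 0.03  # assumed tiny correlation; r**2 is the share of variance explained
for n in (100, 11_000):
    p = correlation_p_value(r, n)
    print(f"n={n}: p={p:.4f}, variance explained={r ** 2:.2%}")
```

With 11,000 observations the p-value lands well below .005, while with 100 observations the same correlation is nowhere near significant. Either way it explains less than a tenth of a percent of the variance.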

3. Physics and genetics are only gradually making the switch to proper statistics. It has been estimated that something like 85% of biomarker studies are irreproducible, because the studies did not have sufficient statistical power to correct for the huge number of hypotheses being examined. P-hacking remains a serious problem in biology studies, as well as in sociology ones—neither biologists nor sociologists seem to be requiring training in multi-hypothesis analysis, with only high-school level statistics being required of their undergraduates.

Disclaimer: I taught bioinformatics for about 15 years and made sure that our bioinformatics and bioengineering programs included high-level courses in statistical inference—the abuse of p-values has been a hobbyhorse of mine for quite some time.

• nicoleandmaggie Says:

Sociologists have the same problem as economists and other social scientists when it comes to working with people as observations. However, economists are at least really good at showing how sensitive the results are, and we’re getting better at doing pre-registered experiments as well as the simple use of things like Bonferroni corrections.

Using a p-value cutoff isn’t bad so long as we realize that any single result is only suggestive. Even better if we show the ps that didn’t make it. Reproducibility is important, and it takes more than one study to “really know” something.

• omdg Says:

Corrections for multiple comparisons are nice when you’re dealing with an audience who doesn’t get that p-value<0.05 ≠ truth, though I've found the Bonferroni to be excessively conservative in many cases. Personally, I like showing the un-doctored p-values and then telling the reader how many comparisons I made so that they can decide for themselves what they think the results mean, but that's probably taking too optimistic an approach with respect to statistical literacy, especially among fellow physicians.

• nicoleandmaggie Says:

All the corrections are doing though is saying hey, you need a smaller p-value.

And it seems like an odd sort of magic that if you have 20 Y variables but only have a hypothesis about one (and only that one) that you don’t have to make the correction but you do have to make that correction if you didn’t have a hypothesis.

In reality if you do have that hypothesis it’s because of theory or because previous work also finds that, so it’s the theory or the fact that your study is partially replication that makes a lower p-threshold ok. But the strength of the theory or how many previous results it’s replicating aren’t taken into account. And you also wouldn’t “have to” do the correction if you just randomly picked one of the 20 and didn’t test the other 19. I mean, if there’s no true connection, you’re still going to get no significance 19/20 times you do this, but it doesn’t make the 1 time that is randomly wrong any more correct. And, of course, you can still make the other type of error in which you don’t call the previously significant result suggestive even if there actually is a connection.

So, I dunno. I still like the showing of all 20 regressions and pointing out that the one set of stars is what you would have gotten by random chance.
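The arithmetic behind the 20-regression example can be sketched as follows, assuming the 20 tests are independent and none of the relationships is real. Both function names are hypothetical.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float) -> float:
    """Probability at least one of several independent null tests comes up significant."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float) -> float:
    """Per-test cutoff that keeps the overall false positive rate at or below alpha."""
    return alpha / num_tests


tests, alpha = 20, 0.05
print(f"expected false positives: {tests * alpha:.1f}")
print(f"chance of at least one: {chance_of_any_false_positive(tests, alpha):.2f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold(tests, alpha):.4f}")
```

With 20 tests at .05 you expect one set of stars by chance alone and have about a 64% chance of seeing at least one. The Bonferroni correction just moves the per-test cutoff to .0025.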

• If you have only 20 regressions, then reporting all of them is feasible, though reporters and even other scientists are likely to do p-hacking on the results. Many biomarker experiments consist of picking sets of genes out of RNA-seq experiments, so that the number of hypotheses is about 10,000 choose 8, or roughly 2.5 × 10^27. There is no way to report *all* the results. The Bonferroni correction is a bit too conservative, but the methods used by many people resulted in 95% false positive rates: people are really good at seeing what they want to see, especially when only positive results can get published.

• nicoleandmaggie Says:

Sociologists aren’t doing RNA seq experiments though.

• Leah Says:

I have an MS in ecology and struggle *greatly* with statistics. I teach AP Bio now, and I have to teach SEM, SD, and Chi-squared. Some of my hardest lessons all year. This was a helpful read! I definitely need more training in statistics. I took one undergrad class and struggled through it. The teacher made sense when she talked, but I couldn’t apply the concepts in lab. Where can I go to learn more about this? I want to incorporate more data and data analysis in my high school bio (both AP and gen bio) classroom. I think that “big data” will continue to be a thing, and students really do need to be learning how to interpret data.

• nicoleandmaggie Says:

You teach structural equation modeling in high school? Ohh, looking up AP Bio and SEM, I see that means standard error of the mean.

We just did standard deviation in my class this week. We talk about measures of dispersion vs. the measures of central tendency we do before SD. I first give them a lot of examples of large vs. small standard deviations (some that are biased, some that aren’t), then we go through the equation the long way using Excel (because it’s really good for teaching the fill down command and other stuff). Your students probably have to know how to do it by hand which is a pain. (From a mathematical standpoint the SD is kind of a quadratic average, the root mean square, of the distance of each point to the mean…)

I’m not sure where would be best to learn more. There’s always Khan Academy, but I haven’t actually watched any of his statistics lectures. The math lectures of his that I have watched have corresponded to how I do things (which is how my best math teachers did things), so I tend to think they’re pretty good.
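For anyone who wants to see the long way outside of a spreadsheet, here is a minimal Python sketch. The data values are made up, and `sd_long_way` is a hypothetical helper; the standard library's `statistics.pstdev` and `statistics.stdev` are used only as a cross-check.

```python
from math import sqrt
from statistics import pstdev, stdev


def sd_long_way(values: list[float], sample: bool = False) -> float:
    """Standard deviation step by step: mean, squared distances, average, square root."""
    mean = sum(values) / len(values)
    squared_distances = [(x - mean) ** 2 for x in values]
    divisor = len(values) - 1 if sample else len(values)  # n - 1 for a sample SD
    return sqrt(sum(squared_distances) / divisor)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values with mean 5
print(sd_long_way(data), pstdev(data))              # population SD, both ways
print(sd_long_way(data, sample=True), stdev(data))  # sample SD, both ways
```

The `sample=True` version divides by n - 1 instead of n, which is the form students are usually asked to compute by hand from a sample.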

• omdg Says:

It sounds like some basic epidemiology would be as useful as, if not more useful than, basic stats, especially if your goal is how to interpret data. I like the book by Gordis, and it’s pretty accessible and has questions so you can quiz yourself (or your students heh heh).

• randompasserby Says:

I never took stats, but I need it now, and am currently working my way through an online Prob & Stat course from Stanford: https://lagunita.stanford.edu/courses/course-v1:OLI+ProbStat+Open_Jan2017/about. It’s self-paced, clear, well organized, has good examples, good visuals, useful feedback on incorrect answers, plus lets you practice describing your results in prose. Also – the new version lets you mess around with various stats programs. I’m using R – very cool for a previously Excel-only person.

4. nicoleandmaggie Says:

Tell them it doesn’t pass the Jimmy Kimmel test, which means a whole bunch of bad things (like lifetime caps being allowed, pre-existing conditions exclusions being allowed etc.). And it’s not going to be scored by the CBO before they vote on it.

Yes, call if your senators are Democrats. No, don’t call senators from other states. Yes, call if your senators are Republicans who always vote to hurt people and don’t care about their constituents. The stronger the push-back they get from this sort of excrement the less likely they are to succeed and the less likely they are to try to pull excrement like this in the future. They think we’re not paying attention. Prove them wrong. (And yes, you can fax.)

5. SP Says:

This is a great article. I don’t do any sort of statistical research work, so also educational. Thank you!

It is frustrating how few people in the general population have any understanding of statistics and evidence-based approaches to things. T and I had a discussion over this, and I wondered if teaching practical statistical concepts in high school would be more useful than teaching geometry to the average person. (I think you’d have to at least understand algebra I and functions, but maybe not trig or geometry.) My use case was educating people enough to be able to evaluate headlines and articles that purport things. This came up after a disagreement with someone about whether or not a medical hypothesis was valid. I acknowledged there was some anecdotal evidence, but the single peer reviewed study showed no statistical significance, a low sample size, and it wasn’t a randomized controlled trial. She argued that she had seen several instances of it helping in people she knew (which is… not an RCT and is the definition of anecdotal evidence!).

The human brain puts so much weight on personal experience, and not much on the scientific method. I wish high school could address this for the average reasonably intelligent person who is going into fields other than science. In the old days, we generally would trust our doctors, so only they had to be trained to sort through evidence and research and make recommendations. Now, people feel empowered because we have the internet to get access to information (articles by authors whose credentials we don’t understand, or the studies). Yet very few people are able to properly evaluate the information they find.

• nicoleandmaggie Says:

It is frustrating. But geometry is important too! Learning how to think the way proofs make you think makes it easier to understand computer programming, and knowing basic geometry is useful for a lot of mechanical stuff (from home projects to engineering).

I teach the availability heuristic (along with other heuristics and biases) in my stats classes in the middle of the semester as a fun thing.

Sadly, many doctors are not really trained to sort through evidence and research. I could tell you some horror stories about medical journals and articles accepted into them. (One of my friends even begged an MD to keep her name off an accepted paper because it was so bad!)

6. First Gen American Says:

I have an ask the grumpiest question. I would love a post on savings bonds as a vehicle for college savings…pros and cons vs 529. Also is one better for high earners? There seems to be some language about earning limits on the tax deductibility of the earnings but no penalty if not used for educational expenses.
