“You interpret [your data] in various different ways until you find a positive, statistically significant result in favor of some variant of your original hypothesis.”
First, we’re going to get a sense of what “statistically significant” means. Imagine you flip a coin five times in a row. You’re doing this because you suspect that the coin is biased, and you want to test to see if that’s the case. The hypothesis that the coin is fair is your null hypothesis. What you’re looking for is the probability that the null hypothesis is correct; this is defined as the p-value. The lower the p-value is, the more license you have to reject the null hypothesis, and therefore accept an alternative hypothesis–in this case, that the coin is biased. If you get a result that’s p < 0.05 (the typical threshold for statistical significance) then this means there is a < 5% chance that the coin is fair.
If you read that paragraph and nodded along, because based on what you remember from your intro statistics class that seems to be an accurate description of a p-value, then you’re in the majority.2
But I lied. That’s actually not what a p-value is.
A p-value measures the probability of getting data at least as extreme as the data observed, given that the null hypothesis is true. If I flipped five heads in a row, then the probability of getting that result, if the coin were fair, would be p = (½)⁵ ≈ 0.03. But this does not mean that there’s only a 3% chance that the coin is fair.
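To make the arithmetic concrete, here is a minimal sketch of that calculation. The function name `one_sided_p_value` is my own, and the test is one-sided: it only counts results with at least as many heads as observed. A two-sided test, which treats five tails as equally extreme, would double the figure.

```python
from math import comb

def one_sided_p_value(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` flips), assuming the coin is fair."""
    # Count every outcome with `heads` or more heads, out of 2**flips equally likely outcomes.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

print(one_sided_p_value(5, 5))  # 0.03125
print(one_sided_p_value(4, 5))  # 0.1875
```

Note what the function takes as given: the coin is fair. It says how surprising the data would be under that assumption, and nothing about how likely the assumption itself is.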
The p-value measures the probability of getting the data we got (or more extreme data), given the null hypothesis being true. But the measure that we care about more, and the measure that the p-value too often gets interpreted as, is the probability of the null hypothesis being true, given the data.
We care about P(null hypothesis|data).3 The p-value gives us P(data|null hypothesis). These two quantities are not equivalent: P(A|B) ≠ P(B|A). The probability of you owning a deck of cards given that you’re a magician (probably pretty high) is not the same as the probability of you being a magician given that you own a deck of cards (probably pretty low). This is a subtle but important distinction.
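The two quantities are linked by Bayes’ theorem, but getting from one to the other takes extra inputs that the p-value doesn’t supply. The sketch below illustrates this for the five-heads example. Both inputs are assumptions made up for illustration: a prior belief that the coin is fair (`prior_fair`), and a single alternative in which a biased coin lands heads 90% of the time (`heads_prob_if_biased`).

```python
def posterior_fair(prior_fair: float, heads_prob_if_biased: float, heads: int) -> float:
    """P(coin is fair | `heads` heads in a row), via Bayes' theorem.

    Assumes only two possibilities: a fair coin, or a coin biased toward
    heads with the given probability. Both inputs are illustrative guesses.
    """
    likelihood_fair = 0.5 ** heads                      # P(data | fair)
    likelihood_biased = heads_prob_if_biased ** heads   # P(data | biased)
    evidence = prior_fair * likelihood_fair + (1 - prior_fair) * likelihood_biased
    return prior_fair * likelihood_fair / evidence

# If 99% of coins are fair, five heads in a row still leaves the coin probably fair.
print(round(posterior_fair(0.99, 0.9, 5), 2))  # 0.84
# If fair and biased coins are equally common, the same data is far more damning.
print(round(posterior_fair(0.5, 0.9, 5), 2))   # 0.05
```

The data, and therefore the p-value of 0.03, is identical in both calls. Only the prior changes, and the answer to the question we actually care about swings from 84% to 5%.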
This doesn’t mean that p-values are worthless; P(A|B) and P(B|A) are related to each other. But it does mean that p-values can’t give us the full story, which is one of the reasons why scientists are talking about moving away from the use of p-values. One journal has banned p-values altogether. Unfortunately, change in scientific communities often happens slowly; the dangers of overusing p-values have been talked about for at least 30 years. For now, we’re stuck with the reality of journals often using p-values as easy metrics to discriminate between publish-worthy and non-publish-worthy potential papers.
Driven by this reality, up-and-coming scientists sometimes try to interpret their data favorably to ensure that their p-value gets below the acceptable p < 0.05 “bright line” for publication. This is a phenomenon known as p-hacking. There are a number of different ways that scientists can do this:
- Stop data collection whenever you hit significance. You initially plan to flip your coin ten times. You flip your coin five times, get five heads, and then stop flipping because you worry that if you flip the coin more times you’ll start getting tails. Besides, five heads in a row is enough to get statistical significance. Coin is biased, p = 0.03!
- Subgroup analysis: Post-data collection, divide your results into different subgroups to see if any one of them comes up significant. You flip a coin ten times, but sporadically over separate days, and get five heads and five tails. While this may seem like the coin isn’t biased, that’s only because you aren’t looking hard enough. You look more closely at the data and notice that all five heads came on either a Monday or Wednesday. That can’t be a coincidence. Coin is biased Mondays and Wednesdays, p = 0.03! (Saved for a separate publication: coin is biased the opposite direction on Tuesdays, Thursdays, Fridays, Saturdays, and Sundays, p = 0.03!)
- Collect data for many different measures, but only report the ones that are significant. This one is hard to explain using coin-flipping, so we’re going to switch analogies here. Say that you want to test whether singing “Hakuna Matata” during your morning shower makes your evening run go better. You decide to test three different measures of “better”: the distance you ran, subjective self-report of how you felt afterward, and the number of times you had to stop for a rest. While nothing of interest comes up for the first two measures, you find that you had to rest less on Hakuna-Matata days than on control days, p < 0.05. Success!
There are more, but let’s just stick to these three. Each of these tricks works by essentially giving you multiple opportunities to find significance, even if there is no actual result there at all. Every analysis of your data during data collection to see if it’s significant yet, every subgroup that you draw, and every extra dependent variable you measure gives you an extra shot on goal. This is problematic; the whole point of significance testing is that, if you only get one shot, you should get p < 0.05 with random data only 5% of the time, so that when you do achieve significance, this is a sign that there might be some real effect. But by utilizing all three of these techniques, you can achieve significance from random data 31% of the time (details here).
So you could have almost a one-in-three chance of finding significance out of thin air.
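To see how extra shots on goal inflate the false-positive rate, here is a rough simulation sketch. It does not reproduce the 31% figure above, which comes from combining all three techniques. It covers two simpler cases: several independent measures, and peeking at a fair coin every ten flips instead of testing once at the end. All names, the 100-flip plan, the peeking schedule, and the seed are assumptions chosen for illustration.

```python
import random
from math import comb

def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests on pure noise hits p < alpha."""
    return 1 - (1 - alpha) ** num_tests

def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value for a fair coin: P(a result at least this lopsided)."""
    extreme = max(heads, flips - heads)
    tail = sum(comb(flips, k) for k in range(extreme, flips + 1)) / 2 ** flips
    return min(1.0, 2 * tail)

def simulate(num_experiments: int = 2000, max_flips: int = 100,
             peek_every: int = 10, alpha: float = 0.05, seed: int = 0):
    """Return (false-positive rate testing once at the end, rate when peeking).

    Every simulated coin is fair, so every "significant" result is a false positive.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(num_experiments):
        heads = 0
        hit_at_some_peek = False
        for flip in range(1, max_flips + 1):
            heads += rng.random() < 0.5
            # The p-hacker checks for significance at every checkpoint along the way.
            if flip % peek_every == 0 and two_sided_p_value(heads, flip) < alpha:
                hit_at_some_peek = True
        # The honest analyst tests exactly once, after all planned flips.
        hit_at_end = two_sided_p_value(heads, max_flips) < alpha
        fixed_hits += hit_at_end
        peeking_hits += hit_at_some_peek or hit_at_end
    return fixed_hits / num_experiments, peeking_hits / num_experiments

print(round(chance_of_any_false_positive(3), 3))  # 0.143
fixed_rate, peeking_rate = simulate()
print(f"test once at the end: {fixed_rate:.3f}, stop at first significant peek: {peeking_rate:.3f}")
```

Three independent measures alone push the chance of a spurious hit from 5% to about 14%. In the simulation, the peeking rate can never come out below the test-once rate, since the final look is included in both, and it should land well above 5% even though every coin is fair.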
What does this kind of data massaging look like in real life? The best example of this I could find comes from a clinical trial in which the drug celecoxib (a.k.a. Celebrex) was tested against two other drugs (active controls) to see if it would result in fewer gastrointestinal side effects, specifically complicated ulcers. As you’ll see, this trial has the not-so-honorable distinction of showcasing all three of the p-hacking techniques introduced above.
In this trial, treatment was given to 8,000 patients; 4,000 patients received celecoxib, and the other 4,000 received one of the two other active controls. The patients were followed for 12 months, and any complicated ulcers that developed in the patients over that time period were recorded. A p-value could then be calculated, which would tell us whether or not there was a statistically significant difference between celecoxib and the active controls. Seems simple enough, right?
The results were dismal. Over the 12-month period, over all the patients, there was not even close to a statistically significant difference between celecoxib and the active controls (p = 0.450). So the researchers cut their losses and moved on to…
Wait, no, P-HACKING TO THE RESCUE!
Some of the patients were taking aspirin at the same time, and some weren’t, so what if we just divide them into two subgroups and test for statistical significance again? I know that we didn’t say anything about doing subgroup analysis beforehand, but I mean if you just look at non-aspirin users, we get closer to statistical significance: p = 0.182. Not quite there yet, but getting closer…
I know! Instead of taking the full year, we can just look at the first six months. Yeah, yeah, the full trial was 12 months so we’re supposed to just report the data for that, but if you look at non-aspirin users over the first six months, there is a difference between celecoxib and the other drugs, with p = 0.037. STATISTICAL SIGNIFICANCE!
Even better, we can add another measure. If we add together complicated and uncomplicated ulcers–even though the FDA was very clear with us that they only cared about complicated ulcers–for non-aspirin users over 12 months, we fly through statistical significance with p = 0.020. We’ll just publish these lower p-values and stay quiet about that whole p = 0.450 thing.
…And that is how you p-hack your way to showing that your drug lowers incidence of gastrointestinal complications by half compared to active controls, while the original trial as planned didn’t show jack shit. What’s more, this wouldn’t even have been known if somebody hadn’t looked through the unpublished full trials on the FDA website.
Meanwhile, doctors using these kinds of studies can be misled into prescribing inferior drugs to their patients. These things matter.
But anyway, anyone can come up with one p-hacked study to prove their point. How often does p-hacking occur in the sciences at large?
Well, why don’t we ask the scientists themselves? John et al. surveyed over 2,000 research psychologists for self-reports of ten questionable research practices. While psychology isn’t exactly the best standard-bearer for all of science, the results obtained were shocking even as an upper bound.
The self-admitted rate of “deciding whether to collect more data after looking to see if the results were significant” (p-hacking technique #1)? Fifty-six percent. The rate of “failing to report all of a study’s dependent measures” (p-hacking technique #3)? Sixty-three percent. The one silver lining is that outright falsification of data was only self-reported at 0.6%. Note that these are, if anything, underestimates due to social desirability bias.
You might hope that these questionable research practices would be caught before publication; this is, in part, what the process of peer review is for. But these p-hacking techniques can be very hard to catch; how do you tell if someone decided to stop data collection early, if they never wrote down beforehand how much data they were planning to collect? As we’ll see next, getting the stamp of peer review is no guarantee of study validity.
3 Notation: “P(A|B)” means “the probability of event A, given condition B.”