Meta-Science 101, Part 7: The replication crisis

“Your work never gets replicated, but if it was, the replication might not have confirmed your finding.”

In the last section, I ended with a graph showing that higher-impact-factor journals have more retractions.  But it’s not just retracted papers we have to worry about; a large percentage of still-out-there-and-published research could have huge problems that undermine the legitimacy of its claims.

And by large percentage, I mean a large percentage.  Fun fact alert!  Did you know that the most widely-cited paper ever published in the journal PLOS Medicine claims that “most published research findings are false”?

In his now-famous 2005 paper, Stanford professor John Ioannidis–yes, this is the guy from the intro–models the probability that a single statistically significant research finding is true with a mathematical formula.  He takes into account three different variables: the pre-study probability that whatever is being tested is true, the bias of the researchers (modeled as the percentage of findings that would not otherwise be statistically significant, and yet get p-hacked into statistical significance), and the likelihood of rejecting the null hypothesis if whatever is being tested is true (a.k.a. the statistical power of the study). [8]  You can look at the paper yourselves to follow the math, but the upshot is this: It is pretty difficult to find combinations of the three variables that push the post-study probability above 50%.  It takes a combination of relatively high pre-study odds, low bias, and high statistical power to attain that–a combination that is relatively rare in modern science.
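For anyone who wants to see the formula without opening the paper, here is a transcription of it as best I can reproduce it, so check it against the original before relying on it.  R is the pre-study odds (the pre-study probability converted to odds), u is the bias, and 1 − β is the statistical power.  The formula also needs the significance threshold α, which sits alongside the three variables above.  The symbol p for the pre-study probability is my own label, not the paper's.

```latex
R = \frac{p}{1 - p}, \qquad
\mathrm{PPV} = \frac{(1 - \beta)R + u\beta R}{R + \alpha - \beta R + u - u\alpha + u\beta R}
```

With no bias (u = 0), this collapses to PPV = (1 − β)R / (R − βR + α).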

Let’s concretize this with an example.  Say that you’re a genomics researcher testing one particular gene to see if it is associated with schizophrenia.  If only ten out of 100,000 genes are actually associated with schizophrenia, then your pre-study probability is 10/100,000 = 0.01%.  You perform a well-powered study (power = 80%), and you harbor a moderate amount of bias (30% of your findings that would otherwise miss statistical significance get p-hacked into it).  After interpreting the results, you find an association with p = 0.01!  Hurrah!  What are the chances your finding is true?  Well…

*Mashes calculator buttons*

…Uh, just a paltry 0.03%.

Hmm, OK.  Clearly you’ve sinned against the Science Gods by being a biased individual.  What if you’d conducted the same analysis and gotten the same results with zero bias?

Well then the probability of your results being correct would be…0.8%.

This isn’t nothing.  With the evidence from your study, you’ve multiplied your pre-study probability by 80.  But your pre-study probability was so low that it’d take a massive amount of evidence to make a convincing case that you’ve luckily narrowed in on one of the 10 genes associated with schizophrenia.  So even a meticulously well-done study can be a slave to overwhelmingly low pre-study probability. [9]

Bias can also ruin the probability-of-truth of an otherwise well-done study.  Say you have a hypothesis that initially has a 1-in-4 chance of being correct.  After you conduct a well-powered (80% power), bias-free study with a statistically significant finding at p = 0.01, the post-study probability of truth is a whopping 96%.  This is science how it should be: Start with uncertainty, conduct a well-done study, find statistically significant results, and then make a valid conclusion that you can trust.  But inject a large amount of bias (50% of findings p-hacked), and this post-study probability drops down to 37%–not much higher than the 25% you started with.
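If you'd rather not mash calculator buttons yourself, the numbers in the two examples above can be reproduced with a short script.  This is a minimal sketch, not code from the paper: the function name and argument names are made up for illustration, and it assumes that the p = 0.01 in both examples doubles as the significance threshold α, which is the value that makes the arithmetic come out to the figures quoted here.

```python
def post_study_probability(prior, power, alpha, bias=0.0):
    """Probability that a statistically significant finding is true.

    prior: pre-study probability that the tested relationship is real
    power: chance of a significant result when the relationship is real
    alpha: significance threshold (assumed here to be 0.01)
    bias:  share of would-be non-significant analyses that get reported
           as significant anyway (the "u" in the formula)
    """
    r = prior / (1.0 - prior)   # pre-study odds R
    beta = 1.0 - power          # type II error rate
    true_positives = power * r + bias * beta * r
    all_positives = r + alpha - beta * r + bias - bias * alpha + bias * beta * r
    return true_positives / all_positives


# Genomics example: 10 real associations among 100,000 genes.
gene_prior = 10 / 100_000
gene_biased = post_study_probability(gene_prior, power=0.8, alpha=0.01, bias=0.3)
gene_unbiased = post_study_probability(gene_prior, power=0.8, alpha=0.01)

# The hypothesis with a 1-in-4 chance of being correct.
fair_unbiased = post_study_probability(0.25, power=0.8, alpha=0.01)
fair_biased = post_study_probability(0.25, power=0.8, alpha=0.01, bias=0.5)

for label, value in [
    ("gene, 30% bias", gene_biased),
    ("gene, no bias", gene_unbiased),
    ("1-in-4, no bias", fair_unbiased),
    ("1-in-4, 50% bias", fair_biased),
]:
    print(f"{label}: {value:.2%}")
```

The four printed values round to the 0.03%, 0.8%, 96%, and 37% quoted in the text.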

The third variable, statistical power, isn’t something you can rely on, either.  My two examples above generously assumed 80% power, but the power in the social sciences on average has been something like 24% for 60 years now, despite calls for improvement.
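To see what power does on its own, here is a rough sketch that holds everything else fixed and only varies the power.  It uses the bias-free case written as plain Bayes' rule, which is algebraically the same as the zero-bias version of the formula.  The function name is hypothetical, and the 1-in-4 prior and 0.01 threshold are simply carried over from the example above as assumptions.

```python
def probability_true(prior, power, alpha):
    """Bias-free chance that a significant finding is true (Bayes' rule)."""
    true_hits = power * prior            # real effects that reach significance
    false_hits = alpha * (1.0 - prior)   # null effects that reach significance anyway
    return true_hits / (true_hits + false_hits)


prior = 0.25   # the 1-in-4 hypothesis from the earlier example
alpha = 0.01   # assumed significance threshold
for power in (0.80, 0.50, 0.24):
    print(f"power {power:.0%}: {probability_true(prior, power, alpha):.1%}")
```

With these inputs, dropping from 80% power to 24% power takes the post-study probability from roughly 96% to roughly 89%, and that is before any bias enters the picture.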

Let’s return to you-as-genomics-researcher.  Imagine that, after you’ve published this study claiming that gene X is associated with schizophrenia, somebody else gets suspicious and decides to repeat your exact procedure.  Unlike you, they don’t find any association.  And of course they don’t, because your findings were only 0.8% likely to be correct in the first place, even under the least biased of circumstances.  They publish, casting doubt on your original findings.  (Epilogue: You get into a huge fight with the replicator, starting with mudslinging editorials in Science magazine and ending with a gladiator-style fight-to-the-death.  Hey, it’s my hypothetical scenario, I get to do what I want.)

This is why replications of scientific studies are important.  They allow us to distinguish between studies that got positive results due to bias/chance alone (which Ioannidis’s paper showed was fairly common) and those that got positive results due to something real in the world.  If a study found a real relationship between two variables, presumably a second study should also find that same relationship; if it doesn’t, this suggests that perhaps the first result was spurious.  Even a single replication can be dramatically useful in reducing uncertainty about a scientific study.
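How much can one replication move the needle?  Here is a back-of-the-envelope sketch using ordinary Bayesian updating, which is my own illustration rather than anything taken from Ioannidis's paper.  It assumes the replication is independent of the original study, free of bias, run at 80% power, and judged against a 0.01 threshold.  The starting point is the 37% post-study probability from the heavily biased 1-in-4 example, and all names in the code are hypothetical.

```python
def update_after_replication(prob_before, replicated, power=0.8, alpha=0.01):
    """Update the probability a finding is true after one replication attempt.

    Assumes an independent, unbiased replication (an idealization).
    """
    odds_before = prob_before / (1.0 - prob_before)
    if replicated:
        # A significant result is power/alpha times likelier if the effect is real.
        likelihood_ratio = power / alpha
    else:
        # A non-significant result is (1 - power)/(1 - alpha) times as likely.
        likelihood_ratio = (1.0 - power) / (1.0 - alpha)
    odds_after = odds_before * likelihood_ratio
    return odds_after / (1.0 + odds_after)


start = 0.37  # post-study probability from the biased 1-in-4 example
after_success = update_after_replication(start, replicated=True)
after_failure = update_after_replication(start, replicated=False)

print(f"after a successful replication: {after_success:.1%}")
print(f"after a failed replication:     {after_failure:.1%}")
```

Under those assumptions, a single successful replication lifts 37% to about 98%, and a single failed one drops it to about 11%.  Real replications are rarely this clean, so treat the exact figures as illustrative.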

Now, in reality, there are people that go looking for genes associated with diseases; this is the field of genetic association research.  They do better than in my hypothetical situation, since they aren’t just picking genes at random out of a hat to investigate; presumably, they have some theoretical backing that points to one particular gene or another, raising the pre-study probability a decent amount.  But still, replications of genetic association studies often fail; this review of genetic association studies finds that only 6 out of 166 genetic associations could be consistently replicated.  This meta-analysis of 36 genetic associations reaches a similar conclusion, with only a modest correlation found between results of the first studies of a genetic association and the results of subsequent studies.

Other fields don’t fare much better.  In economics research, about a third to a half of papers are not reproducible.  In hematology and oncology, only 6 out of 53 “landmark” studies (11%) were successfully replicated.  And of course, you may be familiar with the hullabaloo surrounding the replication attempts of 100 psychology papers in 2015, which found that only 36% of the replication attempts reached statistical significance, compared to 97% of the original studies.

It goes without saying that this is a big problem.

Let’s not get too down on science, though.  What this tells us is not that we should distrust all science, but that we should think of science as an epistemic pyramid, with the trustworthiness of claims increasing as you climb to the top.  At the bottom, largest level of the pyramid, you have bad studies, which may be biased, p-hacked, and statistically underpowered.  According to Ioannidis, these studies are probably mostly wrong.

Fine.  We’ll just climb upward.

One level up, we have studies published in top journals.  But the replication failures that we just talked about were mostly on studies published in top journals, and we know that retraction rates are higher in top journals, as well.  So we can’t stop here.

Let’s keep climbing.

On the third level, we have meta-analyses and systematic reviews.  But we’re not quite safe here, either.  First, there’s always the problem of just doing meta-analyses on bad studies and concluding something false because of that.  Garbage in, garbage out, as they say.  And remember reboxetine?  Publication bias can result in even meta-analyses on good studies coming to flawed conclusions.  So while there are more diamonds up here, there are still a fair number of snakes as well.

Up, up, up we go.

Now we’re at the top level of the pyramid.  Ah, safe at last!  Here we find really good meta-analyses (and meta-meta-analyses), those that summarize high-quality, non-p-hacked studies, and that use statistical tools to test for publication bias and correct for it if present.  There can’t be any snakes up here, right? …Right?

Continue to Part 8: Snakes at the top >>>

[8] He also includes a variable to take into account publication bias, but we’ll leave that out for simplicity.

[9] Another way to have low pre-study odds would be to test a hypothesis that goes against scientific consensus; if you were to test the hypothesis that climate change wasn’t happening at all, then your pre-study odds would be pretty low, for example.  But if you were to find a statistically significant result and publish it, that would be exactly the kind of publication that you’d expect to make it into high-profile journals!  “NEW WELL-DONE STUDY OVERTURNS SCIENTIFIC CONSENSUS (p = 0.01)!”  The more surprising a result, the more likely it’ll get attention, and yet a more surprising result must have had lower odds to begin with–otherwise it wouldn’t be surprising–and we know lower pre-study odds lead to a lower probability of truth.  So here’s another reason why high-profile journals might have more retractions.

