(Comic taken from xkcd, here)
Part 1: An introduction
Science is in dire straits.
As you may know, a white man in a position of power has recently been criticizing science relentlessly, and it seems like that will continue to be the norm for at least the next four years. His actions seem informed by a worldview in which science is mostly false or useless, provoking strong reactions from scientists worldwide. Most troubling of all, he’s just getting started; there’s no telling what he’ll do next, emboldened by his newly acquired institutional power.
You all know who I’m talking about. That’s right: John Ioannidis.
Contrary to the misleadingly-worded intro paragraph, John Ioannidis is not someone out for the blood of scientists; rather, he’s a Stanford professor who’s played a huge role in bringing to light a multitude of problems entrenched in modern scientific practice, and he’s dedicated his career to figuring out how to fix these problems. And he’s not alone; Ioannidis is just one representative of a larger movement in the scientific community, which has become more self-critical and introspective in recent years. If you’ve heard talk of “p-hacking” or “the replication crisis” recently, this is why.
This post is an attempt to synthesize all the problems in science that have surfaced as a result of this scientific self-reflection. If we can call this movement meta-science, then welcome to Meta-Science 101: a whirlwind tour through the biases and flaws that currently plague science, and the result of the past month or so of me diving into this topic.
Let’s start off with how science is supposed to work. You’re a scientist, and you have a hypothesis. You test the hypothesis with a well-designed experiment, and your results come back; these results give you an insight into how the world works. Without any fudging of data, you publish your result in a journal, and await the results of peer review. The scientific community judges whether or not your work is up to snuff, and the work gets published or not, accordingly. If your work is published, it gets replicated by another lab, confirming your original result. Rinse and repeat across millions of scientists practicing worldwide, shower in the output of true knowledge generated.
Now let’s take a look at how science too often actually works. You’re a scientist, and you have a hypothesis. You really like said hypothesis, so you design an experiment to test it. The results come back, but they’re not super clear, so you interpret them in various different ways until you find a positive, statistically significant result in favor of some variant of your original hypothesis. Researchers in your field receive your manuscript for peer review, and make semi-arbitrary recommendations to the editor. In the end, you publish, so your boss is pleased, and more importantly the grant committee funding your boss’s proposals is pleased, guaranteeing you funding for a while. Phew! Good thing you got that positive result.
Meanwhile, alternate-universe-you that didn’t find a statistically significant result doesn’t publish. The results sit in a file drawer.
But anyway, this-universe-you is happy that you published. Your work never gets replicated, but if it were, the replication might not confirm your finding.
…There’s a ton of stuff packed into those last three paragraphs, so we’ll spend the rest of the post elaborating point by point. First up, confirmation bias.
Part 2: Confirming confirmation bias
“You’re a scientist, and you have a hypothesis. You really like said hypothesis…”
Believe it or not, simply liking a hypothesis is enough to potentially bias the results of a study toward your favored hypothesis. This is a well-studied phenomenon known as researcher allegiance bias, and has most commonly been associated with the field of psychotherapy.
In psychotherapy, there are multiple different ways of treating a patient with a mental disorder. For example, for treating somebody with a fear of spiders, you could tell the patient to identify their negative thoughts surrounding spiders, find those thoughts to be irrational, and replace them with more realistic thoughts. This would be cognitive therapy. Alternatively, you could just show them pictures of spiders until they’re comfortable with that, and then gradually work through scarier and scarier things until you’re dumping buckets of spiders over their heads. This would be (a caricature of) systematic desensitization.
Naturally, different researchers have different favorite techniques, and multiple studies have been done comparing these techniques against each other. Unfortunately, these studies were apparently hopelessly confounded with researcher allegiance bias, as this discussion of researcher allegiance bias puts it well:
“Among studies by investigators identified as favoring cognitive therapy, cognitive therapy emerged as superior; correspondingly, systematic desensitization appeared the better treatment among studies by investigators classified as having an allegiance to systematic desensitization.
What made this pattern especially striking was that the analysis involved comparisons between the same two types of therapy, with the allegiance of the researchers as the only factor known to differ consistently between the two sets of studies.”
(The original meta-analysis that this discussion refers to can be found here).
It’s not a good thing when the theoretical inclinations of a researcher can reliably predict the outcome of a study. Remember that whole thing about scientific results supposedly being a reflection of how the world works? Yeahhhhhh.
When I said that researcher allegiance bias was a well-studied phenomenon, I meant it. The above-mentioned meta-analysis (a systematic overview of primary studies) that found researcher allegiance bias is just one of dozens of meta-analyses done on the topic. So what does one do when one has dozens of meta-analyses? That’s right: a meta-meta-analysis! In 2013, Munder et al. conducted a meta-analysis of 30 different meta-analyses and confirmed that there was a “substantial and robust” association between researcher allegiance and study outcome.
But here’s where it gets crazy. Munder et al. also found that meta-analyses whose conductors were in favor of the researcher allegiance bias hypothesis–that is, the hypothesis that researcher allegiance is associated with study outcomes–found a greater association between researcher allegiance and study outcomes. In other words, the meta-analyses on researcher allegiance bias themselves were confounded by researcher allegiance bias.1
Note that researcher allegiance bias doesn’t necessarily have to involve conscious intent to manipulate data in favor of your own favorite psychotherapy treatment. More likely, subtle things like how the researcher designs the competing treatment protocols, how the researcher trains the therapists that will actually be carrying out the treatments, etc. are operative here. But this just makes the problem of researcher allegiance bias even scarier; what we have to do battle with is not bad actors, but rather fundamental aspects of human psychology. There have been a number of suggestions on how to moderate the effects of researcher allegiance bias (the same source I quoted above has a good discussion at the end), but I won’t talk about them here, as this blog post is already going to be long enough without addressing fixes of science as well.
Being biased towards one hypothesis over another doesn’t just play itself out in the phenomenon of researcher allegiance bias, however. Perhaps even more powerful than personal inclination is financial interest; when you have a direct financial stake in seeing the results of your study go one way rather than another, this can have a strong biasing effect.
The most well-researched example of this involves comparing industry-funded clinical trials to independently-funded trials. If financial interests play a role in biasing research results, we would expect industry-funded trials to show more positive results for the industry sponsor’s drugs than independently-funded trials. This would be particularly real-world relevant if true, since drug and device companies now fund six times more clinical trials than the federal government.
Since the last time we looked at a meta-meta-analysis went so well, why don’t we do it again?
This meta-meta-analysis, published by the extremely well-regarded independent organization Cochrane in 2012, looked at 48 different papers, each of which itself compared industry-funded studies to non-industry-funded studies; the total number of primary studies encompassed by this review was nearly 10,000. The authors concluded that industry-funded studies were 32% more likely to find that the drug tested was effective, 87% more likely to find that the drug wasn’t actively harmful, and 31% more likely to come to an overall favorable conclusion for the drug. These results were more or less in line with several previous meta-meta-analyses done on this topic (yes, there have been several).
Like with researcher allegiance bias, industry sponsorship bias seems to often be instantiated via study design. For example, this can be done with more frequent testing against placebos than against active controls, resulting in an easier bar to clear for a drug to be considered “effective” by the study, or by using lower doses to mask adverse effects of the drug. Whether or not these are conscious study design choices to boost the desirability of a drug I’ll leave up to the reader to decide; the bottom line is that, regardless, we know that industry funding introduces a real bias that ends up affecting the results of a study.
Part 3: P-hacking your way to publication
“You interpret [your data] in various different ways until you find a positive, statistically significant result in favor of some variant of your original hypothesis.”
First, we’re going to get a sense of what “statistically significant” means. Imagine you flip a coin five times in a row. You’re doing this because you suspect that the coin is biased, and you want to test to see if that’s the case. The hypothesis that the coin is fair is your null hypothesis. What you’re looking for is the probability that the null hypothesis is correct; this is defined as the p-value. The lower the p-value is, the more license you have to reject the null hypothesis, and therefore accept an alternative hypothesis–in this case, that the coin is biased. If you get a result that’s p < 0.05 (the typical threshold for statistical significance) then this means there is a < 5% chance that the coin is fair.
If you read that paragraph and nodded along, because based on what you remember from your intro statistics class that seems to be an accurate description of a p-value, then you’re in the majority.2
But I lied. That’s actually not what a p-value is.
A p-value measures the probability of getting data at least as extreme as the data received given that the null hypothesis is true. If I flipped 5 heads in a row, then the probability of getting that result, if the coin was fair, would be p = (½)^5 ≈ 0.03. But this does not mean that there’s only a 3% chance that the coin was fair.
The p-value measures the probability of getting the data we got (or more extreme data), given the null hypothesis being true. But the measure that we care about more, and the measure that the p-value too often gets interpreted as, is the probability of the null hypothesis being true, given the data.
We care about P(null hypothesis|data).3 The p-value gives us P(data|null hypothesis). These two quantities are not equivalent: P(A|B) ≠ P(B|A). The probability of you owning a deck of cards given that you’re a magician (probably pretty high) is not the same as the probability of you being a magician given that you own a deck of cards (probably pretty low). This is a subtle but important distinction.
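To make the distinction concrete, here is a minimal Python sketch of the coin example. The p-value part follows directly from the text; the Bayesian part needs extra assumptions that are mine, not part of the original argument: a made-up prior that 99% of coins are fair, and a single made-up alternative in which a biased coin lands heads 90% of the time. The function names are hypothetical too.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` flips | the coin is fair)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips


def posterior_fair(heads: int, flips: int,
                   prior_fair: float, biased_p_heads: float) -> float:
    """P(coin is fair | data), via Bayes' rule with two candidate hypotheses.

    `prior_fair` and `biased_p_heads` are illustrative assumptions.
    """
    like_fair = comb(flips, heads) * 0.5 ** flips
    like_biased = (comb(flips, heads)
                   * biased_p_heads ** heads
                   * (1 - biased_p_heads) ** (flips - heads))
    evidence = prior_fair * like_fair + (1 - prior_fair) * like_biased
    return prior_fair * like_fair / evidence


# P(data | null): five heads in five flips of a fair coin.
p_value = one_sided_p_value(heads=5, flips=5)

# P(null | data): depends on the prior, which the p-value never sees.
posterior = posterior_fair(heads=5, flips=5,
                           prior_fair=0.99, biased_p_heads=0.9)

print(f"p-value          = {p_value:.3f}")
print(f"P(fair | 5 heads) = {posterior:.3f}")
```

Under these assumed numbers the p-value is about 0.03, yet the probability that the coin is fair stays above 80%, because biased coins were assumed to be rare to begin with. Change the prior and the second number moves while the first stays put.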
This doesn’t mean that p-values are worthless; P(A|B) and P(B|A) are related to each other. But it does mean that p-values can’t give us the full story, which is one of the reasons why scientists are talking about moving away from the use of p-values. One journal has banned p-values altogether. Unfortunately, change in scientific communities often happens slowly; the dangers of overusing p-values have been talked about for at least 30 years. For now, we’re stuck with the reality of journals often using p-values as easy metrics to discriminate between publish-worthy and non-publish-worthy potential papers.
Driven by this reality, up-and-coming scientists sometimes try to interpret their data favorably to ensure that their p-value gets below the acceptable p < 0.05 “bright line” for publication. This is a phenomenon known as p-hacking. There are a number of different ways that scientists can do this:
- Stop data collection whenever you hit significance. You initially plan to flip your coin ten times. You flip your coin five times, get five heads, and then stop flipping because you worry that if you flip the coin more times you’ll start getting tails. Besides, five heads in a row is enough to get statistical significance. Coin is biased, p = 0.03!
- Subgroup analysis: Post-data collection, divide your results into different subgroups to see if any one of them comes up significant. You flip a coin ten times, but sporadically over separate days, and get five heads and five tails. While this may seem like the coin isn’t biased, that’s only because you aren’t looking hard enough. You look more closely at the data and notice that all five heads came on either a Monday or Wednesday. That can’t be a coincidence. Coin is biased Mondays and Wednesdays, p = 0.03! (Saved for a separate publication: coin is biased in the opposite direction on Tuesdays, Thursdays, Fridays, Saturdays, and Sundays, p = 0.03!)
- Collect data for many different measures, but only report the ones that are significant. This one is hard to explain using coin-flipping, so we’re going to switch analogies here. Say that you want to test whether singing “Hakuna Matata” during your morning shower makes your evening run go better. You decide to test three different measures of “better”: the distance you ran, subjective self-report of how you felt afterward, and the number of times you had to stop for a rest. While nothing of interest comes up for the first two measures, you find that you had to rest less on Hakuna-Matata days than on control days, p < 0.05. Success!
There are more, but let’s just stick to these three. Each of these tricks works by essentially giving you multiple opportunities to find significance, even if there is no actual result there at all. Every analysis of your data during data collection to see if it’s significant yet, every subgroup that you draw, and every extra dependent variable you measure gives you an extra shot on goal. This is problematic; the whole point of significance testing is that, if you only get one shot, you should get p < 0.05 with random data only 5% of the time, so that when you do achieve significance, this is a sign that there might be some real effect. But by utilizing all three of these techniques, you can achieve significance from random data 31% of the time (details here).
So you could have almost a one-in-three chance of finding significance out of thin air.
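If you want to watch the "extra shots on goal" effect happen, here is a rough simulation sketch in Python. It is not the analysis behind the 31% figure; it is a toy setup of my own, with these assumptions: every measure is pure noise (standard normal, true mean zero), significance is checked with a simple two-sided z-test that treats the standard deviation as known, the "hacked" study peeks once at 20 data points and again after adding 10 more, and it tries three independent outcome measures. All function names and parameters are hypothetical.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05  # the usual "bright line"


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known std. dev. of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def honest_study(rng, n=20):
    """One measure, one pre-planned sample size, one test."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    return z_test_p(sample) < ALPHA


def hacked_study(rng, n=20, extra=10, measures=3):
    """Several measures, each tested twice (before and after adding data)."""
    for _ in range(measures):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        if z_test_p(sample) < ALPHA:       # peek at the data early
            return True
        sample += [rng.gauss(0, 1) for _ in range(extra)]
        if z_test_p(sample) < ALPHA:       # not significant? collect more
            return True
    return False                           # nothing "worked"


def false_positive_rate(study, trials=10_000, seed=0):
    """Fraction of null-effect studies that still reach p < ALPHA."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(trials)) / trials


honest_rate = false_positive_rate(honest_study)
hacked_rate = false_positive_rate(hacked_study)
print(f"honest false-positive rate: {honest_rate:.3f}")
print(f"hacked false-positive rate: {hacked_rate:.3f}")
```

The honest rate should land near 5%, as advertised. The hacked rate should come out several times higher even though there is no effect anywhere in the data, and the exact value depends entirely on how many peeks and measures you allow yourself.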
What does this kind of data massaging look like in real life? The best example of this I could find comes from a clinical trial in which the drug celecoxib (a.k.a. Celebrex) was tested against two other drugs (active controls) to see if it would result in fewer gastrointestinal side effects, specifically complicated ulcers. As you’ll see, this trial has the not-so-honorable distinction of showcasing all three of the p-hacking techniques introduced above.
In this trial, treatment was given to 8,000 patients; 4,000 patients received celecoxib, and the other 4,000 received one of the two other active controls. The patients were followed for 12 months, and any complicated ulcers that developed in the patients over that time period were recorded. A p-value could then be calculated, which would tell us whether or not there was a statistically significant difference between celecoxib and the active controls. Seems simple enough, right?
The results were dismal. Over the 12-month period, over all the patients, there was not even close to a statistically significant difference between celecoxib and the active controls (p = 0.450). So the researchers cut their losses and moved on to…
Wait, no, P-HACKING TO THE RESCUE!
Some of the patients were taking aspirin at the same time, and some weren’t, so what if we just divide them into two subgroups and test for statistical significance again? I know that we didn’t say anything about doing subgroup analysis beforehand, but I mean if you just look at non-aspirin users, we get closer to statistical significance: p = 0.182. Not quite there yet, but getting closer…
I know! Instead of taking the full year, we can just look at the first six months. Yeah, yeah, the full trial was 12 months so we’re supposed to just report the data for that, but if you look at non-aspirin users over the first six months, there is a difference between celecoxib and the other drugs, with p = 0.037. STATISTICAL SIGNIFICANCE!
Even better, we can add another measure. If we add together complicated and uncomplicated ulcers–even though the FDA was very clear with us that they only cared about complicated ulcers–for non-aspirin users over 12 months, we fly through statistical significance with p = 0.020. We’ll just publish these lower p-values and stay quiet about that whole p = 0.450 thing.
…And that is how you p-hack your way to showing that your drug lowers incidence of gastrointestinal complications by half compared to active controls, while the original trial as planned didn’t show jack shit. What’s more, this wouldn’t even have been known if somebody hadn’t looked through the unpublished full trials on the FDA website.
Meanwhile, doctors using these kinds of studies can be misled into prescribing inferior drugs to their patients. These things matter.
But anyway, anyone can come up with one p-hacked study to prove their point. How often does p-hacking occur in the sciences at large?
Well, why don’t we ask the scientists themselves? John et al. surveyed over 2,000 research psychologists for self-reports of ten questionable research practices. While psychology isn’t exactly the best standard-bearer for all of science, the results obtained were shocking even as an upper bound.
The self-admitted rate of “deciding whether to collect more data after looking to see if the results were significant” (p-hacking technique #1)? Fifty-six percent. The rate of “failing to report all of a study’s dependent measures” (p-hacking technique #3)? Sixty-three percent. The one silver lining is that outright falsification of data was only self-reported at 0.6%. Note that these are, if anything, underestimates due to social desirability bias.
You might hope that these questionable research practices would be caught before publication; this is, in part, what the process of peer review is for. But these p-hacking techniques can be very hard to catch; how do you tell if someone decided to stop data collection early, if they never wrote down beforehand how much data they were planning to collect? As we’ll see next, getting the stamp of peer review is no guarantee of study validity.
Part 4: Bias and randomness in the peer review process
“Researchers in your field receive your manuscript for peer review, and make semi-arbitrary recommendations to the editor.”
Peer review seems like a good idea. To publish in a certain journal, your work should pass a certain standard of quality associated with that journal. Peers in your field would seem like the most natural judges of your work, given that they have the expertise to do so. Peer review is also widely supported in the scientific community: as of 2008, 85% of scientists agree that peer review greatly aids in scientific communication, and 93% disagree that peer review is unnecessary.
Some in the media like to focus on peer review scandals: Sixty-four articles retracted after peer reviews found to be fake! Peer-review ring discovered in Taiwan, resulting in resignation of Taiwan cabinet minister! Chinese company found to be selling peer reviews!
But all this attention on outright fraud might obscure a perhaps more consequential question: Does peer review actually result in the selective publication of higher-quality studies?
The evidence bearing on this question is very limited. I was able to find two systematic reviews on the efficacy of peer review, the conclusions of which basically amounted to, “Uh, we don’t really know. More research is needed.” But while we don’t have much direct information on the effects of peer review on study quality, we do have some studies on other things that are relevant to that question.
Like reviewer agreement. If peer reviewers were able to reliably discern some measure of study quality in the papers they were reviewing, then you’d expect them to agree on their recommendations for papers often. But this isn’t actually what you see: A recent meta-analysis of 48 studies on reviewer agreement concluded that, overall, reviewers agreed only 17% more than would be predicted by chance alone. So while you might hope that your paper will be accepted or rejected solely based on its merits, you’ll also be contending with a significant amount of randomness.
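"Agreement beyond chance" is usually quantified with a chance-corrected statistic such as Cohen's kappa. I'm assuming here that the 17% figure is on that kind of scale, which the text above doesn't spell out. As a small sketch, with ten made-up papers and two made-up reviewers (none of this is data from the meta-analysis), kappa can be computed like this:

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items.

    kappa = (observed - expected) / (1 - expected), where `expected` is the
    agreement you'd get if both raters kept their own accept/reject rates
    but assigned verdicts independently of the paper.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on ten manuscripts (illustrative only).
reviewer_a = ["accept"] * 5 + ["reject"] * 5
reviewer_b = ["accept", "accept", "accept", "reject", "reject",
              "accept", "accept", "reject", "reject", "reject"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(f"kappa = {kappa:.2f}")
```

In this toy example the reviewers agree on 6 of 10 papers, but two coin-flippers with the same accept rates would agree on 5 of 10, so kappa is only 0.2. A raw agreement number can look respectable while the chance-corrected one stays close to zero.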
…And bias. In a now-famous 1982 study, Peters & Ceci took 12 already-published psychology papers written by authors from prestigious psychology departments, replaced the original authors’ names and institutions with fictitious ones (e.g., “Dr. Wade M. Johnston” from the “Tri-Valley Center for Human Potential”), and re-submitted them to the same journals. As you might expect, some of the editors noticed something fishy going on; three of the papers were rejected because they were just resubmissions. For the remaining nine, you would expect a high acceptance rate if peer reviewers were judging on quality and not on authors’ institutions, given that all of these papers were accepted the first time around. But this time, 8 out of 9 were rejected! Apparently, the reviewers found serious problems with the resubmissions. Said one reviewer: “It is all very confusing…I think the entire presentation of the results needs to be planned more carefully and organized.” Said another: “It is not clear what the results of this study demonstrate…mainly because of several methodological defects in the design of the study.” And finally: “Apparently, this is intended to be a summary. However, the style of writing leaves much to be desired in terms of communicating to the reader.”
Out of the 16 reviewers on the 8 rejected papers, all 16 recommended against publication. Remember, reviewer agreement is supposed to be not much higher than would be predicted by chance! But apparently, at least in this one study, bias against non-prestigious authors and institutions was so strong that it outweighed the inherent randomness in peer review. (To be fair, Dr. Wade M. Johnston from the Tri-Valley Center for Human Potential does sound pretty sketchy.)
One neat way to uncover bias is to do a comparison of blinded and open peer reviews. We can draw an analogy here with the Pepsi Challenge. The Pepsi Challenge blinds participants so they don’t know which brands of soda they’re drinking, asks them to drink two sodas (Pepsi and Coke), and then has them say which soda they preferred. Presumably, participant preference should come down to taste alone. In an open Pepsi Challenge, where participants know which soda they’re drinking beforehand, brand affiliation can bias the results one way or another. If you run a blinded Pepsi Challenge and find that people generally prefer Pepsi, and then run an open Pepsi Challenge and find that people generally prefer Coke, then you can infer some sort of anti-Pepsi bias.
Similarly, if you find in a study that a non-US author’s abstract to an American Heart Association meeting is 22% less likely to be accepted in an open peer review than in a blind peer review, you can attribute that 22% to bias against non-US authors.4 …You can probably guess that this was a real study.
You might object that the quality of papers written by authors outside and inside the US could differ, but remember that that difference is already taken care of by this comparison of blind and open peer review. You couldn’t dispute the hypothetical finding of anti-Pepsi bias above by saying, “Well, maybe Coke does taste better,” because we already know from the blinded Pepsi Challenge that people preferred Pepsi.
A bias of 22% might seem modest, but remember that this is an effect due solely to your address, which takes up a single line in your abstract. If you could write a single line in your abstract that would make it 22% more likely for your abstract to be accepted, you would write that line every time. And now we know that line ends in “…USA.”5
Man, you can feel bad for Dr. Wade M. Johnston all you want, but at least that guy is American.6
This doesn’t mean that everything would be fine and dandy if we just started instituting double-blinded peer reviews, where author information wouldn’t be known to the reviewers. Bias can be content-based, as well. Remember researcher allegiance bias? Well, we’re about to see Researcher Allegiance Bias, 2.0: Peer-Review Edition.
In this pioneering study, 75 peer reviewers reviewed manuscripts with identical methodologies, but with results tweaked to be either “positive” (i.e., consistent with the reviewer’s perspective) or “negative” (i.e., contradicting the reviewer’s perspective). The reviewers were then asked to give their recommendation for the manuscript they reviewed. If you’ve made it this far, you’ve probably become cynical enough to guess what the study found. Per the authors:
“Identical manuscripts suffered very different fates depending on the direction of their data. When they were positive, the usual recommendation was to accept with moderate revisions. Negative results earned a significantly lower evaluation, with the average reviewer urging either rejection or major revision.”
Interestingly enough, when the reviewers were asked to rate just the methodology section on a 6-point scale, manuscripts with “positive” results received an average rating of 4.2, while manuscripts with “negative” results received an average rating of 2.4–an absolute difference of 30%. Remember, the methodologies were identical.
You can imagine how this sort of confirmation bias can quickly lead to the stagnation of a field. Once a scientific paradigm has been established, then studies confirming the paradigm will always have an advantage in publication compared to studies contradicting it. These confirmatory studies will keep piling up, while researchers deviating from the consensus will have trouble publishing, and eventually lose their funding and/or status, even if their studies were just as well-conducted. Most of the time, the prevailing paradigm is pretty much correct, since otherwise the paradigm wouldn’t have become widely accepted in the first place. But if it’s not, we need to be able to have contradictory results see the light of day to let us know that. We need the data to be able to slap us in the face. A slap in the face is pretty hard to feel when you’re wearing a helmet called Confirmation Bias.
So to wrap up this section, we started off with the question of whether or not the peer review process improved average study quality. While there’s not much direct evidence on that question–a fact that is itself troubling, given how central the peer review process is to current scientific practice–we found that the peer review process is riddled with randomness and bias of all sorts. In order for peer review to improve study quality, its algorithm needs to hew as closely as possible to, “Publish high-quality studies, reject low-quality studies.” But right now, that algorithm is pretty corrupted.
Part 5: Academic funding and the pressure to publish
“In the end, you publish…guaranteeing you funding for a while.”
“Publish or perish” has become a sort of sad mantra for academics. All across the world, university faculty are feeling more and more pressure to push out publications. Without publications, you lose your funding. Without funding, you lose your ability to pay graduate students and postdocs. And without graduate students and postdocs, you lose your ability to do good science, which can lead to less publication output, starting the cycle over again. If you’re an established professor, then this means loss of status and personal fulfillment; if you’re pre-tenure, this means you’re probably out of a job soon. This is an outline of a career path that no academic wants to follow, so you can see why everyone in academia is so determined to publish, publish, publish.
But why does it feel like things have gotten worse recently? What’s led academia to this hypercompetitive state?
I came into writing this section thinking that a lack of funding was the reason. If there’s less money to go around, then in order to win grants, academics will have to publish more to separate themselves from the crowd, leading to the “publish or perish” mindset. In support of this view, federal science funding as a percentage of GDP has steadily declined for decades now.
But take a look at absolute university science funding over the past 40 years (inflation-adjusted):
(Graph taken from NSF website, here)
What you see is a gradual increase in funding up until 2011, when you start to see a decline. But the decline doesn’t look like that much in the grand scheme of things; it can perhaps more accurately be called a plateau. The fact that federal science funding, as a percentage of GDP, has been declining has more to do with the rise in GDP than with any sort of drastic cuts in science funding. So while plateauing science funding certainly doesn’t help matters, I don’t think it can fully account for the academic funding crisis we see.
Now look at the rise in grant applications to the NIH over the past 15 years:
(Figure taken from NIH website, here)
The number of applications has more than doubled since 1998 (going from 24,000 to 52,000), and the decline in success rate roughly tracks that, dropping from 31% in 1998 to 18% now. The situation at the NSF is the same: Grant applications have risen from 28,000 in 1998 to 50,000 now, and the success rate has correspondingly dropped from 33% to 24%.
So grant award rates have dropped despite an absolute increase in university science funding, due to the massive increase in grant applications. Putting on my economist hat, the current academic funding crisis is not a supply-side problem, it’s a demand-side one.
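You can sanity-check the demand-side story with back-of-the-envelope arithmetic: applications times success rate gives an implied number of funded grants. The sketch below uses only the rounded figures quoted above, so the outputs are rough implied counts, not official award numbers, and the helper name is my own.

```python
def implied_awards(applications: int, success_rate: float) -> int:
    """Approximate number of funded grants implied by the quoted figures."""
    return round(applications * success_rate)


# (applications, success rate) as quoted in the text; rounded values.
agencies = {
    "NIH": {"1998": (24_000, 0.31), "now": (52_000, 0.18)},
    "NSF": {"1998": (28_000, 0.33), "now": (50_000, 0.24)},
}

awards = {
    agency: {year: implied_awards(*figures) for year, figures in years.items()}
    for agency, years in agencies.items()
}

for agency, by_year in awards.items():
    print(f"{agency}: ~{by_year['1998']:,} awards in 1998, "
          f"~{by_year['now']:,} now")
```

On these numbers the implied count of funded grants goes up at both agencies (roughly 7,400 to 9,400 at the NIH and 9,200 to 12,000 at the NSF) even as the success rate falls. More grants are being handed out; there are just far more people asking.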
There are interesting ideas on causes and solutions, but for the purposes of this post, all we have to know is that it’s happening. More and more scientists are being trained, resulting in hypercompetition and the “publish or perish” mentality. What are the consequences?
Well, the academic funding crisis has a hand in pretty much every other problem in science covered in this post: researcher allegiance bias, p-hacking, the quality of peer review, and the yet-to-be-discussed publication bias and replication crisis. If researchers need publications to get funding, then they’ll be more likely to interpret their findings in a positive way, e.g. via p-hacking; this manifests itself in researcher allegiance bias. If the pressure to publish is high, peer reviewing becomes less and less of a priority; when PIs finally get around to doing a peer review, they go through it too quickly, letting unconscious biases creep in.
To be fair, this is a bit speculative; I don’t know of any study that explicitly links publication pressure to specifically any one of these things. But scientists themselves seem to be cognizant of the pressures and their likely consequences, and are speaking out against them. And the line between incentive and behavior here is direct enough that I’d be surprised if publication pressure weren’t affecting these things to some extent.
Plus, we do have this study, which showed that authors were more likely to report positive results in US states with a higher number of publications per-doctorate-holder-in-academia (which the authors used as a proxy measure for how much pressure there was to publish in a given state). If you’re an average academic researcher in Washington D.C., where you’re pressured to publish about 0.9 publications/year, then you’re about 4 times as likely to find a positive result in each publication than if you’re in North Dakota, where the standard is only 0.4 publications/year.
(Fig. 2 from the paper)
Of course, correlation doesn’t imply causation; you always have to be wary of confounding factors in correlational studies. The most obvious response to these findings is to say, “Well, duh. Researchers at more prestigious institutions publish more papers. And they’re better at science, so of course they’re going to find more positive results.”
The authors have two replies. Firstly, they note that, when they controlled for R&D expenditure–which you’d expect would be higher for better institutions–their finding didn’t go away. If anything, it got more statistically significant. This counts as evidence against the idea that institutional prestige is a confounding variable here.
Their second reply is to look at a state like Michigan, where the likelihood of finding a positive result was something like, um, 97%. Either Michigan researchers are inflating their results in some way, or they’re just really, really good at science. In which case, I’ll have what they’re having.
The study didn’t try to delve into what was behind this correlation, but the authors suspect a large part of it is due to scientists, under pressure to publish, reinterpreting negative results as positive ones à la p-hacking–not because the scientists involved have malicious intent, but rather because they just know there’s a positive result there; they just have to find the right interpretation and the data will show it. They also speculate that the correlation can be partly explained by the selective publication of positive studies over negative studies, also known as publication bias. Which brings me to my next trick…
Part 6: Publication bias
“Meanwhile, alternate-universe-you that didn’t find a statistically significant result doesn’t publish. The results sit in a file drawer.”
If you’re suffering from depression, have you asked your doctor about reboxetine?
It’s a great antidepressant. I mean, have you seen the data? This meta-analysis concluded that “reboxetine is significantly more effective than placebo in a subgroup of patients with severe depression.” Another meta-analysis showed the same. This one found that the risk of adverse events–side effects like dry mouth, insomnia, etc.–was pretty much the same between reboxetine and placebo.
And these aren’t meta-analyses of just any old studies; you could imagine that a meta-analysis of p-hacked studies infused with researcher allegiance bias could lead to absurd conclusions with tenuous connections to reality. No, these are meta-analyses of mostly double-blind, randomized, placebo-controlled trials (DBRCTs)–the gold standards of clinical trials. Medical doctors get excited over these, because they know that these are the most likely to dispense accurate, truthful information about which drugs work and which don’t.
In fantasy stories, you have Oracles that will truthfully answer any question you ask. In medical research, you have DBRCTs.
So there’s really robust evidence in the medical literature showing that reboxetine is a great treatment for depression, with few side effects. Seems like a pretty boring success story for a drug…
…Until this meta-analysis by Eyding et al. in 2010 had to come along and ruin everybody’s fun. It turns out that the medical literature on reboxetine was afflicted by publication bias; that is to say, positive trials on reboxetine were selectively published, while negative trials never saw the light of day. In the 13 methodologically sound trials that were analyzed, Eyding et al. found that data on 74% of patients were left unpublished. While the three meta-analyses I cited above–drawing mostly from published data–concluded that reboxetine was an effective treatment for depression, the combined published and unpublished data didn’t show a statistically significant difference between reboxetine and placebo. In addition, the combined published and unpublished data showed that patients on reboxetine reported more adverse events than those on placebo (p < 0.001).
The authors concluded that reboxetine was “an ineffective and potentially harmful antidepressant.” Meanwhile, reboxetine had already been on the market for 13 years and had been prescribed to God knows how many patients.
The story of reboxetine is a lesson in humility. If you thought that all you had to do to get a true answer to a research question was do a good study, then the story of reboxetine says, “Nope, even DBRCTs can be wrong.” If you then thought that all you had to do was compile all the DBRCTs in the literature and see what that compilation says, the story of reboxetine says, “Nope, publication bias, motherfucker.”
And it’s not just reboxetine. The medical literature in general has a publication bias problem; this narrative review finds evidence of publication bias in studies on drugs for bipolar disorder, schizophrenia, panic disorder, Alzheimer’s disease, coronary heart disease, HIV/AIDS, ovarian cancer, multiple myeloma, osteoarthritis…Overall, several analyses have found that biomedical studies with positive, statistically significant results are at least 2x as likely to be published as those with negative, non-significant results.
Publication bias is a big deal, because being able to do a systematic overview of the literature is currently the best way we have to assess the truth of scientific claims. Since individual studies can be biased, p-hacked, and/or statistically underpowered, we need to be able to pick out good studies, leave out bad ones, and combine them in a neat package called a meta-analysis that gives us an overview of what we know on a topic. But meta-analyses can only draw from the literature that’s currently published; you can only analyze what you see. If there’s a bunch of research out there stuck in file drawers that disproportionately disputes whatever the current literature suggests, we need to be able to capture that research in a meta-analysis.
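To make the file-drawer effect concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not data from any study cited here: the drug has zero true effect, each trial has 50 patients per arm, and a trial gets "published" only if it shows a significant result in the drug's favor (z > 1.96). The function names are hypothetical.

```python
import random
import statistics


def run_trial(rng, n_per_arm, true_effect):
    """Simulate one two-arm trial; return (effect estimate, z-score)."""
    drug = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
    placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    diff = statistics.fmean(drug) - statistics.fmean(placebo)
    std_err = (statistics.variance(drug) / n_per_arm
               + statistics.variance(placebo) / n_per_arm) ** 0.5
    return diff, diff / std_err


def simulate_literature(n_trials=1000, n_per_arm=50, true_effect=0.0, seed=0):
    """Run many trials; 'publish' only those significant in the drug's favor."""
    rng = random.Random(seed)
    trials = [run_trial(rng, n_per_arm, true_effect) for _ in range(n_trials)]
    all_effects = [diff for diff, _ in trials]
    published = [diff for diff, z in trials if z > 1.96]  # assumed filter
    return all_effects, published


if __name__ == "__main__":
    all_effects, published = simulate_literature()
    print(f"trials run:       {len(all_effects)}")
    print(f"trials published: {len(published)}")
    print(f"mean effect, all trials:       {statistics.fmean(all_effects):+.3f}")
    print(f"mean effect, published trials: {statistics.fmean(published):+.3f}")
```

A meta-analysis that can only see the published list would estimate a clearly positive effect for a drug that does nothing, while pooling every trial gives an estimate near zero. Real publication bias is less extreme than this all-or-nothing filter, so treat the gap as an upper bound on the idea rather than a realistic estimate.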
Now that we’ve learned about publication bias, as well as p-hacking, we can combine these two pieces of knowledge to make a prediction. If a) sketchy methodology is often used to get positive findings, and b) positive findings are not only more likely to get published at all, but also more likely to get published in higher-impact journals, then you might predict c) that higher-impact journals publish more studies with sketchy methodology, and therefore have more retractions. Now let’s see what the data suggest:
(Figure taken from Nature News & Comment, here)
This is troubling, since publications in higher-impact journals are, well, higher impact; they tend to have a lot of influence on the beliefs of scientists, as well as of the citizenry at large if the study is impactful enough. So getting these high-impact studies right is pretty important.
And one good way to make sure these high-impact studies are accurate would be to have scientists double-check each other by repeating each other’s experiments–in other words, to have scientists perform replications.
Part 7: The replication crisis
“Your work never gets replicated, but if it was, the replication might not have confirmed your finding.”
In the last section, I ended with a graph showing that higher-impact-factor journals have more retractions. But it’s not just retracted papers we have to worry about; a large percentage of still-out-there-and-published research could have huge problems that undermine the legitimacy of its claims.
And by large percentage, I mean a large percentage. Fun fact alert! Did you know that the most widely-cited paper ever published in the journal PLOS Medicine claims that “most published research findings are false”?
In his now-famous 2005 paper, Stanford professor John Ioannidis–yes, this is the guy from the intro–models the probability that a single statistically significant research finding is true with a mathematical formula. He takes into account three different variables: the pre-study probability that whatever is being tested is true, the bias of the researchers (modeled as the percentage of findings that would not otherwise be statistically significant, and yet get p-hacked into statistical significance), and the likelihood of rejecting the null hypothesis if whatever is being tested is true (a.k.a. the statistical power of the study).8 You can look at the paper yourselves to follow the math, but the upshot is this: It is pretty difficult to find combinations of the three variables that push the post-study probability above 50%. It takes a combination of relatively high pre-study odds, low bias, and high statistical power to attain that–a combination that is relatively rare in modern science.
Let’s concretize this with an example. Say that you’re a genomics researcher testing one particular gene to see if it is associated with schizophrenia. If only ten out of 100,000 genes are actually associated with schizophrenia, then your pre-study probability is 10/100,000 = 0.01%. You perform a well-powered study (power = 80%), and you harbor a moderate amount of bias (30% of your would-be non-significant findings get p-hacked into statistical significance). After interpreting the results, you find an association with p = 0.01! Hurrah! What are the chances your finding is true? Well…
*Mashes calculator buttons*
…Uh, just a paltry 0.03%.
Hmm, OK. Clearly you’ve sinned against the Science Gods by being a biased individual. What if you’d conducted the same analysis and gotten the same results with zero bias?
Well then the probability of your results being correct would be…0.8%.
This isn’t nothing. With the evidence from your study, you’ve multiplied your pre-study probability by 80. But your pre-study probability was so low that it’d take a massive amount of evidence to make a convincing case that you’ve luckily narrowed in on one of the 10 genes associated with schizophrenia. So even a meticulously well-done study can be a slave to overwhelmingly low pre-study probability.9
Bias can also ruin the probability-of-truth of an otherwise well-done study. Say you have a hypothesis that initially has a 1-in-4 chance of being correct. After you conduct a well-powered (80% power), bias-free study with a statistically significant finding at p = 0.01, the post-study probability of truth is a whopping 96%. This is science how it should be: Start with uncertainty, conduct a well-done study, find statistically significant results, and then make a valid conclusion that you can trust. But inject a large amount of bias (50% of findings p-hacked), and this post-study probability drops down to 37%–not much higher than the 25% you started with.
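For anyone who wants to mash the calculator buttons themselves, here is a short sketch of the calculation. It assumes the bias-adjusted formula from Ioannidis's paper as I understand it, with pre-study odds R, significance threshold α, false-negative rate β = 1 − power, and bias u; it also assumes α = 0.01 for the examples above. The function name is my own.

```python
def post_study_probability(prior, power, alpha, bias=0.0):
    """Probability that a statistically significant finding is true.

    prior: pre-study probability that the tested relationship is real
    power: chance of detecting the relationship if it is real (1 - beta)
    alpha: significance threshold (false-positive rate when bias is zero)
    bias:  fraction of would-be non-significant analyses that end up
           reported as significant anyway (u in Ioannidis's notation)
    """
    R = prior / (1.0 - prior)  # pre-study odds
    beta = 1.0 - power         # false-negative rate
    true_positives = power * R + bias * beta * R
    all_positives = (R + alpha - beta * R
                     + bias - bias * alpha + bias * beta * R)
    return true_positives / all_positives


if __name__ == "__main__":
    # Genomics example: 10 real genes out of 100,000, power 80%.
    print(f"{post_study_probability(0.0001, 0.8, 0.01, bias=0.3):.2%}")  # about 0.03%
    print(f"{post_study_probability(0.0001, 0.8, 0.01, bias=0.0):.2%}")  # about 0.8%
    # 1-in-4 hypothesis, power 80%.
    print(f"{post_study_probability(0.25, 0.8, 0.01, bias=0.0):.2%}")    # about 96%
    print(f"{post_study_probability(0.25, 0.8, 0.01, bias=0.5):.2%}")    # about 37%
```

With bias set to zero, the formula collapses to plain Bayes' theorem: true positives divided by true positives plus false positives.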
The third variable, statistical power, isn’t something you can rely on, either. My two examples above generously assumed 80% power, but the power in the social sciences on average has been something like 24% for 60 years now, despite calls for improvement.
Let’s return to you-as-genomics-researcher. Imagine that, after you’ve published this study claiming that gene X is associated with schizophrenia, somebody else gets suspicious and decides to repeat your exact procedure. Unlike you, they don’t find any association. And of course they don’t, because your findings were only 0.8% likely to be correct in the first place, even under the least biased of circumstances. They publish, casting doubt on your original findings. (Epilogue: You get into a huge fight with the replicator, starting with mudslinging editorials in Science magazine and ending with a gladiator-style fight-to-the-death. Hey, it’s my hypothetical scenario, I get to do what I want.)
This is why replications of scientific studies are important. They allow us to distinguish between studies that got positive results due to bias/chance alone (which Ioannidis’s paper showed was fairly common) and those that got positive results due to something real in the world. If a study found a real relationship between two variables, presumably a second study should also find that same relationship; if it doesn’t, this suggests that perhaps the first result was spurious. Even a single replication can be dramatically useful in shoring up uncertainty about a scientific study.
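Here is a rough sketch of why one replication moves the needle so much. It makes simplifying assumptions that are mine, not Ioannidis's: the replication is bias-free, has 80% power and a 0.01 significance threshold, and the first study's post-study probability is simply reused as the replication's starting probability.

```python
def update_on_replication(prior, significant, power=0.8, alpha=0.01):
    """Bayes update of P(finding is true) after one bias-free replication.

    prior:       probability the finding is true before the replication
    significant: True if the replication reached significance, else False
    """
    if significant:
        like_if_true, like_if_false = power, alpha
    else:
        like_if_true, like_if_false = 1.0 - power, 1.0 - alpha
    numerator = prior * like_if_true
    return numerator / (numerator + (1.0 - prior) * like_if_false)


if __name__ == "__main__":
    # Genomics example: about 0.8% after the original study, then a failed replication.
    print(f"{update_on_replication(0.008, significant=False):.2%}")
    # Biased 1-in-4 example: about 37% after the original study, then a successful one.
    print(f"{update_on_replication(0.37, significant=True):.2%}")
```

Under these assumptions, a failed replication knocks the already-shaky gene finding down several-fold, while a successful replication lifts the biased 37% finding to well above 90%.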
Now, in reality, there are people that go looking for genes associated with diseases; this is the field of genetic association research. They do better than in my hypothetical situation, since they aren’t just picking genes at random out of a hat to investigate; presumably, they have some theoretical backing that points to one particular gene or another, raising the pre-study probability a decent amount. But still, replications of genetic association studies often fail; this review of genetic association studies finds that only 6 out of 166 genetic associations could be consistently replicated. This meta-analysis of 36 genetic associations reaches a similar conclusion, with only a modest correlation found between results of the first studies of a genetic association and the results of subsequent studies.
Other fields don’t fare much better. In economics research, about a third to a half of papers are not reproducible. In hematology and oncology, only 6 out of 53 “landmark” studies (11%) were successfully replicated. And of course, you may be familiar with the hullabaloo surrounding the replication attempts of 100 psychology papers in 2015, which found that only 36% of the replication attempts reached statistical significance, compared to 97% of the original studies.
It goes without saying that this is a big problem.
Let’s not get too down on science, though. What this tells us is not that we should distrust all science, but that we should think of science as an epistemic pyramid, with the trustworthiness of claims increasing as you climb to the top. At the bottom, largest level of the pyramid, you have bad studies, which may be biased, p-hacked, and statistically underpowered. According to Ioannidis, these studies are probably mostly wrong.
Fine. We’ll just climb upward.
One level up, we have studies published in top journals. But the replication failures that we just talked about were mostly on studies published in top journals, and we know that retraction rates are higher in top journals, as well. So we can’t stop here.
Let’s keep climbing.
On the third level, we have meta-analyses and systematic reviews. But we’re not quite safe here, either. First, there’s always the problem of just doing meta-analyses on bad studies and concluding something false because of that. Garbage in, garbage out, as they say. And remember reboxetine? Publication bias can result in even meta-analyses on good studies coming to flawed conclusions. So while there are more diamonds up here, there are still a fair number of snakes as well.
Up, up, up we go.
Now we’re at the top level of the pyramid. Ah, safe at last! Here we find really good meta-analyses (and meta-meta-analyses), those that summarize high-quality, non-p-hacked studies, and that use statistical tools to test for publication bias and correct for it if present. There can’t be any snakes up here, right? …Right?
Part 8: Snakes at the top
Now that we’re at the top of the epistemic pyramid, let’s see what diamonds we can find up here. Here’s one systematic review that says smoking is causally related to lung cancer. Here’s another meta-analysis that fails to find a link between childhood vaccinations and the development of autism. And this meta-analysis shows that people can psychically detect the future via as-of-yet unknown nonphysical processes.
This story starts in 2011, when esteemed Cornell psychologist Daryl Bem published a paper in the Journal of Personality and Social Psychology that reported the results of nine experiments testing for the existence of “retroactive influence,” or the idea that events in the future can affect people physiologically or behaviorally in the present.
A priori, this idea seems absurd. Events in the future haven’t happened yet, so how could they affect the present? The causal arrow goes in only one temporal direction, and to find otherwise would seem to fly in the face of everything we know about physics.
But eight out of nine of Bem’s experiments found statistically significant evidence for “retroactive influence.”
Let’s get a sense of what these experiments entailed. In a standard psych experiment not testing for retroactive influence, a participant might be asked to practice rehearsing 25 randomly selected words out of a list of 50 words. Later, they would be asked to type as many of the words from the list as they could remember. Of course, as you might expect, they generally recall the words that they practiced rehearsing more readily.
Bem basically just time-reverses this protocol. The participants are still given the list of 50 words initially, but then are asked immediately in a surprise test to type as many of the 50 words as they can remember. Only after that are they asked to practice rehearsing 25 randomly selected words. What Bem found is that the participants were better at recalling words in the initial surprise test if they were going to practice rehearsing them in the future.
So saying words in the future apparently helps you remember them right now. Huh?
Bem’s paper included eight other experiments similar to this one, and like I said, eight out of nine found evidence of retroactive influence. Upon publication, the field of experimental psychology got all in a tizzy, because this felt like someone was making a mockery out of their field. Article after article followed with criticism of Bem’s statistical techniques, with Bem responding accordingly. Now, I’m not a statistician, so I have no idea whether or not these criticisms make valid points. But it all became moot soon, as a couple of high-profile replication attempts failed to reproduce Bem’s findings, and instead concluded that, hey, retroactive influence wasn’t a thing after all. The lesson that psychologists took away from this was This Is Why Replications Are Important, and they stopped talking about Bem not too long after.
But Bem didn’t just go away. Like a good storybook villain, he just bided his time…and returned four years later, this time even stronger.
This is where the real fun begins.
In 2015, Bem came out with a meta-analysis of 90 different studies on retroactive influence; 10 of these were Bem’s own studies, 69 were either exact or modified replications of Bem’s original studies, and 11 tested for retroactive influence in alternative ways. The upshot is that this conglomeration of studies found statistically significant evidence for retroactive influence, p = 1.2 × 10⁻¹⁰.
That’s not a typo. That really is p = 0.00000000012.
So psychology…you wanted replications? Well, Bem just got a whole truckful and delivered them to your front door, and together they blew through your standard threshold of p < 0.05 by a factor of more than 400 million. Satisfied yet?
A standard response here might be that p-values are suboptimal measures anyway. I mean, didn’t I have a whole thing in Part 3 about p-values measuring the probability of getting the data given the null hypothesis, not the other way around?
The most frequently espoused alternative to p-values would be to use Bayes factors. Bayes factors tell you how much more likely you are to get the evidence you did in a world where your hypothesis was true, compared to a world in which your hypothesis was false. For example, say I do a coin-flipping experiment testing to see if a coin is biased, and I flip 10 heads in a row. Without doing any math, let’s just assume the Bayes factor comes out to 5 in this case. This would mean that this result (10 heads in a row) was 5 times more likely to happen in a world where the coin was biased than in a world where the coin was fair, so I should revise my beliefs toward the hypothesis that the coin was biased (which doesn’t mean I have to change my mind entirely, just incrementally update my beliefs in proportion to the strength of the evidence). Generally speaking, obtaining a Bayes factor of 100 is considered to be decisive evidence for a hypothesis.
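The "incrementally update" step is just multiplication on the odds scale: posterior odds = prior odds × Bayes factor. A tiny sketch, where the 10% starting belief that the coin is biased is a number I made up for illustration:

```python
def posterior_probability(prior, bayes_factor):
    """Update a prior probability using posterior odds = prior odds * Bayes factor."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Assumed prior: 10% chance the coin is biased before any flips.
    print(f"Bayes factor 5:   {posterior_probability(0.10, 5):.1%}")
    print(f"Bayes factor 100: {posterior_probability(0.10, 100):.1%}")
```

With a Bayes factor of 5, a 10% prior moves to roughly 36%: a real shift, but nowhere near certainty. With a Bayes factor of 100, the same prior lands above 90%, which gives a feel for why that level gets called decisive.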
Bem’s hypothesis was that retroactive influence is a thing. So what was the Bayes factor in Bem’s meta-analysis?
Oh, you know, just a casual 5.1 billion.
…But p-hacking! Publication bias!! These are things that could have made this a bad meta-analysis, right?
Bem is a careful guy and thought of this, too. He carried out nine statistical tests to test for the presence of p-hacking or publication bias. Eight out of nine came up empty. The ninth, which was testing for p-hacking, didn’t prove that there was any, but rather was inconclusive.
So we’re put in the uncomfortable situation where an extremely well-done meta-analysis, certainly in the top tier of meta-analyses in general, came up with pretty damn strong evidence for a phenomenon that’s physically impossible.
We found a snake at the top of the pyramid.
Now, I don’t want to give the impression that nobody had anything to say about this. There have been a few responses trying to figure out what exactly happened here, some of which I find plausible and most of which I don’t understand.
But the larger point is this: If Bem could, in full conformity with common scientific practice, carry out a meta-analysis that is probably higher quality than 95% of meta-analyses out there and find evidence for something non-real, then what does this say about science?
And it’s not just Bem. Plenty of high-quality meta-analyses in the past have found evidence for spooky phenomena: This one finds evidence of precognition, with subjects being able to predict which of several potential targets (e.g., faces on a die, cards in a deck) will be selected by a random number generator in the future (sample size = 309 studies, p = 6.3 × 10⁻²⁵, publication bias tested for and not found to be significant). And this one finds that people can influence random number generators to become non-random just by thinking about them hard enough (sample size = 832 studies, publication bias tested for and not found to be significant).
In fact, there’s a whole field of parapsychology out there, with its own journals and everything, where researchers test for the existence of psychic phenomena and are able to publish significant findings all the time. Mainstream scientists keep trying to make them go away, and parapsychologists respond that they’re following all the rules of modern science better than most scientists, so they should be allowed to play, too.
So…what does this mean for mainstream science?
A couple people have put forth the idea of thinking of parapsychology as a control group for science; basically, whatever percentage of papers are published by parapsychologists for non-real phenomena is the percentage of papers that we have to expect are wrong in mainstream science, too. I think this is a neat idea.10 As we’ve discussed in the previous seven parts of this blog post, there are a ton of problems with modern science. If all these problems are present in parapsychology, too, then parapsychology gives us a great example of the number of false findings that would be generated as a result of these problems alone.
Like a sugar pill, parapsychology is devoid of any content whatsoever, and yet it keeps producing significant effects. We’d be wise to take this base level of activity into account when evaluating the rest of science.
(…Oh, speaking of sugar pills, here’s another snake at the top that says homeopathic treatments tend to do quite a bit better than placebo. I think dealing with parapsychology is enough for one day, though, so let’s leave that one alone.)
Part 9: A conclusion
Science in theory is pretty great, but in practice, science is done by people. People have flaws, and these flaws infiltrate scientific practice. Confirmation bias translates into researcher allegiance bias; you’re more likely to obtain a positive result if you believe in the hypothesis you’re testing. The reasons for this may be due to subconscious processes and are hard to tease out, but part of it is probably due to p-hacking, or using shady statistical tricks to get your p-value down below the 0.05 threshold generally used to distinguish publishable from non-publishable results. Researchers are partly incentivized to p-hack due to the increasingly hypercompetitive space in academia, driven by the ballooning of grant applications while federal funding has plateaued. You would hope that peer review would catch studies relying on these p-hacking techniques before they’re able to contaminate the published literature, but the peer review process itself is riddled with randomness and bias, and is not exactly a reliable filter that lets only high-quality studies through. All of this means that a lot of studies don’t replicate, and most published research findings may be false. Going up to the level of meta-analysis doesn’t solve all your problems, because even good meta-analyses of good studies can be driven to the wrong conclusions due to publication bias; positive results are more likely to be included in meta-analyses because they’re actually published, while negative results languish in file drawers. And, as Bem and the field of parapsychology have been proving for many years now, even really really good meta-analyses that take publication bias into account can lead to conclusions that the rest of us are pretty sure are false, like that events in the future can reach back in time and influence us subconsciously in the present.
…Whew. Based on that summary, science seems in pretty bad shape right now; most of the research at the bottom of the epistemic pyramid is probably wrong and even the top has been infiltrated by snakes. But I don’t want to leave you with an overly skeptical view of science. There are a few caveats I have to make about everything I’ve said so far.
First: Some of you may have noticed the apparent irony in me relying on scientific studies to criticize science. The reason is this: Using science is better than not using science. Science may have its problems, but it’s still the best tool we have. I mean, when it comes to making claims about the world, we can either: a) systematically test those claims by throwing them against the world, seeing how the world responds, and trying to interpret the response the best we can, or b) rely on our intuitions, which are a product of evolutionary processes that optimized for survival and reproduction in the ancestral environment, not for truth-finding in the modern world. As bad as we might be at a), I’ll choose a) over b) any day.
Second: Science isn’t actually as bad as I made it out to be. In my digging into the meta-science literature over the past month or so, I kept finding little nuggets of good news, and I would be remiss if I didn’t mention them here. You know that meta-meta-analysis of researcher allegiance bias that I mentioned wayyyy back in Part 2, the one that found a “substantial and robust” association between researcher allegiance and study outcome? Well, by “substantial” it meant that researcher allegiance bias only explained 7% of the variance between study outcomes. This study on p-hacking found that, while p-hacking is widespread in the scientific literature, it probably doesn’t affect the conclusions of meta-analyses all that much, since p-hacking is more common in studies with small sample sizes, which are given less weight in meta-analyses anyway. This study comes to a similar conclusion on publication bias; although widespread, it may only affect the conclusions of ~10% of meta-analyses. Finally, contrary to what the Reproducibility Project found on their replications of 100 psychology papers, ~70-80% of replications in psychology overall are successful. So while science may have problems, we have to be honest about the scale and impact of these problems as well.
Third: A lot of the problems with science are concentrated in two fields: psychology and biomedicine. If you’re looking at literature outside those fields, you can be way more confident in the results. Heck, in some fields the problems I mentioned with science don’t even apply. In my own field of synthetic organic chemistry, for example, basically what I do is “make thing, report that I made thing.” There are no p-values involved, so p-hacking isn’t an issue–and researcher allegiance bias has its work cut out for it, since it would be really hard for me to convince myself that I made something that I didn’t. In general, results obtained in the hard sciences seem to me to be basically trustworthy.
Last, and this is the optimistic, future-looking conclusion I want to leave you with: Scientists are not blind to all these problems. There seems to be a general awareness that researcher allegiance bias, p-hacking, etc. are things to be dealt with, and there has been plenty of discussion in the media on how to improve each aspect of science that I talked about. Organizations have popped up that specifically focus on improving scientific practice. Remember John “most-published-research-findings-are-false” Ioannidis? He co-founded the Meta-Research Innovation Center at Stanford (METRICS) in 2014, which will aim to improve the quality of scientific research through conducting meta-science research, among other things. And Brian “40%-of-psychology-experiments-don’t-replicate” Nosek co-founded the Center for Open Science in 2013, which has a similar mission and is currently carrying out a large-scale reproducibility project on studies in cancer biology.
So I would like to think that meta-science has momentum. Science may have its problems, but scientists recognize these problems and are hard at work trying to fix them. And if there’s any group of people I would trust to solve problems like these, it’s scientists. After all, these people are ambitious, talented, insanely smart, curious, data-driven, truth-hungry. If you tell them there’s an obstacle between them and the truth about reality, you better believe that obstacle is going to come crashing down.
Biases and flaws may be viruses infecting science right now, but those viruses are soon going to find out that science has got one hell of an immune system. And as far as infections go, this is no more than a common cold; give science a little time, and it’ll come back stronger than ever. You think you’re going to keep science bedridden for long? Come onnnnn.
I look forward to the day that science has put the proper institutions and practices in place so that it’s functioning at 100%. Because then…look out, world. Science is on the move, and it’s got truth to find.
1The snarky response here is that Munder et al. were obviously biased in favor of researcher allegiance hypothesis hypothesis, the hypothesis that researchers with an allegiance to researcher allegiance hypothesis are more likely to find associations between researcher allegiance hypothesis and study outcome. Munder et al. can’t be trusted! We need a meta-meta-meta-analysis!↵
3Notation: “P(A|B)” means “the probability of event A, given condition B.”↵
4Among the pool of non-US-authors, the situation was even worse for authors from non-English-speaking countries, with an additional 15% of bias tacked on.↵
5Actually, while non-US authors were 22% less likely to have their abstracts accepted, US authors were 28% more likely to have theirs accepted, because math (the percentages differ depending on which way you phrase it).↵
6I’m assuming, here. He is fictional, after all.↵
7Contrary to popular belief, the reasons for this bias towards positive publications seem to have less to do with editorial bias against negative publications than with scientists not writing up negative results in the first place, either because they have a perception of an editorial bias against negative publications or because they just aren’t interested.↵
8He also includes a variable to take into account publication bias, but we’ll leave that out for simplicity.↵
9Another way to have low pre-study odds would be to test a hypothesis that goes against scientific consensus; if you were to test the hypothesis that climate change wasn’t happening at all, then your pre-study odds would be pretty low, for example. But if you were to find a statistically significant result and publish it, that would be exactly the kind of publication that you’d expect to make it into high-profile journals! “NEW WELL-DONE STUDY OVERTURNS SCIENTIFIC CONSENSUS (p = 0.01)!” The more surprising a result, the more likely it’ll get attention, and yet a more surprising result must have had lower odds to begin with–otherwise it wouldn’t be surprising–and we know lower pre-study odds lead to a lower probability of truth. So here’s another reason why high-profile journals might have more retractions.↵
10At least in theory, as a way of adjusting your model of science in your mind. I have no idea what it would even look like in practice.↵