Following the advent of empiricism and the Age of Enlightenment, the collective body of human knowledge has grown exponentially. The scientific method has afforded us the inventions of antibiotics and vaccines and revealed the secrets of the atom and the genome. At its heart, science is the investigation of truth, empowered by statistical tests and reproducible methodology. However, at a time when more money, manpower, and manuscripts than ever before are being sunk into research, the integrity of our findings has never been more at stake.
A critical aspect of any potential scientific discovery is replicability. If something is true, then multiple individuals or labs should be able to independently verify it. Yet a staggering proportion of the 1,576 scientists surveyed in a 2016 Nature article, over 70 percent, reported having failed to reproduce an experiment. Psychology and medicine are two high-profile fields where failures of replication and validity are especially widespread. One Science study that attempted to replicate the results of 100 papers published in 2008 in three high-profile psychology journals was able to successfully reproduce only 39 percent of the original findings. Meanwhile, over 50 percent of preclinical medical studies have been estimated to be irreproducible. This is evidently problematic and may prove extremely costly, hindering scientific progress not only by denying resources to more rigorous studies, but also by consuming the time and grant money of subsequent studies that attempt to build on fallacious claims.
Many fields use an alpha level (α-level) of 0.05 to assess statistical significance—that is to say, if the probability of observing the collected data under a pre-defined “null” hypothesis is less than 0.05, the null hypothesis is rejected and the “alternative” hypothesis, which is usually of greater interest, is supported. Under this framework, even when nothing of interest is actually happening, only about 5 percent of tests would be expected to come out significant by chance alone, producing findings that are not “true” and thus not replicable. What, then, might explain the discrepancy between this theoretical minority and the observed near-majority of irreplicable results?
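To see where that 5 percent baseline comes from, the short simulation below (a minimal sketch in Python, not drawn from any of the studies cited here; the two-sample t-test, group sizes, and number of repetitions are all illustrative assumptions) repeatedly compares two groups drawn from the same distribution and counts how often the test crosses the α = 0.05 threshold anyway.

```python
# Minimal sketch: when the null hypothesis is true, roughly 5% of tests
# still come out "significant" at alpha = 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the *same* distribution: no real effect exists.
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_positives += 1

print(f"Fraction of 'significant' results with no real effect: "
      f"{false_positives / n_experiments:.3f}")  # hovers around 0.05
```

Running this yields a false-positive fraction close to 0.05, which is the theoretical floor the article contrasts against the much higher observed rates of irreplicability.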
One possible factor is selection bias at the publication level. Even—and perhaps especially—the most reputable journals are more likely to accept studies that disclose novel, positive findings than ones that reveal the null hypothesis (i.e. that nothing of interest is happening) to be supported. This leads to selective acceptance of manuscripts that have been screened for significance. In fact, there is evidence to suggest that papers that fail to replicate are more highly cited than replicable papers, a phenomenon that may be ascribed to the juicier headlines associated with articles advertising shakier claims. Yet, if all failures of replication could be attributed to this explanation, it would not necessarily be a detriment to science. Jeffrey Mogil, a Canada Research Chair holder, has argued that the nature of pushing boundaries in science leads to replicability below 95 percent, especially given the arbitrary nature of α = 0.05 as a cut-off. A finding with a p-value just below 0.05 may land just above the cut-off on a second attempt and thus appear not to replicate.
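This last point can be made concrete with another rough sketch (again in Python, with an assumed modest true effect and assumed sample sizes chosen purely for illustration): among simulated studies whose original p-value only barely clears 0.05, an exact repeat of the same experiment frequently misses the cut-off.

```python
# Minimal sketch: how often an exact replication of a "barely significant"
# finding (0.01 < p < 0.05) lands back above the alpha = 0.05 cut-off.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
effect_size = 0.35        # assumed true standardized effect (illustrative)
n_per_group = 30          # assumed sample size per group (illustrative)
marginal_originals = 0
failed_replications = 0

for _ in range(50_000):
    a = rng.normal(effect_size, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    _, p_orig = stats.ttest_ind(a, b)
    if 0.01 < p_orig < alpha:            # original study is barely significant
        marginal_originals += 1
        a2 = rng.normal(effect_size, 1.0, n_per_group)
        b2 = rng.normal(0.0, 1.0, n_per_group)
        _, p_rep = stats.ttest_ind(a2, b2)
        if p_rep >= alpha:               # replication misses the cut-off
            failed_replications += 1

print(f"Marginally significant originals: {marginal_originals}")
print(f"Fraction whose exact replication 'failed': "
      f"{failed_replications / marginal_originals:.2f}")
```

Under these assumptions the "failed" replications are common even though the effect is real, illustrating Mogil's point that replicability well below 95 percent need not imply bad science.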
Another, less honest, contributor to the replication crisis is questionable scientific methodology. One manifestation of this is p-hacking, a practice in which experimenters preferentially use data analysis approaches that produce more desirable p-values. This has the effect of exaggerating the strength of a proposed effect or finding connections where there aren’t any—for example, by dropping measurements deemed invalid or by failing to correct for the inflation that comes with running multiple statistical tests. Alternatively, it may take the form of rearranging a study’s chronology, reframing exploratory analyses as confirmatory ones to strengthen the persuasive power of a paper. As careers, grants, and reputations are placed on the line, the incentive to massage data to better fit one’s desired hypothesis only grows.
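The multiple-testing form of p-hacking is easy to demonstrate with one more hedged sketch (Python again; the 20 outcome measures and group sizes are hypothetical numbers chosen for illustration): if a study measures many outcomes where no real effects exist and reports only the most favourable uncorrected p-value, the chance of claiming a "significant" result climbs far above 5 percent.

```python
# Minimal sketch: testing many outcomes and reporting only the smallest
# uncorrected p-value inflates the false-positive rate well beyond 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_outcomes = 20           # assumed number of outcomes "tried" per study
n_studies = 10_000
hacked_hits = 0

for _ in range(n_studies):
    p_values = []
    for _ in range(n_outcomes):
        # The null is true for every outcome: the groups never differ.
        a = rng.normal(0.0, 1.0, 30)
        b = rng.normal(0.0, 1.0, 30)
        _, p = stats.ttest_ind(a, b)
        p_values.append(p)
    # The p-hacker reports only the most favourable outcome, uncorrected.
    if min(p_values) < alpha:
        hacked_hits += 1

print(f"Studies claiming a 'significant' effect despite no real effects: "
      f"{hacked_hits / n_studies:.2f}")   # roughly 1 - 0.95**20, about 0.64
```

A standard correction for multiple comparisons (for example, dividing α by the number of tests, as in the Bonferroni method) brings this rate back down toward the nominal 5 percent.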
Several possible solutions have been suggested to combat this replication crisis. In preregistration, scientists report ahead of time a detailed plan of their study, including hypotheses, data collection and curation methods, and statistical tests. P-hacking is thus minimized by reducing the number of steps in the scientific process that may be influenced by personal bias. Changes to methodology may still be implemented, but they must be openly reported and justified alongside the preregistration report. The natural extension of this approach is registered reports, in which journals guarantee publication of a study's results, whatever they turn out to be, on the basis of its pre-defined methodology and analysis plan, thus also reducing selection bias.
Only time will tell whether these approaches result in more reproducible science. The number of preregistrations is increasing year after year, and growing awareness of replication issues has led to greater statistical education in the scientific community. As students and potential future academics, we too have a responsibility to remain cognisant of these pitfalls, in science and beyond, and to avoid compromising the integrity of the work we do.