We Should Redefine Statistical Significance

July 23, 2017

Researchers representing a wide range of disciplines and statistical perspectives—72 of us in total—have posted a new paper on PsyArXiv describing a place of common ground. We argue that statistical significance should be redefined. The paper is forthcoming in Nature Human Behaviour.

For claims of discoveries of novel effects, the paper advocates a change in the P-value threshold for a “statistically significant” result from 0.05 to 0.005. Results currently called “significant” that do not meet the new threshold would be called suggestive and treated as ambiguous as to whether there is an effect. The idea of changing the statistical significance threshold to 0.005 has been proposed before, but the fact that this paper is authored by statisticians and scientists from a range of disciplines—including psychology, economics, sociology, anthropology, medicine, epidemiology, ecology, and philosophy—indicates that the proposal now has broad support.

The paper highlights a fact that statisticians have known for a long time but which is not widely recognized in many scientific communities: evidence that is statistically significant at P = 0.05 actually constitutes fairly weak evidence. For example, for an experiment testing whether there is some effect of a treatment, the paper reports calculations of how different P-values translate into the odds that there is truly an effect vs. not. A P-value of 0.05 corresponds to odds that there is truly an effect that range, depending on assumptions, from 2.5:1 to 3.4:1. These odds are low, especially for surprising findings that are unlikely to be true positives in the first place. In contrast, a P-value of 0.005 corresponds to odds that there is truly an effect that range from 14:1 to 26:1, which is far more convincing. 
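The low end of these ranges can be sketched with the Sellke–Bayarri–Berger upper bound on the Bayes factor, 1/(−e·p·ln p), which is one way (among the assumptions the paper considers) of translating a P-value into maximum odds in favor of a true effect. A minimal Python illustration, using only the standard library:

```python
import math

def bayes_factor_bound(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes factor in favor
    of the alternative hypothesis; valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return 1 / (-math.e * p * math.log(p))

print(round(bayes_factor_bound(0.05), 1))   # ≈ 2.5, i.e. odds of 2.5:1
print(round(bayes_factor_bound(0.005), 1))  # ≈ 13.9, i.e. odds of about 14:1
```

The higher ends of the reported ranges (3.4:1 and 26:1) come from different assumptions about the alternative hypothesis; the point of the calculation is the same either way: P = 0.05 caps the evidence at fairly modest odds.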

An important impetus for the proposal is the growing concern that there is a “reproducibility crisis” in many scientific fields that is due to a high rate of false positives among the originally reported discoveries. Many problems (such as multiple hypothesis testing and low power) have contributed to this high rate of false positives, and we emphasize that it is important to address all of these problems. We argue, however, that tightening the standards for statistical significance is a simple step that would help. Indeed, the theoretical relationship between the P-value and the strength of the evidence is empirically supported: the lower the P-value of the reported effect in the original study, the more likely the effect was to be replicated in both the Reproducibility Project: Psychology and the Experimental Economics Replication Project.

Lowering the significance threshold is a strategy that has previously been used successfully to improve reproducibility in several scientific communities. The genetics research community moved to a “genome-wide significance threshold” of 5×10⁻⁸ over a decade ago, and the adoption of this standard helped to transform the field from one with a notoriously high false positive rate to one with a strong track record of robust findings. In high-energy physics, the tradition has long been to define significance for new discoveries by a “5-sigma” rule (roughly a P-value threshold of 3×10⁻⁷). The fact that other research communities have maintained a norm of significance thresholds more stringent than 0.05 suggests that transitioning to a more stringent threshold can be done.

Changing the significance threshold from 0.05 to 0.005 carries a cost, however: Apart from the semantic change in how published findings are described, the proposal also entails that studies should be powered based on the new 0.005 threshold. Compared to using the old 0.05 threshold, maintaining the same level of statistical power requires increasing sample sizes by about 70%. Such an increase in sample sizes means that fewer studies can be conducted using current experimental designs and budgets. But the paper argues that under realistic assumptions, the benefit would be large: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises. Increasing sample sizes is also desirable because studies with small sample sizes tend to yield inflated effect size estimates, and publication and other biases may be more likely in an environment of small studies.
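The roughly 70% figure follows from the standard power calculation for a two-sided z-test, where the required sample size is proportional to (z_{α/2} + z_β)². A short Python sketch of that ratio (the 0.80 power level here is an illustrative assumption, not a figure from the paper):

```python
from statistics import NormalDist

def sample_size_factor(alpha_old=0.05, alpha_new=0.005, power=0.80):
    """Ratio of sample sizes needed to keep the same power for a
    two-sided z-test when the significance threshold is tightened,
    since n is proportional to (z_{alpha/2} + z_beta)^2."""
    z = NormalDist().inv_cdf
    z_beta = z(power)
    n_old = (z(1 - alpha_old / 2) + z_beta) ** 2
    n_new = (z(1 - alpha_new / 2) + z_beta) ** 2
    return n_new / n_old

print(round(sample_size_factor(), 2))  # ≈ 1.70, i.e. samples ~70% larger
```

The ratio is fairly insensitive to the assumed power level, which is why a single headline figure of about 70% is a reasonable summary.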

In research communities where attaining larger sample sizes is simply infeasible (e.g., anthropological studies of a small-scale society), there is a related “cost”: most findings may no longer be statistically significant under the new definition. Our view is that this is not really a cost at all: calling findings with P-values in between 0.05 and 0.005 “suggestive” is actually a more accurate description of the strength of the evidence.

Indeed, the paper emphasizes that the proposal is about standards of evidence, not standards for policy action nor standards for publication.  Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods.  Evidence that does not reach the new significance threshold should be treated as suggestive, and where possible further evidence should be accumulated. Failing to reject the null hypothesis (still!) does not mean accepting the null hypothesis.

The paper anticipates and responds to several potential objections to the proposal. A large class of objections is that the proposal does not address the root problems, which include multiple hypothesis testing and insufficient attention to effect sizes—and in fact might reinforce some of the problems, such as the over-reliance on null hypothesis significance testing and bright-line thresholds. We essentially agree with these concerns. The paper stresses that reducing the P-value threshold complements—but does not substitute for—solutions to other problems, such as good study design, ex ante power calculations, pre-registration of planned analyses, replications, and transparent reporting of procedures and all statistical analyses conducted.

Many of the authors agree that there are better approaches to statistical analyses than null hypothesis significance testing and will continue to advocate for alternatives. The proposal is aimed at research communities that continue to rely on null hypothesis significance testing at a 0.05 threshold; for those communities, reducing the P-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility. Far from reinforcing the over-reliance on statistical significance, we hope that the change in the threshold—and the increased use of describing results with P-values between 0.05 and 0.005 as “suggestive”—will raise awareness of the limitations of relying so heavily on a P-value threshold and will thereby facilitate a longer-term transition to better approaches.

The proposed switch to a more demanding P-value threshold involves both a coordination problem (what threshold to use?) and a free-riding problem (why should I impose a more stringent threshold on myself unless others do?). The aim of the proposal is to help coordinate on 0.005 and to discourage free-riding on the old threshold. Ultimately, we believe that the new significance threshold will help researchers and readers to understand and communicate evidence more accurately.
