Large-Scale Collaboration Releases New Findings on Research Credibility

Apr. 1, 2026

Media Contact: pr@cos.io

Results from the SCORE program, published in Nature and several preprints, assess multiple dimensions of credibility in social and behavioral science research.

Findings from the Systematizing Confidence in Open Research and Evidence (SCORE) program—a collaborative effort involving 865 researchers—have been published in Nature as a collection of three papers, alongside five additional preprints. The SCORE program offers new empirical evidence on the reproducibility, robustness, and replicability of research across the social and behavioral sciences, and on the predictability of replicability.

The SCORE program examined the capability of humans and machines to predict the replicability of research findings. In the process, SCORE accumulated an enormous database of information about the credibility of a large sample of findings from across the social and behavioral sciences. The program’s outcomes will contribute to strengthening how research is interpreted and communicated, supporting authors’, reviewers’, funders’, policymakers’, and readers’ understanding and use of research evidence. Improving credibility assessment will help direct attention and resources for further research to where they have the greatest impact in accelerating the production of knowledge and solutions.

Funded by the U.S. Defense Advanced Research Projects Agency (DARPA), SCORE is a large-scale, multi-method research initiative designed to improve how scientific credibility is assessed in the social and behavioral sciences. The program examines multiple dimensions of research repeatability—including reproducibility, robustness, and replicability—to better understand the credibility of published findings from multiple perspectives. The SCORE team sampled claims from 3,900 papers published from 2009 to 2018 in 62 journals spanning criminology, economics, education, finance, health, management, marketing, organizational behavior, psychology, political science, public administration, and sociology. These claims were subjected to a variety of credibility assessment methods.

The contributions of hundreds of researchers were coordinated by several lead teams. Sampling of claims, gathering of credibility measures, and conducting of replication and reproduction studies were coordinated by the Center for Open Science (COS). Human expert assessments were conducted by two independent teams, the repliCATS project and Replication Markets, to evaluate the viability and accuracy of forecasting research replicability. Three teams led by researchers at Pennsylvania State University, TwoSix Technologies, and the University of Southern California implemented machine-learning and algorithmic approaches to predicting replicability. And the Metascience Lab at Eötvös Loránd University coordinated the robustness assessments.

A foundational contribution of the program is affirming emerging standards for terminology related to the credibility and trustworthiness of research. Specifically, reproducibility, robustness, and replicability refer to distinct aspects of the repeatability of evidence—an important component of creating generalizable knowledge. A preprint from Nosek and colleagues explains this terminology to support clear and consistent understanding.

Across its studies, SCORE findings suggest that reproducibility, robustness, and replicability each capture distinct aspects of research credibility, and that published claims vary in how well they hold up under these distinct forms of scrutiny. The following are brief summaries of each of the three papers appearing in Nature.

Reproducibility refers to conducting the same analysis on the same data and assessing whether the finding is the same as reported in the original paper.

As reported by Miske and 127 co-authors, SCORE revealed limited transparency, which often made reproducibility and robustness assessment infeasible. Data were available for only 24% of a sample of 600 assessed papers. Of the 143 papers subjected to reproduction tests, 74% reproduced at least approximately and 54% reproduced precisely. Success was associated with how much material the original authors shared. Approximate (91%) and precise (77%) reproducibility rates were highest for papers where both the original data and code were shared, and lowest (38% and 11%) when reanalysis required reconstructing the original dataset from public sources (e.g., retrieving census data and reconstructing the data management and analysis steps reported in the paper).

Robustness refers to conducting alternative reasonable analyses on the same data and assessing whether the findings are similar to what was reported in the original paper.

As reported by Aczel and 490 co-authors, SCORE revealed hidden uncertainty in research findings through systematic testing of the analytical robustness of 100 papers. For each paper, at least five independent analysts tested the same question with the same data, applying their own decisions about how best to analyze the data. Within a narrow tolerance range (±0.05 Cohen’s d units), 34% of independent reanalyses produced the same result as the original finding; with a tolerance range four times that size, 57% did. Regarding the conclusions drawn, 74% of reanalyses arrived at the same conclusion as the original investigation, 24% at a null or inconclusive result, and 2% at the opposite effect.

Replicability refers to testing the same question in new data and assessing whether the findings are similar to what was reported in the original paper.

As reported by Tyner and 291 co-authors, SCORE revealed that replicating original findings with independent data is challenging. Of 164 papers subjected to replication attempts, 49% replicated successfully according to the most common criterion for assessing replication (statistical significance with the same pattern of results as the original study), and the observed effect sizes in replication studies (0.10 in Pearson’s r units) were less than half the magnitude of those in the original studies (0.25).

The five preprints released alongside the Nature collection provide additional evidence about credibility and predictability of research findings:

  • Abatayo and 85 co-authors combined evidence across the SCORE program and observed that measures of repeatability and credibility are only weakly related to one another. This suggests that credibility is multidimensional and that there are unlikely to be broadly applicable shortcuts to establishing the validity and reliability of research findings.

  • Mody and 33 co-authors reported evidence from two distinct methods of eliciting predictions from people about the replicability of findings: repliCATS and Replication Markets. They observed that human assessments are reasonably accurate at predicting replication outcomes (76% and 78% success rates by the best-performing metric for the two methods, respectively).

  • Rajtmajer and 39 co-authors reported evidence from three distinct automated methods of eliciting predictions from machines about the replicability of findings: Synthetic Markets, MACROSCORE, and A+. None of the three methods was consistently effective at predicting which claims would replicate successfully, suggesting caution about earlier evidence of the successful use of machine methods for such assessments.

  • Eight of the leaders of the SCORE program (Nosek, Aczel, Errington, Fidler, Mody, Rajtmajer, Szászi, and Tyner) comment on the implications of the SCORE program and the opportunity to reimagine assessment of research credibility.

  • Five members of the Center for Open Science provided a brief review of the reproducibility, robustness, and replicability terminology that underlies the meaning and interpretation of the findings from the SCORE program.

Together, these eight papers offer the following conclusions:

  • These findings replicate and extend prior systematic replication efforts in fields such as cancer biology and psychology, with about half of attempts successfully replicating original findings across a diverse sample of research from the quantitative social and behavioral sciences.

  • These findings also replicate and extend prior reproduction and robustness attempts in a variety of fields. A quarter to a third of attempts failed to show the same or similar results when repeating the same analysis as the original paper (reproducibility), or came to different conclusions when conducting reasonable alternative analyses of the same question (robustness).

  • These findings replicate and extend prior evidence that humans can forecast replication success with reasonable accuracy, and provide weaker evidence than prior studies that machines can provide similarly accurate forecasts.

  • No particular field in the social and behavioral sciences demonstrated consistently higher repeatability than other fields across the three approaches. However, for reproducibility specifically, substantial differences in data availability were associated with higher reproducibility rates in economics and political science compared with other fields.

  • Reform efforts in the social and behavioral sciences over the last 10 years might yield higher repeatability than was observed in SCORE, whose findings were based on papers published from 2009 to 2018. For example, as highlighted by Miske and colleagues, journal policies across the social and behavioral sciences have strengthened since that time to require sharing of data and code, and even to include reproducibility checks as part of the publication process.

  • Repeatability and credibility assessments are highly diverse. There is no singular assessment of the credibility of a research finding, highlighting the complex process of knowledge production.

“The main message of SCORE is a simple one: research is hard. And, in some ways, the hard work begins after making a discovery. A tremendous amount of effort is needed to verify and have enough confidence in new discoveries to build foundations for further discovery,” said Tim Errington, Senior Director of Research at COS and one of the SCORE project leaders.

The results reveal that there is no single indicator of the repeatability of evidence, or of research credibility more generally. There is substantial opportunity for innovation in developing indicators that assess credibility and diversify the understanding of how trustworthy findings are established.

As another SCORE project leader, Fiona Fidler, Professor at the University of Melbourne, shared, “There are a lot of open questions about the factors that foster credibility and repeatability of research findings. Like many productive research efforts, SCORE generated insights, and has prompted even more questions about how to evaluate research in practice.”

In addition to its primary scientific findings, SCORE has generated openly accessible datasets, algorithms, and replication and reanalysis materials. These outputs will support further research on scientific credibility, potentially including development and validation of indicators to improve credibility assessment and accelerate discovery.

“With contributions from almost 900 researchers, the SCORE program provides an enormous amount of evidence to explore and inspire hypotheses for the next round of research. The data and materials are shared publicly so that others might build on this work,” said Sarah Rajtmajer, a SCORE project leader and Associate Professor at Pennsylvania State University.

Visit the website for an overview of the SCORE program, links to the papers, press releases for each paper, and other context for understanding the project, its findings, and their implications.

###

About SCORE

Systematizing Confidence in Open Research and Evidence (SCORE) was a large-scale, multi-method research initiative designed to improve the assessment of scientific credibility in the social and behavioral sciences. Recognizing that evaluating the trustworthiness of research claims is essential but resource-intensive, SCORE aimed to develop scalable, accurate tools for estimating credibility. The program combined expert judgments, machine learning approaches, and empirical assessments of repeatability—including reproducibility, robustness, and replicability—to validate credibility indicators. In addition to its primary scientific goals, SCORE produced openly accessible datasets, algorithms, and evidence that offer unprecedented insight into the state of research credibility. The project began in 2019 and the primary outcomes and outputs were reported and shared in 2026.

###

About the repliCATS project

A team led by Fiona Fidler at the University of Melbourne used a structured group deliberation approach to crowdsource human assessments of the replicability of claims.

About Replication Markets

A team of researchers led by Charles Twardy at Amentum developed and ran prediction markets eliciting human assessments of the replicability of research claims.

About Synthetic Markets

A team led by Sarah Rajtmajer at The Pennsylvania State University developed bot-populated prediction markets to predict the replicability of claims.

About MACROSCORE

A team led by Principal Investigator Jay Pujara at the University of Southern California developed the MACROSCORE system for automated assessment of the replicability of claims.

About A+

The A+ system for automated assessment of replicability of claims was developed at TwoSix Technologies (Principal Investigator: James Gentile).

About Metascience Lab

The Metascience Lab at Eötvös Loránd University (Principal Investigator: Balazs Aczel) led the robustness studies conducted in association with the SCORE program.

About COS

Founded in 2013, COS is a nonprofit culture change organization with a mission to increase openness, integrity, and trustworthiness of scientific research. COS pursues this mission by building communities around open science practices, supporting metascience research, and developing and maintaining free, open source software tools, including the Open Science Framework (OSF).
