Objectives

The primary objective of SCORE was to create scalable and accurate indicators of repeatability that could be applied across large bodies of research. Central to this aim was the development of algorithmic tools capable of producing confidence scores for research claims, thereby allowing researchers, institutions, and other stakeholders to more efficiently identify claims that warrant further scrutiny.

This effort was supported by systematic examinations of three key dimensions of repeatability: reproducibility, or the ability to obtain the same results from the same data and analysis; robustness, or the extent to which results remain consistent across justified alternative analytical choices; and replicability, or the consistency of results when new data are collected to address the same research question.

With SCORE, we also sought to understand how these measures relate to one another, to expert and machine-generated predictions, and to other potentially relevant indicators such as disciplinary norms or journal policies. A further objective was to generate openly accessible datasets, algorithms, and replication and reanalysis materials, thus supporting continued innovation in credibility assessment across the scientific community.

"With contributions from almost 900 researchers, the SCORE program provides an enormous amount of evidence to explore and inspire hypotheses for the next round of research. The data and materials are shared publicly so that others might build on this work."

Sarah Rajtmajer, a SCORE project leader and Associate Professor at Pennsylvania State University

Approach

To achieve its aims, SCORE employed a multi-method approach that combined claim extraction, expert and machine assessments, and large-scale empirical evaluations of repeatability. The program began by identifying thousands of research claims from published articles in the social and behavioral sciences and then generated expert judgments and machine learning predictions about the credibility of those claims. These predictions served as candidate indicators that could be validated through empirical testing. The validation efforts consisted of three major empirical studies: reproducibility, robustness, and replicability.
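To give a concrete sense of the validation step, the sketch below shows one common way a predictive indicator can be scored against empirical outcomes: computing the area under the ROC curve for machine-generated confidence scores versus observed binary results. The scores, outcomes, and variable names are illustrative assumptions, not SCORE data or SCORE's actual evaluation metric.

```python
# A minimal sketch of validating predictive indicators against empirical
# outcomes. All inputs are hypothetical placeholders, not SCORE data.
from sklearn.metrics import roc_auc_score

# Hypothetical machine-generated confidence scores, one per claim.
claim_scores = [0.82, 0.35, 0.67, 0.91, 0.22, 0.58]
# Hypothetical empirical results (1 = claim held up, 0 = it did not).
claim_outcomes = [1, 0, 1, 1, 0, 0]

# AUC measures how often a claim that held up outranks one that did not;
# 0.5 is chance-level discrimination, 1.0 is perfect.
auc = roc_auc_score(claim_outcomes, claim_scores)
print(f"Discrimination of the indicator (AUC): {auc:.2f}")
```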

The reproducibility study examined whether original findings could be recreated using the same data and analyses. From a stratified random sample of 600 papers, assessments were conducted for those papers whose data were publicly available or successfully obtained from the authors.
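As a brief illustration of stratified random sampling, the hypothetical sketch below draws the same fraction of papers from each stratum so that every group is represented. The DataFrame, column names, and strata are invented for illustration and do not reflect SCORE's actual sampling frame.

```python
# A minimal sketch of stratified random sampling with pandas.
# The data, columns, and strata are hypothetical, not SCORE's frame.
import pandas as pd

papers = pd.DataFrame({
    "paper_id": range(1, 9),
    "discipline": ["econ", "econ", "psych", "psych",
                   "psych", "soc", "soc", "soc"],
})

# Sampling within each stratum keeps every discipline represented
# in proportion to its share of the population.
sample = papers.groupby("discipline", group_keys=False).sample(
    frac=0.5, random_state=1
)
print(sample)
```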

The robustness study investigated the degree to which research findings depend on analysts’ choices. For each of 100 selected claims, independent re-analysts worked from the same dataset, allowing the program to assess the extent of analytical variability and its implications for scientific conclusions.
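One simple way to quantify that analytical variability is to compare the standardized effect estimates produced by different analysts for the same claim. The sketch below uses invented numbers and a plain sign-agreement summary, not SCORE's actual robustness metrics.

```python
# A minimal sketch of summarizing multi-analyst variability for one claim.
# Effect estimates are hypothetical; SCORE's actual metrics may differ.
import statistics

# Standardized effect estimates, one per independent analysis team.
estimates = [0.21, 0.15, 0.30, -0.02, 0.18]

# Share of analyses agreeing with the original positive direction.
sign_agreement = sum(e > 0 for e in estimates) / len(estimates)
# Spread of the estimates across analysts.
spread = statistics.stdev(estimates)

print(f"Analyses with a positive effect: {sign_agreement:.0%}")
print(f"Dispersion across analysts (SD): {spread:.2f}")
```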

The replicability study tested whether original positive findings generalize to new data, using high-powered replication attempts of 274 claims drawn from 164 papers. Across these studies, SCORE integrated the resulting evidence to evaluate how reproducibility, robustness, and replicability relate to each other and to predictive assessments by humans or machines.
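To illustrate what "high-powered" means in practice, the sketch below solves for the per-group sample size needed to detect a hypothetical original effect in a two-sample design. The effect size, alpha, and power target are illustrative assumptions, not SCORE's replication protocol.

```python
# A minimal sketch of sizing a high-powered replication with statsmodels.
# The effect size and power target are assumptions, not SCORE's protocol.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.4,  # hypothetical Cohen's d from the original study
    alpha=0.05,       # conventional significance threshold
    power=0.90,       # "high-powered": above the customary 0.80
)
print(f"Required participants per group: {n_per_group:.0f}")
```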

Data, materials, and tools generated through this process are openly shared to support transparency, reuse, and further methodological development.


Questions about the project can be directed to cosscore@cos.io.