Predicting Replicability Challenge: Round 1 Results and Round 2 Opportunity

November 24, 2025

Replicability refers to observing evidence for a prior research claim in data that are independent of the original study. It is one component of establishing the credibility of research claims: it demonstrates that there is a regularity in nature to be understood and explained. Repeating studies to assess and establish replicability can be costly and time-consuming. Partly, that is inevitable. Research is hard, and establishing reliable and valid evidence takes substantial resources and patience. But there is also room for innovation to gain some understanding of replicability more quickly and efficiently.

The Center for Open Science (COS) launched a public competition earlier this year to investigate automated assessments of replicability of research claims. Several teams have previously published evidence that humans and machines can reliably predict whether research claims are replicable. If automated methods can achieve high validity and precision, they would enable key stakeholders like researchers, policymakers, and funders to improve strategic allocation of resources toward examining claims that are important but have uncertain replicability. And if automated methods are unable to achieve sufficient validity and precision for responsible use, it is important to have that evidence, particularly given the substantial enthusiasm and investment in AI.

Ten teams participated in the first round of the competition. Each developed its own solution for automating the assessment of replicability of research claims, using whatever approach it chose: LLMs, LLM agents, machine learning methods, or other automated methods.

COS gave the teams training data from prior systematic replication projects with hundreds of replication outcomes to develop their methods. We also gave them 132 research claims from papers in the social and behavioral sciences to generate predictions. Teams had to generate a score from 0 to 1, reflecting the likelihood that each research claim will replicate in a new sample of data. Each team could submit up to three sets of scores for evaluation; we received 30 sets in total.
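In practice, each set of scores amounts to one probability per claim. The sketch below checks a hypothetical submission before it is sent in; the CSV layout, file name, and column names are illustrative assumptions, not the competition's actual submission format.

```python
import csv

def validate_submission(path, expected_n=132):
    """Check that a hypothetical scores file has one probability per claim,
    each between 0 and 1. The (claim_id, score) CSV layout is assumed."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    assert len(rows) == expected_n, f"expected {expected_n} claims, got {len(rows)}"
    for row in rows:
        score = float(row["score"])
        assert 0.0 <= score <= 1.0, f"score out of range for {row['claim_id']}: {score}"
    return rows

# Hypothetical usage:
# scores = validate_submission("round1_scores.csv")
```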

For each claim, teams received article metadata, the result that was selected for replication, and some key statistical outcomes, including either the reported or recomputed effect size and p-value when available. The examples below show the information teams received about three of the claims. More details on these and other claims are available on the OSF.

Claim text: As we predicted, a one-way analysis of variance (ANOVA) on subjective time perceptions showed that perceiving more conflict between goals made people feel more pressed for time [F(3,119) = 4.42, p < .05, partial eta-squared = .100].
Effect value: .11
Effect type: Cohen's f^2
p-value: .006
p-value type: exact
p-value tails: two-tailed

Claim text: Effects of unemployment rates and GDP growth are significant on all EU-framing dimensions (with the exception of an insignificant effect of unemployment rate on the libertarian dimension) and go in the expected direction: an increase in unemployment rates and a decrease in GDP growth implies a higher level of negative EU framing (i.e. the communitarian and libertarian dimensions) and a lower level of positive EU framing (i.e. the cosmopolitan and utilitarian dimensions) (from 'Model 1 -- Cosmopolitan' column in Table 3, 'Unempl. rate' independent variable: estimate = -0.00340; t statistic = -4.03; P < 0.001).
p-value: .001
p-value type: less-than
p-value tails: (unknown)

Claim text: Results from models predicting college enrollment using the sample of respondents age 15 and over in 1997 (i.e. eligible to answer the college expectations question) are shown in Table 3. Similar to the results in Table 2, fertility expectations were negatively associated with 4-year college enrollment. [Table 3, Without College Expectation, 4 Year, Fertility expectation: B = -0.11, SE = 0.03, RRR = 0.89, p < .01]
Effect value: .89
Effect type: Relative risk ratio
p-value: .0002
p-value type: exact
p-value tails: two-tailed
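For teams building automated pipelines, a claim like the examples above might be represented as structured data. The sketch below is a minimal, hypothetical representation; the field names are illustrative and are not the competition's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    """One research claim as a structured record (hypothetical schema)."""
    claim_text: str
    effect_value: Optional[float]  # e.g., 0.11
    effect_type: Optional[str]     # e.g., "Cohen's f^2"
    p_value: Optional[float]       # e.g., 0.006
    p_value_type: Optional[str]    # "exact" or "less-than"
    p_value_tails: Optional[str]   # "two-tailed", "one-tailed", or None if unknown

example = Claim(
    claim_text=("A one-way ANOVA on subjective time perceptions showed that perceiving "
                "more conflict between goals made people feel more pressed for time."),
    effect_value=0.11,
    effect_type="Cohen's f^2",
    p_value=0.006,
    p_value_type="exact",
    p_value_tails="two-tailed",
)
```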


We use the outcomes of replication studies that were conducted as part of the SCORE program funded by the Defense Advanced Research Projects Agency (DARPA) to evaluate the teams’ predictions. These replication outcomes are not yet publicly available, so teams generated their scores without knowing which claims were replicated or what the outcomes of the replication studies were.

Scores were evaluated using the Brier score, a widely used proper scoring rule for assessing probabilistic predictions. This metric captures both calibration (whether the numerical scores correspond to the observed replication rates) and discrimination (whether higher confidence scores are associated with a higher probability of successful replication, and lower scores with a lower probability). A lower total Brier score indicates better performance. More details are available here.
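Concretely, the Brier score is the mean squared difference between each predicted probability and the binary replication outcome (1 = replicated, 0 = did not replicate). The snippet below is a minimal sketch with made-up numbers, not the competition's evaluation code.

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and binary outcomes."""
    assert len(predictions) == len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

# Made-up example: three claims, two of which replicated.
preds = [0.8, 0.3, 0.6]
outcomes = [1, 0, 1]
print(brier_score(preds, outcomes))  # (0.04 + 0.09 + 0.16) / 3 = 0.0967
```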

Round 1 Results

The figure below shows the Round 1 results for each team’s best-performing set of scores, along with two baselines for comparison: a constant set in which every score is 0.5 (a 50/50 chance of successful replication), and a random set in which each score is drawn randomly between 0 and 1. In Round 1, every team outperformed the random baseline, and every team underperformed the constant baseline.

[Figure: Round 1 Brier scores for each team's best submission, with constant and random baselines]

These results highlight the difficulty of predicting replicability. It is much cheaper to select a constant value to predict replication outcomes than to build AI solutions. To establish the validity and usefulness of AI methods, teams will need to improve their predictions in the subsequent rounds of the competition. 
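To make the comparison with the constant baseline concrete: a score of 0.5 on every claim incurs a squared error of 0.25 whether or not the claim replicates, so its Brier score is exactly 0.25. Assuming the random baseline draws scores uniformly between 0 and 1, its expected Brier score is 1/3. A minimal, self-contained check:

```python
import random

def brier_score(predictions, outcomes):
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

random.seed(0)
outcomes = [random.randint(0, 1) for _ in range(10_000)]  # hypothetical replication outcomes

constant = [0.5] * len(outcomes)
uniform = [random.random() for _ in outcomes]

print(brier_score(constant, outcomes))  # exactly 0.25 for any set of outcomes
print(brier_score(uniform, outcomes))   # approaches 1/3 as the number of claims grows
```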

For Round 2, we are making the Round 1 replication outcomes available as additional information to support training models that generate better predictions. We have also released 130 new claims for which teams will generate replicability predictions.

Opportunity to Participate in Round 2

New teams are welcome to join for Round 2. We awarded prizes to the top three teams in Round 1. For Round 2, the first place team will receive $15,000; the second place team $12,000; and the third place team $6,750. If you would like to participate, please fill out this interest form. Teams will have until January 7, 2026 to submit confidence scores for Round 2.
