Predicting Replicability Challenge: Round 2 Results

Written by Andrew Tyner | Mar 10, 2026 12:00:01 PM

Growth trends are producing a “strain on scientific publishing” (Hanson et al., 2023). Annual scientific output has risen from just under 4 million publications in 2000 to over 10 million in 2024, while the reviewer pool has not kept pace, leaving a shortage of evaluation capacity. The recent explosion of AI-assisted and fully AI-generated research threatens to overwhelm journals’ ability to review submissions in a timely and sustainable way, such that evaluation, rather than production, will become the limiting factor in the growth of knowledge.

These trends underscore the need for new approaches to assessing research, and in particular the urgency of scalable methods that can meet the challenge of mounting scientific production. These concerns motivated the DARPA Systematizing Confidence in Open Research and Evidence (SCORE) program, and more recently, the Predicting Replicability Challenge, which was launched last year and recently concluded its second round. The program focused on replicability, just one component of a complex process of assessing the trustworthiness of research findings (Nosek et al., 2026). By focusing on replicability, the challenge provides a way to test whether automated methods can support faster, more scalable assessment of research claims in a rapidly expanding research landscape.

Participating teams are provided a set of research claims and accompanying metadata and are tasked with assigning each claim a confidence score between 0 and 1 corresponding to the claim’s likelihood of being successfully replicated in a new sample of data. In the first round, 10 teams assessed 132 claims, with no team outperforming a baseline Brier score of .25. The .25 benchmark is notable because it is the score earned by a baseline model that simply assigns a confidence score of 0.50 to every claim, conveying no information about the claims at all.
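To make that baseline concrete: the Brier score is the mean squared difference between the confidence scores and the binary replication outcomes. A minimal sketch in Python, using hypothetical outcomes purely for illustration, shows why the constant-0.50 model scores exactly .25 no matter what the claims are:

```python
import numpy as np

def brier_score(predictions, outcomes):
    """Mean squared difference between confidence scores (0-1) and
    binary replication outcomes (1 = replicated, 0 = did not replicate)."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predictions - outcomes) ** 2))

# Hypothetical replication outcomes, for illustration only.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]

# The no-information baseline predicts 0.50 for every claim, so each claim
# contributes (0.5 - outcome)^2 = 0.25 regardless of the outcome.
baseline = [0.5] * len(outcomes)
print(brier_score(baseline, outcomes))  # 0.25
```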

The replication outcomes of the first-round claims were made available for use in training in the second round. In the second round, 15 teams assessed 130 claims, with 8 teams returning from the first round and 7 new teams joining the competition. All of the teams outperformed the baseline Brier score in the second round. The figure below shows the score for each participating team, with colored lines connecting the scores of the eight teams that participated in both rounds. The baseline Brier score is indicated by the horizontal line; lower Brier scores indicate better performance.

That each of the top models achieved a Brier score below the baseline indicates that teams in the second round picked up on some signal for predicting replicability that no team captured in the first round.


Where We Saw Improvement

The analyses below describe which components of predictive performance improved across rounds. The Brier score is composed of three elements: calibration, resolution, and uncertainty. Uncertainty is a measure of the difficulty of the task, and it increases the closer the replication rate gets to 50%. The replication rates across the two rounds were similar (52% and 55%), suggesting that task difficulty is unlikely to explain the difference in performance.
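For readers who want to reproduce this breakdown, here is a minimal sketch of the standard (Murphy) decomposition, assuming predictions are grouped into ten equal-width bins; the claim-level data themselves are not included here. Note that with replication rates of 52% and 55%, the uncertainty term works out to roughly .25 in both rounds.

```python
import numpy as np

def brier_decomposition(predictions, outcomes, n_bins=10):
    """Murphy decomposition: Brier ≈ calibration - resolution + uncertainty.
    Calibration penalizes bins whose mean prediction strays from the bin's
    observed replication rate, resolution rewards bins whose replication rate
    differs from the overall base rate, and uncertainty depends only on the
    base rate (it peaks at .25 when the base rate is 50%)."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    bins = np.clip((predictions * n_bins).astype(int), 0, n_bins - 1)
    calibration, resolution = 0.0, 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                  # fraction of claims in this bin
        mean_pred = predictions[mask].mean()  # average confidence score in the bin
        obs_rate = outcomes[mask].mean()      # observed replication rate in the bin
        calibration += weight * (mean_pred - obs_rate) ** 2
        resolution += weight * (obs_rate - base_rate) ** 2
    return calibration, resolution, uncertainty
```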

Calibration measures how closely the observed replication rates match the predictions. The confidence score predictions can range from 0-1, with higher values corresponding to a higher predicted replication rate. In a well calibrated model, research claims that are assigned a confidence score of .2 should replicate approximately 20% of the time, claims that are assigned a .4 confidence score should replicate 40% of the time, and so on. We can get a sense of calibration across rounds in the figure below.


In this figure, the confidence scores of all claims from each team’s best-performing model are binned into equal-width segments. The gray bars indicate the expected replication rate for the claims in each bin. Confidence scores between 0.6 and 0.7, for example, should replicate approximately 65% of the time, which is reflected in the height of that particular bar. The actual replication rates of the claims in each bin are plotted on top of the gray bars, with the red bars indicating Round 1 claims and the blue bars indicating Round 2 claims. Two takeaways are apparent from this figure. First, the observed replication rates in the first round bear essentially no resemblance to the predictions for each bin, and indeed they hardly vary at all, since each bin contains claims that replicated approximately 50% of the time. Second, the replication rates of the second-round claims, reflected in the blue bars, clearly correspond with the confidence score bins and come reasonably close to matching the expected values. This isn’t perfect calibration by any means – for example, the replication rates in the first three bins overshoot their expected values of 5%, 15%, and 25%, and the replication rate in the top bin falls clearly short of 95%. Still, the improvement in calibration is notable.
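The bar heights in a figure like this can be reconstructed directly from the predictions and outcomes. A short sketch, assuming the same ten equal-width bins as above (the claim-level data are not distributed with this post):

```python
import numpy as np

def reliability_bins(predictions, outcomes, n_bins=10):
    """For each confidence-score bin, pair the expected replication rate
    (the bin midpoint, e.g. 0.65 for the 0.6-0.7 bin) with the observed
    replication rate of the claims that fall in that bin."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((predictions * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            expected = (b + 0.5) / n_bins     # height of the gray bar
            observed = outcomes[mask].mean()  # height of the red/blue bar
            rows.append((expected, observed, int(mask.sum())))
    return rows
```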

We also investigate whether improvements in resolution could be contributing to the gains in predictive performance. Resolution assesses whether predictions co-vary with the replication rates. Some evidence for this appears in the figure above. The minimal variation in observed replication rates across confidence score bins suggests low resolution in the first round, while the steadily increasing replication rates across bins in the second round suggest improved resolution.

We can dig deeper by measuring how sharply predictions distinguish the replicated claims from the claims that did not replicate, i.e., discrimination. The figure below plots the confidence score predictions for all claims from each team’s top-performing model (N = 132 claims in Round 1, N = 130 claims in Round 2). These predictions are separated by round and by whether the claim was successfully replicated or not.


The distribution of predictions in the first round is essentially equivalent whether claims replicated successfully or not: both are centered around 0.5 and have a heavier tail on the right side than on the left. By contrast, the predictions in Round 2 more sharply distinguish claims by replication outcome. Here the predictions for claims that did not replicate remain centered around 0.5, with a heavier tail on the left side, indicating lower confidence in replication success. The claims that replicated successfully are centered closer to 0.6, with a heavier tail on the right side, indicating higher confidence in replication success. In short, the confidence score predictions in the second round do a better job of discriminating successful from unsuccessful replication outcomes.
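One simple way to quantify this discrimination, beyond inspecting the overlap of the two distributions, is the probability that a randomly chosen replicated claim receives a higher confidence score than a randomly chosen non-replicated claim (equivalent to the area under the ROC curve). This is a sketch for interested readers, not part of the challenge’s official scoring:

```python
import numpy as np

def discrimination_auc(predictions, outcomes):
    """Probability that a randomly chosen replicated claim receives a higher
    confidence score than a randomly chosen claim that failed to replicate
    (ties count as half). 0.5 means no discrimination; 1.0 is perfect."""
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    pos = predictions[outcomes == 1]   # scores for claims that replicated
    neg = predictions[outcomes == 0]   # scores for claims that did not
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```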

Both figures underscore room for improvement. The bulk of the confidence scores in Round 2 still overlaps across the two categories, and claims that failed to replicate receive confidence scores that are too high given the empirical results.

Looking Ahead to Round 3

We welcome new and existing teams to participate in the third and final round of the Predicting Replicability Challenge, which will launch on April 22 with a new set of research claims. We anticipate a smaller test set – most likely 40-55 claims. Dipping below the .25 baseline was an important benchmark for the second round, but for automated prediction to represent a viable solution to the challenge of research evaluation, we need to see continuing improvement, with models consistently achieving Brier scores below .20, representing a 20% improvement in skill relative to the naive baseline.
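For reference, that 20% figure follows from the Brier skill score, which expresses a model’s fractional improvement over a reference forecast; the function below is our shorthand for the arithmetic, not an official challenge metric:

```python
def brier_skill_score(model_brier, baseline_brier=0.25):
    """Fractional improvement over the no-information baseline:
    1 - BS_model / BS_baseline. Zero means no skill; 1 is a perfect score."""
    return 1.0 - model_brier / baseline_brier

print(brier_skill_score(0.20))  # ~0.20, i.e. a 20% improvement over the .25 baseline
```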

If you would like to participate, please fill out this interest form. The first place team will receive $7,500, the second place team will receive $6,000, and the third place team will receive $3,375.