At the Center for Open Science (COS), our work is about making research more transparent, rigorous, and verifiable. As AI tools enter the research workflow, we need evidence about what they actually contribute to scientific credibility. With support from Open Philanthropy, and in collaboration with researchers at the Pennsylvania State University, Old Dominion University, and the University of Notre Dame, we are building benchmarks to test whether large language model (LLM)–based systems can effectively evaluate, replicate, and conduct scientific research.
In this project, we will evaluate these LLM-based systems on consequential, real-world scientific tasks and compare their judgments to human baselines. Open Philanthropy’s support enables us to translate those results into rigorous, transparent standards for evaluating LLM performance in research workflows, advancing both AI safety and open science practice.
Our current benchmark effort focuses on replication, a central test of whether LLMs can confirm findings in new studies with different data and anticipate outcomes that have not yet been observed. This work builds directly on COS's Defense Advanced Research Projects Agency (DARPA)-funded Systematizing Confidence in Open Research and Evidence (SCORE) project, which established methodologies for assessing research credibility. We are extending that framework to systematically test AI agents in scientific research contexts, also evaluating their performance on reproducibility (obtaining the same results from the same data and analysis) and robustness (whether results hold under reasonable analytical variations). These assessments mirror the daily challenges researchers face in evaluating the credibility of scientific work.
To establish a human baseline, expert teams will assess a subset of the research papers, while LLM agents will evaluate the full sample. The human evaluation teams will be blind to which studies are undergoing replication, robustness, and reproducibility tests conducted by other researchers. By benchmarking LLMs in this way, COS and our collaborators aim to advance understanding of both LLM capabilities and the assessment of scientific research reliability.
Read more about the project here.