Benchmarking LLM Agents on Scientific Tasks

Toward a rigorous, extensible framework for evaluating AI as a scientific collaborator

Overview

As large language models (LLMs) rapidly advance, their potential to assist or even independently conduct scientific research has captured wide attention. From summarizing literature to generating code, LLMs already perform parts of the research process. Yet their credibility, validity, and replicability as scientific contributors remain largely untested.

The Benchmarking LLM Agents on Scientific Tasks project, led by the Center for Open Science (COS) with support from Open Philanthropy, is a multi-year, multi-team initiative to systematically evaluate how well LLM agents can perform and reason through the scientific research lifecycle.

A Multi-Persona Framework for AI in Science

The project defines a structured framework that benchmarks LLM agents according to three core researcher personas, each representing distinct capabilities and roles within the research ecosystem:

Replicate (Replicator): The agent’s ability to replicate published studies using shared data and code, as well as new data, to test whether the original results hold.
Evaluate (Peer Reviewer): The agent’s ability to critically evaluate completed research, identifying methodological strengths and weaknesses, assessing claims, and estimating replicability.
Generate (Discovery Scientist): The agent’s ability to propose novel research questions, design studies, conduct analyses, and interpret findings with scientific rigor.

Each persona is mapped onto stages of the research lifecycle—design, conduct, analyze, and interpret—producing a scalable matrix of tasks that can be used to benchmark progress in increasingly complex and autonomous forms of scientific reasoning.
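As a rough illustration of the persona-by-stage matrix described above, the grid can be sketched as a small data structure. This is a minimal sketch, not the project's actual schema: the names `PERSONAS`, `STAGES`, `build_task_matrix`, and the task identifier are hypothetical and chosen only for illustration.

```python
from itertools import product

# Persona and stage labels are taken from the text; everything else
# (function name, dict layout, task ID) is a hypothetical illustration.
PERSONAS = ("Replicate", "Evaluate", "Generate")
STAGES = ("design", "conduct", "analyze", "interpret")


def build_task_matrix(personas=PERSONAS, stages=STAGES):
    """Return one empty task list per (persona, stage) cell."""
    return {(persona, stage): [] for persona, stage in product(personas, stages)}


matrix = build_task_matrix()

# Register a made-up task in one cell of the grid.
matrix[("Replicate", "design")].append("hypothetical-task-001")

# Print how many tasks each cell currently holds.
for (persona, stage), tasks in matrix.items():
    print(f"{persona:<10} {stage:<10} {len(tasks)} task(s)")
```

Keeping each cell as an open-ended list reflects the expandability the framework aims for: new task layers can be appended to a cell without changing the shape of the grid.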

The framework is designed to be expandable: as model capabilities evolve, new task layers can test higher-order reasoning, methodological generalization, and the ability to autonomously connect multiple stages of research into an integrated workflow.

Milestone: ReplicatorBench

The first active phase of this initiative has produced ReplicatorBench, a benchmark for evaluating LLM agents on research replication in the social and behavioral sciences. What sets ReplicatorBench apart from prior work is its focus on the full replication process, not just the final outcome.

Most existing benchmarks evaluate whether an agent arrives at the right answer, but science is not just about outcomes. ReplicatorBench requires agents to first preregister their replication plan, then execute their analysis according to that plan, mirroring the standards of rigorous human research. It assesses agents across three stages: extracting and retrieving the right information, designing and executing the analysis, and interpreting results against the original claims. It also includes both replicable and non-replicable studies, so agents must demonstrate they can recognize when a result does not hold, not just reproduce one that does.

Agents are evaluated not only on the accuracy of their outputs but also on their reasoning trace, adherence to their preregistered plan, and ability to handle ambiguity and missing information, features that reflect real-world challenges in scientific rigor.

The paper, ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences, has been accepted to the inaugural AI for Sciences track at the Knowledge Discovery and Data Mining (KDD) 2026 conference (August 9–13, Jeju, Korea) and is currently in production. A preprint is available on arXiv. All code and data are publicly available.

The project is developed collaboratively with partners at Pennsylvania State University, Old Dominion University, and the University of Notre Dame. Each partner contributes to different dimensions of the framework: task design, agent development, evaluation methods, and theoretical modeling of AI scientific behavior.

ReplicatorBench has also been accepted into the Holistic Agent Leaderboard (HAL), a rigorous evaluation infrastructure developed at Princeton to address inconsistencies in how benchmarks and agents are assessed across the field. HAL does not automatically include benchmarks; it maintains a curated set vetted for methodological quality, currently spanning coding, science, and web navigation tasks. Acceptance means ReplicatorBench is now part of HAL’s cost-controlled, standardized evaluation harness and meets the bar the field is converging around for how LLM benchmarks and agents should be tested. In practical terms, this means ReplicatorBench is among the active infrastructure that AI developers and researchers use to test and compare their agents, so the benchmark will continue generating new data on model capabilities as the field evolves, rather than becoming a static, one-time project.

Current Focus: Robustness Benchmark

Building on ReplicatorBench, the team is now developing a Robustness benchmark. Replication asks whether a result holds when tested with new data, but a separate and equally important question is whether results hold when the same data are analyzed in different, equally justifiable ways. Researchers routinely face genuine degrees of freedom in how they operationalize variables, preprocess data, and select analytical approaches, and reasonable choices at each of these decision points can lead to meaningfully different conclusions.

The Robustness benchmark evaluates whether LLM agents navigate this analytical landscape in ways that are consistent, justified, and transparent, or whether their outputs are as variable as those of human analysts working on the same task. Results and protocols will be released openly as the benchmark matures.
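To make the idea of analytic degrees of freedom concrete, the sketch below runs one synthetic dataset through every combination of two defensible choices (an outlier rule and an outcome transform) and reports the resulting group difference for each. This is a multiverse-style toy example under stated assumptions, not the benchmark's protocol: the data are simulated, and all names (`make_synthetic_data`, `OUTLIER_RULES`, `TRANSFORMS`, `run_multiverse`) are hypothetical.

```python
import math
import random
import statistics
from itertools import product


def make_synthetic_data(n=200, seed=0):
    """Simulate (group, outcome) rows; purely illustrative, not real study data."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        group = i % 2  # 0 = control, 1 = treated
        outcome = rng.lognormvariate(0.2 * group, 0.5)  # always positive
        rows.append((group, outcome))
    return rows


def drop_outliers(rows, z_cutoff):
    """Keep rows within z_cutoff standard deviations of the mean (None keeps all)."""
    if z_cutoff is None:
        return rows
    values = [y for _, y in rows]
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(g, y) for g, y in rows if abs(y - mean) <= z_cutoff * sd]


# Two decision points, each with two defensible options (assumed for illustration).
OUTLIER_RULES = {"keep_all": None, "trim_2sd": 2.0}
TRANSFORMS = {"raw": lambda y: y, "log": math.log}


def group_difference(rows, transform):
    """Difference in mean transformed outcome, treated minus control."""
    treated = [transform(y) for g, y in rows if g == 1]
    control = [transform(y) for g, y in rows if g == 0]
    return statistics.mean(treated) - statistics.mean(control)


def run_multiverse(rows):
    """Estimate the group difference under every combination of choices."""
    results = {}
    for (rule, cutoff), (name, fn) in product(OUTLIER_RULES.items(), TRANSFORMS.items()):
        results[(rule, name)] = group_difference(drop_outliers(rows, cutoff), fn)
    return results


results = run_multiverse(make_synthetic_data())
for spec, estimate in sorted(results.items()):
    print(spec, round(estimate, 3))
```

The spread of estimates across the four specifications is the kind of variability the paragraph above describes. Estimates on the raw and log scales are not directly comparable in magnitude, which is itself one reason a justification for each choice matters.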

Future Directions

Subsequent phases will extend the framework to the Evaluate and Generate benchmarks:

Evaluate: Assessing LLM agents’ ability to function as reviewers within open peer-review ecosystems, including alignment with expert judgment.
Generate: Testing LLMs’ capacity to autonomously design and conduct novel studies, grounded in preregistered planning and ethical guidelines.

Together, these benchmarks will create a continuous, community-driven infrastructure for testing the limits of AI as a scientific collaborator, identifying both capabilities and failure modes, and guiding responsible deployment in research.

Participation and Collaboration

The project is actively expanding and welcomes collaboration, including contributions of annotations, evaluation rubrics, or replication studies. Researchers and organizations interested in learning more or contributing in the future can stay updated through the COS website and project communications.

Please contact Tim Errington (tim@cos.io) or Shakhlo Nematova (shakhlo@cos.io) for more information.