Benchmarking LLM Agents on Scientific Tasks

Toward a rigorous, extensible framework for evaluating AI as a scientific collaborator

Overview

As large language models (LLMs) rapidly advance, their potential to assist or even independently conduct scientific research has captured wide attention. From summarizing literature to generating code, LLMs already perform parts of the research process. Yet their credibility, validity, and replicability as scientific contributors remain largely untested.

The Benchmarking LLM Agents on Scientific Tasks project, led by the Center for Open Science (COS) with support from Open Philanthropy, is a multi-year, multi-team initiative to systematically evaluate how well LLM agents can perform and reason through the scientific research lifecycle.

A Multi-Persona Framework for AI in Science

The project defines a structured framework that benchmarks LLM agents according to three core researcher personas, each representing distinct capabilities and roles within the research ecosystem:

  • Replicate (Replicator): The agent’s ability to replicate published studies, drawing on shared data and code as well as newly collected data, to test whether the original results hold.
  • Evaluate (Peer Reviewer): The agent’s ability to critically evaluate completed research, identifying methodological strengths and weaknesses, assessing claims, and estimating replicability.
  • Generate (Discovery Scientist): The agent’s ability to propose novel research questions, design studies, conduct analyses, and interpret findings with scientific rigor.

Each persona is mapped onto stages of the research lifecycle—design, conduct, analyze, and interpret—producing a scalable matrix of tasks that can be used to benchmark progress in increasingly complex and autonomous forms of scientific reasoning.
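
The sketch below illustrates one way such a persona-by-stage matrix could be represented in code. It is a minimal illustration only: the persona and stage labels come from the description above, but the data structure, function names, and example task identifier are assumptions rather than the project’s actual task schema.

```python
from itertools import product

# Labels taken from the framework description; everything else here is an
# illustrative assumption, not the project's actual task schema.
PERSONAS = ["replicate", "evaluate", "generate"]
STAGES = ["design", "conduct", "analyze", "interpret"]

# Each (persona, stage) cell is a slot for one or more benchmark tasks.
task_matrix = {cell: [] for cell in product(PERSONAS, STAGES)}

def register_task(persona: str, stage: str, task_id: str) -> None:
    """Attach a benchmark task to one cell of the persona-by-stage matrix."""
    task_matrix[(persona, stage)].append(task_id)

# Example: a hypothetical task asking the Replicator persona to re-run a
# published analysis on replication data.
register_task("replicate", "analyze", "rep-analysis-001")
```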

The framework is designed to be expandable: as model capabilities evolve, new task layers can test higher-order reasoning, methodological generalization, and the ability to autonomously connect multiple stages of research into an integrated workflow.

Current Focus: The Replication Benchmark

The first active phase of this initiative centers on the Replication benchmark, which examines whether LLM agents can accurately and transparently replicate published scientific results. Replication is a cornerstone of credible science and a natural proving ground for assessing whether AI can handle multi-step reasoning in real research contexts.

In this benchmark, LLM agents are tasked with performing end-to-end replication workflows that mirror the process followed by human replicators, as sketched in the example after this list:

  1. Extracting hypotheses and design details from published papers.
  2. Generating a preregistration-style replication plan in a structured format.
  3. Adapting analytic code to replication datasets and executing analyses.
  4. Interpreting and documenting results against the original claims.
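
A minimal sketch of how these four steps might be chained into a single workflow is shown below, assuming a hypothetical agent object; all function and field names are illustrative and do not correspond to a published interface.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationRun:
    """Container for the artifacts produced at each step (illustrative only)."""
    paper_id: str
    hypotheses: list = field(default_factory=list)
    prereg_plan: dict = field(default_factory=dict)
    analysis_output: dict = field(default_factory=dict)
    verdict: str = ""

def run_replication(agent, paper_id: str, dataset_path: str) -> ReplicationRun:
    """Chain the four benchmark steps into one end-to-end replication workflow."""
    run = ReplicationRun(paper_id=paper_id)
    # 1. Extract hypotheses and design details from the published paper.
    run.hypotheses = agent.extract_hypotheses(paper_id)
    # 2. Generate a preregistration-style replication plan in a structured format.
    run.prereg_plan = agent.draft_prereg_plan(run.hypotheses)
    # 3. Adapt analytic code to the replication dataset and execute the analysis.
    run.analysis_output = agent.adapt_and_run_analysis(run.prereg_plan, dataset_path)
    # 4. Interpret and document results against the original claims.
    run.verdict = agent.interpret_results(run.analysis_output, run.hypotheses)
    return run
```

Keeping each step’s output as a separate artifact makes it possible to score an agent stage by stage rather than only on its final verdict.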

Agents are evaluated not only on the accuracy of their outputs but also on their reasoning traces, adherence to analytic plans, and ability to handle ambiguity and missing information, all of which reflect real-world challenges to scientific rigor.
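
One way to make those evaluation dimensions concrete is a simple rubric object like the one below. The dimension names mirror the criteria described above; the 0-to-1 scale and the weights are assumptions for illustration, not the project’s actual scoring protocol.

```python
from dataclasses import dataclass

@dataclass
class ReplicationScore:
    output_accuracy: float     # did the reported results match the target values?
    reasoning_trace: float     # is the agent's step-by-step reasoning sound?
    plan_adherence: float      # did the analysis follow the preregistered plan?
    ambiguity_handling: float  # were gaps and missing details handled sensibly?

    def overall(self, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
        """Weighted aggregate score; the weighting here is purely illustrative."""
        parts = (self.output_accuracy, self.reasoning_trace,
                 self.plan_adherence, self.ambiguity_handling)
        return sum(w * p for w, p in zip(weights, parts))
```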

The project is being developed collaboratively with partners at Pennsylvania State University, Old Dominion University, and the University of Notre Dame. Each partner contributes to different dimensions of the framework: task design, agent development, evaluation methods, and theoretical modeling of AI scientific behavior.

To ensure transparency and scalability, the project will release its task schemas, evaluation protocols, and results, inviting replication, re-analysis, and community-driven feedback. 

Future Directions

Subsequent phases will extend the framework to the Evaluate and Generate benchmarks:

  • Evaluate: Assessing LLM agents’ ability to function as reviewers within open peer-review ecosystems, including alignment with expert judgment.
  • Generate: Testing LLMs’ capacity to autonomously design and conduct novel studies, grounded in preregistered planning and ethical guidelines.

Together, these benchmarks will create a continuous, community-driven infrastructure for testing the limits of AI as a scientific collaborator, identifying both capabilities and failure modes, and guiding responsible deployment in research.

Participation and Collaboration

The project is currently in its development and internal testing phase. Broader opportunities for collaboration, such as contributing annotations, evaluation rubrics, or replication studies, will be announced in upcoming phases.

Researchers and organizations interested in learning more or contributing in the future can stay updated through the COS website and project communications.

Please contact Tim Errington (tim@cos.io) or Shakhlo Nematova (shakhlo@cos.io) for more information.