Toward a rigorous, extensible framework for evaluating AI as a scientific collaborator
As large language models (LLMs) rapidly advance, their potential to assist or even independently conduct scientific research has captured wide attention. From summarizing literature to generating code, LLMs already perform parts of the research process. Yet their credibility, validity, and replicability as scientific contributors remain largely untested.
The Benchmarking LLM Agents on Scientific Tasks project, led by the Center for Open Science (COS) with support from Open Philanthropy, is a multi-year, multi-team initiative to systematically evaluate how well LLM agents can perform and reason through tasks across the scientific research lifecycle.
The project defines a structured framework that benchmarks LLM agents according to three core researcher personas, each representing distinct capabilities and roles within the research ecosystem.
Each persona is mapped onto stages of the research lifecycle—design, conduct, analyze, and interpret—producing a scalable matrix of tasks that can be used to benchmark progress in increasingly complex and autonomous forms of scientific reasoning.
The framework is designed to be expandable: as model capabilities evolve, new task layers can test higher-order reasoning, methodological generalization, and the ability to autonomously connect multiple stages of research into an integrated workflow.
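Neither the persona labels nor the concrete task set are spelled out here, so the following is only an illustrative sketch of what a persona-by-stage task matrix could look like in Python. The persona names, the `BenchmarkTask` fields, and the placeholder descriptions are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical labels for illustration only; the project's actual persona
# and stage definitions may differ.
PERSONAS = ["replicator", "evaluator", "generator"]
STAGES = ["design", "conduct", "analyze", "interpret"]


@dataclass
class BenchmarkTask:
    persona: str      # which researcher role the agent is playing
    stage: str        # which stage of the research lifecycle the task targets
    description: str  # human-readable statement of what the agent must do


def build_task_matrix() -> list[BenchmarkTask]:
    """Enumerate one placeholder task per persona-stage cell of the matrix."""
    return [
        BenchmarkTask(p, s, f"Placeholder task for the {p} persona at the {s} stage")
        for p, s in product(PERSONAS, STAGES)
    ]


if __name__ == "__main__":
    for task in build_task_matrix():
        print(task.persona, task.stage)
```

In practice, each cell of such a matrix would hold many tasks drawn from real studies rather than a single placeholder, and new rows or columns could be added as model capabilities evolve.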
The first active phase of this initiative centers on the Replication benchmark, which examines whether LLM agents can accurately and transparently replicate published scientific results. Replication is a cornerstone of credible science and a natural proving ground for assessing whether AI can handle multi-step reasoning in real research contexts.
In this benchmark, LLM agents are tasked with performing end-to-end replication workflows that mirror the process followed by human replicators.
Agents are evaluated not only on the accuracy of their outputs but also on their reasoning traces, adherence to analytic plans, and ability to handle ambiguity and missing information. These criteria reflect the real-world demands of scientific rigor.
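The evaluation dimensions named above (output accuracy, reasoning trace, plan adherence, handling of ambiguity) suggest a simple per-attempt scoring record. The sketch below is purely illustrative: the `ReplicationScore` structure, its field names, and the equal-weight aggregate are assumptions; the project's actual rubric and weighting are not described here.

```python
from dataclasses import dataclass


@dataclass
class ReplicationScore:
    """Hypothetical scores for one agent attempt, one field per evaluation dimension."""
    result_accuracy: float          # did the agent reproduce the reported result?
    reasoning_trace_quality: float  # is the step-by-step reasoning coherent and auditable?
    plan_adherence: float           # did the agent follow the pre-specified analytic plan?
    ambiguity_handling: float       # how well did it manage missing or ambiguous information?

    def overall(self, weights: dict[str, float] | None = None) -> float:
        """Weighted average of the four dimensions; equal weights by default."""
        values = {
            "result_accuracy": self.result_accuracy,
            "reasoning_trace_quality": self.reasoning_trace_quality,
            "plan_adherence": self.plan_adherence,
            "ambiguity_handling": self.ambiguity_handling,
        }
        weights = weights or {name: 1.0 for name in values}
        total_weight = sum(weights.values())
        return sum(values[name] * weights.get(name, 0.0) for name in values) / total_weight
```

Keeping the dimensions separate, rather than collapsing them into a single number up front, keeps failure modes visible: an agent can reproduce a result numerically while still deviating from the analytic plan or glossing over missing information.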
The project is being developed collaboratively with partners at Pennsylvania State University, Old Dominion University, and the University of Notre Dame. Each partner contributes to different dimensions of the framework: task design, agent development, evaluation methods, and theoretical modeling of AI scientific behavior.
To ensure transparency and scalability, the project will release its task schemas, evaluation protocols, and results, inviting replication, re-analysis, and community-driven feedback.
Subsequent phases will extend the framework to the Evaluate and Generate benchmarks.
Together, these benchmarks will create a continuous, community-driven infrastructure for testing the limits of AI as a scientific collaborator, identifying both capabilities and failure modes, and guiding responsible deployment in research.
The project is currently in its development and internal testing phase. Broader opportunities for collaboration, such as contributing annotations, evaluation rubrics, or replication studies, will be announced in upcoming phases.
Researchers and organizations interested in learning more or contributing in the future can stay updated through the COS website and project communications.
Please contact Tim Errington (tim@cos.io) or Shakhlo Nematova (shakhlo@cos.io) for more information.
