Research evaluates scientific ideas.
We evaluate research.

We are always interested in how research is conducted so we can help make it better. What contributes to reproducibility, or failure to reproduce? What best practices can we develop through evaluation that might increase the efficiency of scientific research? Our goal is to investigate and reveal those insights. Below are projects we have been working on.

Benchmarking LLM Agents on Scientific Tasks

This multi-year initiative, led by the Center for Open Science with support from Open Philanthropy, develops a rigorous framework for evaluating how large language model (LLM) agents perform core elements of the scientific process. By benchmarking AI's ability to replicate, evaluate, and generate research, the project explores whether LLMs can function as credible, transparent, and reliable scientific collaborators.

Learn more

Replicability Project: Health Behavior (RP:HB)

COS has launched the Replicability Project: Health Behavior (RP:HB), a large-scale, multi-team effort to replicate a diverse sample of published quantitative studies to help support a transparent and trustworthy foundation in health research. The project aims to assess the robustness of findings and promote best practices in research transparency, methodological rigor, and reproducibility.

Learn More

Predicting Replicability Challenge

Assessing the validity and trustworthiness of research claims is a critical and labor-intensive part of the scientific process. Initiated by COS, the Predicting Replicability Challenge is a public competition that aims to advance the automated assessment of research claims. The challenge invites teams to develop algorithmic approaches that predict the likelihood of claims being successfully replicated. Open to participants from academic and non-academic spaces worldwide.

Learn More

Instagram Data Access Pilot for Well-being Research

Using innovative methods from the open science movement to promote rigor and transparency of research, Meta and COS are piloting a new approach to industry-academia partnerships for accessing social media data.

Learn More

Lifecycle Journal

Lifecycle Journal aims to put publishing and evaluation in the control of the scholarly community itself. As a 3-year research and development project initiated by COS, the ultimate goal is to transform the vision and value of a journal as an effective facilitator of knowledge production and self-correction while aligning the rewards of publishing with transparency and rigor.

Learn More

Open Scholarship Survey

The Open Scholarship Survey (OSS) is a standard, modular survey that assesses researcher norms about open scholarship practices, blockers to adoption of open scholarship, and researchers’ perceptions about their fields and the credibility of the literature.

Learn More

SMART: Scaling Machine Assessments of Research Trustworthiness

Through a grant from the Robert Wood Johnson Foundation, COS, in partnership with researchers at the University of Melbourne and Pennsylvania State University, have begun the SMART project, which seeks to advance the development of automated confidence evaluation of research claims. SMART will extend the research work initiated by the SCORE program through conducting user research and generating additional data to improve the algorithm and human assessment approaches developed during the program.

Learn More

SCORE: Systematizing Confidence in Open Research and Evidence

There is still much to learn about reproducibility across business, economics, education, political science, psychology, sociology, and other areas of social-behavioral sciences. In order to better assess and predict replicability of social-behavioral science findings, the Center for Open Science, in partnership with Defense Advanced Research Projects Agency (DARPA), is working to help advance this understanding.

Project Overview

Opening Collaboration for Large-Scale Study on Registered Revisions

COS is looking to partner with journals in a semi-centralized meta-RCT on Registered Revisions. Registered Revisions are an in-peer-review device that occurs when reviewers request additional data/analysis. Authors pre-register the methods that will be used to address these requests, and editors and reviewers make their acceptance decision on the basis of this protocol, regardless of the results.

We are conducting an experimental collaborative project in which COS provides a boilerplate study design for journal partners to carry out and publish their own experiments, generating many individual studies under a prospective, living meta-analysis.

Learn More

Measuring the Impact of Registered Reports Using Early Access to Global Flourishing Study Data

Publication pre-commitment devices such as preregistration and Registered Reports have been developed as methods to reduce publication biases, prepublication biases and other questionable research practices. Whether they achieve these ends and whether there are unintended consequences of these interventions remain areas of active research. COS will contribute to this growing body of evidence by conducting a randomized trial of Registered Reports in conjunction with the controlled data release of Global Flourishing Study to examine the impact of preregistration and registered reports on the research pipeline.

Learn More

Reproducibility Project: Cancer Biology (RP:CB)

The RP:CB is an initiative to conduct direct replications of 50 high-impact cancer biology studies. The project anticipates learning more about predictors of reproducibility, common obstacles to conducting replications, and how the current scientific incentive structure affects research practices by estimating the rate of reproducibility in a sample of published cancer biology literature. The RP:CB is a collaborative effort between the Center for Open Science and network provider Science Exchange. Are you interested in becoming a panel member to review the reproducibility of these studies?

Project Overview

Reproducibility Project: Psychology (RP:P)

The RP:P was a collaborative community effort to replicate published psychology experiments from three important journals. Replication teams follow a standard protocol to maximize consistency and quality across replications, and the accumulated data, materials and workflow are to be open for critical review on OSF. One hundred replications were completed.

See the Project

Research Quality of Registered Reports Compared to the Traditional Publishing Model

More than 350 researchers peer reviewed a pair of papers from 29 published Registered Reports and 57 non-RR comparison papers. RRs outperformed comparison papers on all 19 criteria (mean difference=.46) with effects ranging from little difference in novelty (0.13) and creativity (0.22) to substantial differences in rigor of methodology (0.99) and analysis (0.97) and overall paper quality (0.66). RRs could improve research quality while reducing publication bias and ultimately improve the credibility of the published literature.

See the Paper

Credibility of preprints: an interdisciplinary survey of researchers

Preprints increase accessibility and can speed scholarly communication if researchers view them as credible enough to read and use. Preprint services do not provide the heuristic cues of a journal's reputation, selection, and peer-review processes that, regardless of their flaws, are often used as a guide for deciding what to read. We conducted a survey of 3759 researchers across a wide range of disciplines to determine the importance of different cues for assessing the credibility of individual preprints and preprint services.

See the Paper

Collaborative Replications and Education Project (CREP)

The Collaborative Replications and Education Project facilitates student research training through conducting replications. The community-led team composed a list of studies that could be replicated as part of research methods courses, independent studies, or bachelor theses. Replication teams are encouraged to submit their results to an information commons for aggregation for potential publication. This integrates learning and substantive contribution to research.

See the Project

Crowdsourcing a Dataset

Crowdsourcing a dataset is a method of data analysis in which multiple independent analysts investigate the same research question on the same data set in whatever manner they consider to be best. This approach should be particularly useful for complex data sets in which a variety of analytic approaches could be used, and when dealing with controversial issues about which researchers and others have very different priors. This first crowdsourcing project establishes a protocol for independent simultaneous analysis of a single dataset by multiple teams, and resolution of the variation in analytic strategies and effect estimates among them. View the paper here.

See the Project

Badges to Acknowledge Open Practices

Openness is a core value of scientific practice. The sharing of research materials and data facilitates critique, extension, and application within the scientific community, yet current norms provide few incentives for researchers to share evidence underlying scientific claims. We demonstrate that badges are effective incentives that improve the openness, accessibility, and persistence of data and materials that underlie scientific research.

See the Paper

Many Labs I

Many Labs I project was a crowdsourced replication study in which the same 13 psychological effects were examined in 36 independent samples to examine variability in replicability across sample and setting.

Results

Variations in sample and setting had little impact on observed effect magnitudes
When there was variation in effect magnitude across samples, it occurred in studies with large effects, not studies with small effects
Replicability was much more dependent on the effect of study rather than the sample or setting in which it was studied
Replicability held even across lab-web and across nations
Two effects in a subdomain with substantial debate about reproducibility (flag and currency priming) showed no evidence of an effect in individual samples or in the aggregate.

See the Project

Many Labs II

Conducted in Fall of 2014, Many Labs II employed the same model as Many Labs I but with almost 30 effects, more than 100 laboratories, and including samples from more than 20 countries. The findings should be released in late-2017.

See the Project

Many Labs III

Many psychologists rely on undergraduate participant pools as their primary source of participants. Most participant pools are made up of undergraduate students taking introductory psychology courses over the course of a semester. Also conducted in Fall of 2014, Many Labs III systematically evaluated time-of-semester effects for 10 psychological effects across many participant pools. Twenty labs administered the same protocol across the academic semester. The aggregate data will provide evidence as to whether the time-of-semester moderates the detectability of effects.

See the Project

SCORE: Systematizing Confidence in Open Research and Evidence

Project Overview

Hero image: Photo by ThisisEngineering RAEng on Unsplash

Research evaluates scientific ideas.We evaluate research.

Benchmarking LLM Agents on Scientific Tasks

Replicability Project: Health Behavior (RP:HB)

Predicting Replicability Challenge

Instagram Data Access Pilot for Well-being Research

Lifecycle Journal

Open Scholarship Survey

SMART: Scaling Machine Assessments of Research Trustworthiness

SCORE: Systematizing Confidence in Open Research and Evidence

Opening Collaboration for Large-Scale Study on Registered Revisions

Measuring the Impact of Registered Reports Using Early Access to Global Flourishing Study Data

Reproducibility Project: Cancer Biology (RP:CB)

Reproducibility Project: Psychology (RP:P)

Research Quality of Registered Reports Compared to the Traditional Publishing Model

Credibility of preprints: an interdisciplinary survey of researchers

Collaborative Replications and Education Project (CREP)

Crowdsourcing a Dataset

Badges to Acknowledge Open Practices

Many Labs I

Many Labs II

Many Labs III

SCORE: Systematizing Confidence in Open Research and Evidence

Research evaluates scientific ideas.
We evaluate research.