A Critique of the Many Labs Projects

May 28th, 2019

From left to right: Rick Klein (leader of Many Labs 1, 2, and 4), Charlie Ebersole (leader of Many Labs 3 and 5), and Olivia Atherton (leader of Many Labs 3). 

The Many Labs projects provide an opportunity to investigate important questions regarding how we can conduct and interpret research. They also have some important limitations. Below, Charlie Ebersole discusses his experiences leading some of the Many Labs projects and what he sees as the most important limitations of those efforts.


Charlie Ebersole, University of Virginia

I’ve had the privilege of leading two of the five (to date) Many Labs projects. I’ve worked with wonderful teams of incredibly talented and engaged researchers. As I’ve said many times before, the people are the best part of this gig (“this gig” being research) and I believe that, in large part, because of my experience with Many Labs. I think these projects have given the fields of psychology and meta-science a lot of interesting and important insights and data.

They also have problems.

It’s trite to say that any research project has limitations (of course it does) but it’s also important to talk about those limitations. I’ve seen a solid few try their hand at critiquing the Many Labs projects in the past - some distinctly better than others. The thing that’s agitated me the most about those critiques, though, is not that they’ve happened. Critiques are welcome; critiques are important. It’s that the critiques to date often miss many of what I think are the biggest problems with these projects. I want the Many Labs projects to have an impact on how we look at research. However, I want that impact to be calibrated to the evidence that these efforts provide. That’s the best kind of impact.

So I thought I’d try my hand at providing what I thought were the best critiques of the Many Labs projects. Some disclaimers up top:

  • These are not all my original ideas. I'm sure I've seen some of these in some form or another in the various commentaries (published or otherwise) on the Many Labs projects or heard them in some engaging conversations I've had over the years. I've tried to credit folks where I remember; sorry to those I forgot. (One person deserves a general shout out though - Erin Westgate. Many of the best conversations I’ve had about these issues and others like them have been with Erin. I’m sure she’s influenced my thinking throughout this piece. She also gave great comments/suggestions on an earlier draft of this post.)

  • Putting this out as a kind of blog post seemed like the appropriate medium. Trying to publish something pointing out your own mistakes/shortcomings kinda seems like double-dipping.

  • I'm focusing on the limitations here. There are a ton of really good things about the Many Labs-style approach. You can read better-written (relative to this post) descriptions of those good things in papers like this one and this one.

  • I'm also focusing specifically on the Many Labs projects. Some of these may apply to other crowdsourced projects, but I want to speak about the projects I know best.

  • Finally, I'm speaking solely for myself. These are what I think are the best critiques of the Many Labs projects. I very well might be wrong about some subset of them (see Limitation 1).

Here, in no particular order, are what I see as the main limitations of the Many Labs projects:

Limitation 1 - The leadership
Each of these projects has had a point person who served as leader for the whole enterprise. The Many Labs projects undoubtedly could not have happened without a sizeable set of contributions from lots of other researchers. Nevertheless, they were organized with one person (mostly) in charge. This limits the vision of these projects.

So far, the five Many Labs projects have all been led by one of two plucky white dudes from the US Midwest studying in social cognition labs from the same academic family tree. (Rick is the handsome one.) That shared experience likely limits which studies are considered for inclusion in these projects. Although we’ve strived for breadth of topics, and solicited input from our broader teams, an undue amount of influence comes back to the project leader. So far that leadership has drawn on a shared and narrow exposure to the psychological literature.

This is a place where I think the Psychological Science Accelerator can really improve on past crowdsourced studies. The fact that they’re opening these kinds of projects up to different people is great for the field. And besides, the PSA is being led by a plucky white dude from Indiana, not Ohio or Pennsylvania. Totally different.

Limitation 2 - The pragmatics of study selection select for particular kinds of studies
This is one of the more common critiques I’ve seen and I think it’s one of the best. In these projects, we try to collect as many studies as we can in one experimental session. That biases toward short studies with simple designs. Online studies are the best. Trust me. For Many Labs 3, we tried to replicate a study that required participants to hold a clipboard that was either light or heavy in weight. Nothing makes you feel like you’re contributing to science quite like using your advisor’s Amazon account to spend $500 on clipboards to ship to 20 different universities while he quickly runs off to call his spouse before she thinks that someone has stolen their Amazon account (someone who apparently loves clipboards). The only comparable feeling is then having to make an instructional video so that you can instruct all of your collaborators on how to appropriately hand each clipboard to participants. (A bit of trivia, I ended up marrying my demo participant.) These selection pressures likely limit the generalizability of our conclusions, both in terms of the meta-scientific aims and the extent to which any single Many Labs project approximates the replicability of a given field.

This is another case where I think the Psychological Science Accelerator can be really impactful. More complicated studies take much more coordination, resources, and effort. The size of the PSA, both in terms of organizational/logistical support and the number of participating labs, might make these kinds of studies possible in a crowdsourcing framework.

Limitation 3 - Our projects incentivize picking studies that are more likely to replicate
There’s a narrative I’ve seen that suggests we pick studies for the Many Labs projects that we think won’t replicate. That’s wrong. In fact the opposite is true. We want these studies to replicate. Badly. There are two reasons for this (a good one and a bad one):

The Good Reason
The Many Labs projects have focused on trying to investigate meta-scientific questions - I’ve seen getting some information on the replicability of some effects as a bonus, but not the primary goal (but see Limitation 4). Many Labs 1 and 2 sought to estimate heterogeneity between collection sites. Many Labs 3 tried to investigate time-of-semester as a moderator of replicability. Many Labs 4 and 5 both studied the impact of expertise on replicability. To test any of these questions, you need effects that replicate. You cannot moderate that which does not exist. The best chance we have to test our questions of interest is to include lots of studies that will replicate. If they don’t, we’re hosed. It’s like running a study on moderators of a floor effect - you’re not going to find anything. For that reason, we try to pick things that we think might work.

The Bad Reason
Your life is so much easier when effects replicate. Peer review is a breeze for effects that pan out again. It’s a lot simpler to explain that outcome - just read the original paper. When an effect doesn’t replicate, though, then you have a mystery to solve. There are many more things to check and more analyses that reviewers want run. (Unless you make your data and scripts open and the reviewers look at them themselves - shoutout to you Hans IJzerman and Daniel Lakens.) I don’t think that’s a bad thing for reviewers to do; it’s frankly the logical response. Unexpected results spur more thought. They just also create more work. In Many Labs 3, just 3 of 10 effects replicated. The reviews for ML3 were twice as long as the actual paper.

Regardless of reason, this pressure to select studies we believe will replicate likely limits the Many Labs projects as an unbiased indicator of replicability of the field, overall. Combining across several of these kinds of projects probably gives a better estimate, but it’s good to be aware of this possible bias. If estimating replicability is the explicit goal, projects could sample randomly from the underlying study population of interest, whether that’s a subfield, particular journals, or specific topic areas. That, to date, has not been the primary goal of the Many Labs projects.

Limitation 4 - We select one out of many possible operationalizations of an effect
As said above, we try to spread our bets when picking things to replicate. Sticking to one construct is risky, because it might not pan out. If we decided to replicate 10 studies on the same phenomenon, and that particular phenomenon turned out to be unreliable, we’d be incapable of testing any of our meta-scientific questions. As such, we generally select a wide set of phenomena and then test one (or maybe a couple, for some heuristic-style studies) operationalization for each effect.

This limits the extent to which the results of any one of our studies generalize to a broader phenomenon or theory. Maybe we just picked the worst operationalization and that’s why the study didn’t replicate? Maybe we picked the best operationalization and that’s why it did replicate? We can’t know for sure. This is another criticism I’ve seen in a few places before and I think it is another really good one. The Many Labs studies weren’t designed with this sort of theory testing in mind (Many Labs 4 is the most in this direction). Projects aimed at this goal may be better off taking a deep dive into a given theory/effect using several distinct operationalizations/methodologies. Different questions call for different methodological choices.

Limitation 5 - Power
Many Labs projects collect giant samples of individual participants. This gives us really precise estimates for the effects we replicate. However, the thing I care most about in these projects is the set of broader meta-scientific questions we’re trying to test. Power is a lot trickier for these questions. (I’d like to thank Courtney Soderberg for many in-depth conversations about this topic and Anonymous Reviewer #1 on Many Labs 5 for reminding me that Courtney is right.) At the meta-science level, the study being replicated is often the more important level of analysis. We replicate way fewer studies than we collect human participants - we may have thousands of participants, but only tens of studies. That becomes even more of an issue when underlying studies unexpectedly don’t replicate (see Limitation 3). For instance, how do you compare variation in effects across the semester (in the case of Many Labs 3) when only three effects replicate in the first place? Maybe some effects do vary more than others; but with a small sample of studies that replicated, that kind of variation may be difficult to detect. Plus, we’re often sailing into uncharted waters with these meta-scientific effects, so trying to guess our likely power is very difficult. I do think we’re developing important meta-scientific insights from the Many Labs projects. However, I think we also need to respect the fact that they may be preliminary, and not definitive, insights.
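
To make the power point concrete, here is a rough simulation sketch. It is an illustration only, not an analysis from any Many Labs project. It assumes each study's moderator effect (say, an early-versus-late-semester shift) is measured almost perfectly thanks to huge participant samples, so the only remaining uncertainty is how much that shift varies from study to study. The function name `study_level_power`, the average shift of 0.10, and the between-study spread of 0.15 are all made-up values, and a one-sample t-test across studies is just one simple way to frame the question.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t for df = k - 1.
T_CRIT = {3: 4.303, 10: 2.262, 30: 2.045}


def study_level_power(k, mean_shift=0.10, sd_between=0.15, n_sims=5000, seed=1):
    """Share of simulated projects in which a one-sample t-test across
    k studies detects a nonzero average moderator effect.

    mean_shift and sd_between are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # One moderator effect per study. Participant-level noise is assumed
        # to be negligible, so only between-study variation remains.
        shifts = [rng.gauss(mean_shift, sd_between) for _ in range(k)]
        standard_error = statistics.stdev(shifts) / k ** 0.5
        t_value = statistics.mean(shifts) / standard_error
        if abs(t_value) > T_CRIT[k]:
            hits += 1
    return hits / n_sims


# Compare a project where only 3 effects replicate with larger study counts.
for k in (3, 10, 30):
    print(k, "studies -> estimated power:", round(study_level_power(k), 2))
```

Under these assumptions, adding participants does nothing for this test, because the unit of analysis is the study. Only the number of usable studies moves the estimated power, which is why three replicated effects leave so little room to detect a moderator.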

Limitation 6 - Student samples
So far, the vast majority of participants in Many Labs projects have been university undergrads. Those participants may not be good representations of their countries/regions or of the human population at large, even when we sample across many countries and regions. This is the one critique that I’m sure is 100% my own very original idea. (Of course I’m kidding. It came from talking to Rick Klein and Nick Buttrick about this issue. I think they came up with it though.)

Limitation 7 - Calibrating the amount of influence these projects should have is difficult
Interpreting Many Labs projects, like any other research project, is difficult. What should we learn from them? How confident should we be in those lessons? These are tough questions to answer. I think they become even more important for projects like the Many Labs studies. For better (in my selfish opinion) or worse (in the opinions of some others), these studies have attracted a lot of attention. Their prominence can make them easy sources of heuristics.

“Is this effect real? Well, did it replicate in that Many Labs project? Nope? Guess it’s not real then!”

“Do time-of-semester effects exist? Well, Many Labs 3 didn’t find any. Guess they don’t matter!”

Those might be valid conclusions to draw from the Many Labs projects. They also might not be. As mentioned previously, I think the Many Labs projects provide important information on questions like these. The evidence generated by these projects should inform our beliefs. However, I think they can easily be perceived as providing definitive answers, given their scale and the attention they’ve received. That kind of thinking might be premature.

As I said up top, appropriately calibrated impact is the best kind of impact. I hope this list helps folks interested in the Many Labs projects calibrate what they take away from them.

And finally, I’ll end with what I think is the most definitive conclusion from the Many Labs projects - there are lots of awesome people in the field. Whether they’ve been collaborators, reviewers, original authors, or editors, I’ve gotten the opportunity to meet and work with a ton of talented, bright, and passionate people. I’ll forever be grateful for that.

Despite these limitations, there’s a lot of potential in crowdsourced research. For more discussion of that, see this introduction to the Psychological Science Accelerator and this paper on the many uses of crowdsourcing.
