the theoretical urn

In physics and chemistry, better instruments and larger sample sizes raise the bar for making accurate predictions. But in psychology, software usability testing, and biomedical science—any discipline that relies heavily on significance testing—they have the opposite effect.

In the 1960s, University of Minnesota psychologist Paul Meehl identified many of the problems underpinning science’s ongoing replication crisis. Most relate to injudicious use of null hypothesis significance testing (NHST). His critiques clearly didn’t get the attention they deserved (otherwise we wouldn’t be here). In recent years some of his arguments have finally started to get their due scrutiny, particularly the base-rate fallacy and the emphasis on publishing positive results. (For a review of NHST and the base-rate fallacy, I recommend chapters 3 and 5 of Alex Reinhart’s Statistics Done Wrong.)

One critique of Meehl’s that I haven’t yet heard anyone talk about is “the theoretical urn.” In a 1967 paper, Meehl demonstrates a paradox of how better instruments and bigger sample sizes can abet bad science. The key premise is that the point null hypothesis—that there is literally zero difference between the control group and the test group—is almost always false. He’s not alone in holding this view. Andrew Gelman makes the same argument. Here’s Meehl’s justification:

Considering the fact that “everything in the brain is connected with everything else,” and that there exist several “general state-variables” (such as arousal, attention, anxiety, and the like) which are known to be at least slightly influenceable by practically any kind of stimulus input, it is highly unlikely that any psychologically discriminable stimulation which we apply to an experimental subject would exert literally zero effect upon any aspect of his performance.

Meehl imagines two urns. The first contains theories of the form “the treatment group exceeds the control group on measure X.” The second contains experiments in which the former group is given some treatment that is withheld from the latter. Since the point null hypothesis is essentially always false, as sample sizes and measurement accuracy grow, every experiment’s outcome leans slightly one way or the other. When we randomly pre-register a theory and pair it with a high-sample-size, high-accuracy experiment, there is a 50% chance the outcome will be skewed in the direction that confirms our hypothesis. We just have to have guessed the right direction.

At this point you might interject that significance testing is there to protect us. And to some degree it will. But—and this is Meehl’s whole point—its protection will erode as our instruments improve and our sample size increases. Eventually, the deviation from the null hypothesis will surface “significantly.” Hence Meehl:

The effect of increased precision, whether achieved by improved instrumentation and control, greater sensitivity in the logical structure of the experiment, or increasing the number of observations, is to yield a probability approaching 1/2 of corroborating our substantive theory by a significance test, even if the theory is totally without merit.
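Meehl’s claim is easy to check numerically. The sketch below is my own, not Meehl’s; the effect size, sample sizes, and trial counts are arbitrary assumptions. It simulates two-group experiments with a tiny but nonzero true effect, pairs each with a randomly drawn directional theory, and counts how often a z-test at the 5% level “confirms” the theory. Since the observations have unit variance, the difference in group means is Normal(effect, sqrt(2/n)), so we can sample it directly.

```python
import math
import random

random.seed(1)

def confirm_rate(n, true_effect=0.1, trials=20000, z_crit=1.96):
    """Probability that a two-group experiment with n subjects per arm
    'significantly' confirms a randomly drawn directional theory,
    given a small but nonzero true effect."""
    se = math.sqrt(2.0 / n)  # std. error of the mean difference
    confirmed = 0
    for _ in range(trials):
        predicted_sign = random.choice([-1, 1])  # theory from the urn
        diff = random.gauss(true_effect, se)     # observed group difference
        z = diff / se
        if abs(z) > z_crit and (1 if z > 0 else -1) == predicted_sign:
            confirmed += 1
    return confirmed / trials

for n in (20, 500, 50000):
    print(f"n={n:>6}  confirmation rate = {confirm_rate(n):.3f}")
```

At small n the confirmation rate sits near zero, but as n grows it climbs toward 1/2, exactly as Meehl predicts: with enough precision, a coin flip on the theory’s direction is all it takes to “corroborate” it.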

Here’s an example of the kind of thing I think Meehl was worried about. You see a lot of studies in the language learning literature where researchers vary some part of the learning experience (the time of the lesson, exercise during the lesson, sleep after the lesson, etc.) and then compare test scores. They find that one group did better, and proffer some explanation as to why. The explanation is typically pretty vague, invoking the sort of high-level constructs Meehl talks about.

So vague, I’m inclined to argue, that even if we grant that this result wasn’t a type 1 error, we have basically learned nothing. We already knew that “everything in the brain is connected with everything else.” And because our mechanistic explanation is so hand-wavy, we don’t know whether the effect is coming from a deep and important connection that we can exploit in applied science, or just a relic of our particular experimental setup. Maybe the test group studied harder outside of class to compensate for their weird classroom experience. Maybe neither group of students had any skin in the game, and the control group simply found it easier to zone out.

If, like me, you work in software, you are probably interested in two kinds of studies: online A/B tests and offline user tests. For online A/B tests, I’m much more worried about the base-rate fallacy than the theoretical urn. This is because we have the benefit of directly optimizing for quantities of interest. This is not to say that direct optimization is the be-all and end-all of building data products—my last post is a long-winded essay about why it isn’t—but it does mean that the question of whether a significant difference can be chalked up to a relic of your experimental setup is moot. The distinction between lab setups and the real world that plagues psychology doesn’t apply to online tests. If a feature increases conversion, it increases conversion. The only circumstance in which it doesn’t increase conversion is when it doesn’t; namely, when your result is a fluke. And flukes, combined with our biased interest toward positive results, are a legitimate worry. If you’re running 100 copywriting experiments, each with a significance level of 5%, you’re probably going to get some false positives.
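The arithmetic behind that last worry is worth making explicit. If all 100 copywriting changes truly did nothing, each test independently still has a 5% chance of coming up “significant”:

```python
# 100 A/B tests where the copy change truly has no effect:
# each test still has a 5% chance of a "significant" result.
alpha, n_tests = 0.05, 100

prob_at_least_one = 1 - (1 - alpha) ** n_tests  # chance of any false positive
expected_false_positives = alpha * n_tests      # expected count under the null

print(f"P(at least one false positive) = {prob_at_least_one:.3f}")
print(f"Expected false positives = {expected_false_positives:.1f}")
```

You should expect about five spurious wins, and the chance of seeing at least one is over 99%. If only the positive results get written up, the file drawer does the rest.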

I think that smaller, offline user studies are where we need to be more worried about the theoretical urn. Likert scale statistics, response times, and completion rates are merely proxies for how users will behave in the wild. A significant difference on one of these measures could be due to the sterile setting rather than an improvement in the product. To be confident in your results, instead of increasing sample sizes and tweaking your instrumentation, test the stability of your results. What happens when you reword your questions? What happens when you use a different rating scale? What happens when you change rooms? A result that is robust against these permutations is more convincing than a result that happened to pass an arbitrary significance threshold.
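One minimal way to frame such a stability check: treat each permutation of the study as a replication and ask whether the direction of the effect survives. The setup names and effect sizes below are invented purely for illustration:

```python
# Hypothetical robustness check for a usability comparison run under
# several variations of the study setup. All numbers are made up.
effect_by_setup = {
    "original wording":   +0.42,
    "reworded questions": +0.31,
    "7-point scale":      +0.38,
    "different room":     -0.05,
}

# The effect direction is stable only if every setup agrees on its sign.
signs = [effect > 0 for effect in effect_by_setup.values()]
stable = all(signs) or not any(signs)

print("direction stable across setups:", stable)
```

Here the sign flips when the room changes, which is exactly the kind of red flag a single significance test on the original setup would never surface.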

Further reading

Theory-Testing in Psychology and Physics: A Methodological Paradox
Paul E. Meehl, 1967

Statistical power and underpowered statistics
Alex Reinhart, 2015

The p value and the base rate fallacy
Alex Reinhart, 2015

The Puzzle of Paul Meehl: An intellectual history of research criticism in psychology
Andrew Gelman, 2016

“I feel like the really solid information therein comes from non or negative correlations”
Andrew Gelman, 2019

Published by Dave Fernig