I came across an interesting problem on the website of UofT prof David Liu (who I had the privilege of TAing for a few years back). Part of what makes the problem amusing—I think—is how we respond to it.
Consider the following Python program:
```python
def has_even(A):
    for i in range(len(A)):
        if A[i] % 2 == 0:
            return i
    return -1
```
Suppose we’re running it on the domain of binary strings of length $n$. What’s the average number of loop iterations? What does your intuition tell you?
Dave’s solution is rigorous—he’s trying to instil good practices in his students—but if, like me, you haven’t done this in a while, it’s helpful to take a more primitive tack.
We’ve got $2^n$ of these strings, so to get the average number of loop iterations, we multiply $\frac{1}{2^n}$ by the sum of loop iterations required for each string. What does this sum look like? Well, half of the strings will have a $0$ in the first position. These will require a single loop iteration, and there are $2^{n-1}$ of them. Of the remaining strings, half (so a quarter of the total) will have a $0$ in the second position. These will require two loop iterations, and there are $2^{n-2}$ of them. Continuing in this fashion, our sum looks like this:

$$1 \cdot 2^{n-1} + 2 \cdot 2^{n-2} + 3 \cdot 2^{n-3} + \cdots + n \cdot 2^{0} + n$$
(The final term is an annoying edge case: the string that is all $1$s requires the same number of loop iterations as the string that has a $0$ only in the final position.) So our average runtime is:

$$\frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \cdots + \frac{n}{2^n} + \frac{n}{2^n}$$
The first and final terms are both bounded by $1$. Let’s put them aside. By grouping the remaining terms in pairs (exercise left to the reader) we see that the sum is bounded by:

$$1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots$$

We might remember from our analysis class that no matter the size of $n$, this sum is always strictly less than $2$: so the runtime is constant! There will always be bigger and bigger strings of all $1$s, but they are so rare that their effect is drowned out.
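(If you'd rather not take the pairing exercise on faith, the sum can also be brute-forced: enumerate every binary string of length $n$, count the loop iterations directly, and average. The sketch below does exactly that; `loop_iterations` is just my shorthand for "how many times the loop body runs," not anything from Dave's solution. If I haven't slipped up, the exact average works out to $2 - 2^{1-n}$, which creeps up toward $2$ and never reaches it.)

```python
from itertools import product

def loop_iterations(A):
    # How many times has_even's loop body runs on A: one iteration per
    # element, stopping at (and including) the first even one.
    for i, x in enumerate(A):
        if x % 2 == 0:
            return i + 1
    return len(A)

def exact_average(n):
    # Average the iteration count over all 2**n binary strings of length n.
    strings = list(product((0, 1), repeat=n))
    return sum(loop_iterations(s) for s in strings) / len(strings)

for n in range(1, 17):
    print(n, exact_average(n))  # 1.0, 1.5, 1.75, ... creeping up toward 2
```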
By parallel reasoning, the same holds if we take our domain to be arrays of uniformly distributed random integers. You can vary the modulus in the function, and the offset is given by the appropriate geometric series. If, like me, you’ve been ruined by a career in engineering and you prefer simulations to real math, here you go:
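(A rough sketch rather than anything definitive: the array lengths, the value range, and the trial count below are all arbitrary choices, and `loop_iterations` just recovers the iteration count from the function's return value.)

```python
import random

def has_even(A):
    for i in range(len(A)):
        if A[i] % 2 == 0:
            return i
    return -1

def loop_iterations(A):
    # Recover the iteration count from has_even's return value.
    i = has_even(A)
    return len(A) if i == -1 else i + 1

def simulate(n, trials=20_000, lo=0, hi=10**9):
    # Average iteration count over arrays of n integers drawn
    # uniformly from [lo, hi].
    total = 0
    for _ in range(trials):
        A = [random.randint(lo, hi) for _ in range(n)]
        total += loop_iterations(A)
    return total / trials

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, simulate(n))  # hovers just under 2 once n isn't tiny
```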
Until I worked through the problem it wasn’t at all clear to me what the answer was, and I was a little surprised when I finally arrived at $O(1)$. This might be because I’ve been out of school for long enough that I only know how to convert back and forth between JSON and Python dictionaries. But maybe there’s a more interesting explanation.
It’s well-documented that people don’t naturally think in terms of probabilities. Scientists have literally won Nobel prizes for demonstrating this. If you don’t believe me you can read their work; alternatively you can spend an afternoon at your local off-track betting parlor. The gravity of an event warps our perception of its likelihood. We fret over worst-case scenarios, no matter how unlikely they are, because they stick out in our minds. We buy lottery tickets for a similar reason. We regard winning the 649 as a “life-altering” event, not as a “hopelessly improbable” one. This is sometimes called probability neglect.
When I looked at the problem, I thought of the strings that consist entirely of $1$s. They are going to take $n$
iterations, so presumably, my reasoning went, they will inflate the average runtime. The fact that they are an exponentially small minority didn’t even cross my mind until I started working it out. I saw them in terms of their impact, rather than their likelihood. My familiarity with geometric series didn’t prevent me from falling into this trap. The problem serves as a nice quantitative illustration of probability neglect, and a reminder that if you have the time, it never hurts to do the math.