Simulating conditional probability - looping versus vectorization

By Paul English, Wed 30 November 2016, in category Programming

math programming, probability, python

Conditional probability can be tricky if you're new to the topic, or if you're facing a particularly complex problem. One useful trick is to simply simulate the probability with a general-purpose programming language.

Let's look at a very simple example. Say we roll a fair six-sided die, and flip coins equal to the number rolled. What is the probability we flip exactly 2 heads?

A die is modeled well with a discrete uniform random variable $N \sim \text{unif}(1,6)$. A series of coin flips is modeled with a binomial random variable $X \sim \text{binom}(n, 0.5)$, where $n$ is the number of coin flips and $0.5$ is the probability of a fair coin landing on heads.
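To make this concrete, a single trial of the model looks like the following in numpy (a minimal sketch; the full simulation comes later in the post):

import numpy as np

# one trial: roll the die, then flip that many fair coins
n = np.random.randint(1, 7)         # upper bound is exclusive, so this is uniform on {1, ..., 6}
heads = np.random.binomial(n, 0.5)  # number of heads among n flips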

With these random variables, answering our question amounts to computing $[X = 2]$, which the law of total probability decomposes as $[X = 2] = \sum_n [X = 2 | N=n] [N = n]$. I use the shorthand square brackets to represent the probability mass function just for convenience, e.g. $\mathbb{P}(X=2) = [X=2]$.

First let's solve this analytically, using the probability mass functions of the uniform and binomial distributions:

$$ \begin{align} [X = 2] &= \sum_{n=1}^6 [X = 2 | N=n] [N = n] \\ &= \frac{1}{6} \sum_{n=1}^6 {n \choose 2} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^{n-2} \\ &= 0.2578125 \end{align} $$
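If you'd rather not do that sum by hand, the exact value is easy to check with scipy's binomial pmf (a quick sketch, assuming scipy is installed):

from scipy.stats import binom

# P(X = 2) = (1/6) * sum over n of P(X = 2 | N = n)
exact = sum(binom.pmf(2, n, 0.5) for n in range(1, 7)) / 6
print(exact)  # 0.2578125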

If we didn't know how to approach this problem analytically, it's relatively straightforward to solve it in code. A simulation is usually just a loop, where we perform the same random experiment over and over. By the law of large numbers, the fraction of iterations in which our event occurs approaches the true probability as the number of iterations grows.

Let's try this. We'll create a loop, and in each iteration we will generate a random number between 1 and 6 to simulate the die roll, and then using that we'll sample from a binomial distribution to represent our coin flips.

import random

import numpy as np
from tqdm import tqdm

n_iter = int(1e7)

samples = []
for i in tqdm(range(n_iter)):
    n = random.randint(1, 6)                         # die roll (bounds inclusive)
    samples.append(np.random.binomial(n, 0.5) == 2)  # exactly 2 heads?

np.mean(samples)
# > 100%|██████████| 10000000/10000000 [00:38<00:00, 262078.03it/s]
#   0.2581388

I run this with 10 million iterations, which is far more than needed to approximate this value. The point is not just to approximate the answer, but to show that this is a relatively slow approach: the loop took 38 seconds in total. We can do a lot better by using vectorized operations (in Python we do this with numpy, but other languages like Julia, Matlab, and R support them natively).

Let's look at the same 10 million iterations. This time, instead of a loop, we build an array of 10 million random samples from the uniform distribution and pass it directly to the binomial sampler, which accepts an array of $n$ values.

n_iter = int(1e7)
# np.random.randint's upper bound is exclusive, hence 7 for a die roll of 1-6
np.mean(np.random.binomial(np.random.randint(1, 7, n_iter), 0.5) == 2)
# > 0.25761450000000002

If we use the %timeit magic inside a Jupyter notebook, we can see that this is significantly quicker to compute.

n_iter = int(1e7)
%timeit np.mean(np.random.binomial(np.random.randint(1, 7, n_iter), 0.5) == 2)
# > 1 loop, best of 3: 918 ms per loop

This is not the most complex example, but it illustrates the difference between looped and vectorized operations. When you're working with a library like numpy or tensorflow, or in any language that treats arrays as first-class citizens, your goal should almost always be to compute with vectorized operations. Your free time will thank you for it, and as a bonus your code will usually be more concise and readable. This tends to be one of the first things I end up teaching to math and statistics friends who are trying their hand at programming; it's always a good thing to explain to anyone new to numerical programming.

The notebook with these examples is available here.