Estimating Expectations and Variances using Python

Last week, for pedagogical reasons, I had to solve the following problem:

In a bag there are blue and red balls. We randomly take balls from the bag without replacement. Let the random variable denote the number of blue balls we get. Find out the expectation and variance of .

The problem is quite a simple one. For finding the expectation, we can use the definition

We can also compute

Then we get the variance using the formula

It is a straightforward exercise to calculate the probabilities. I got , , and ¹. Substituting in the equations I got the expectation as and variance as .

I wanted to double-check the values. Instead of retracing the steps as a sane person would have done², I wrote a small program to simulate the experiment.

The Simulation

Here is the class RedBlue which defines the essential methods:

import random as rand
from enum import Enum

class Color(Enum):
    RED = 1
    BLUE = 2


class RedBlue:
    def __init__(self, num_reds, num_blues):
        self.num_reds = num_reds
        self.num_blues = num_blues

        self.bag = [Color.RED] * num_reds + [Color.BLUE] * num_blues


    def prob_blues(self, num_picks, num_trials):
        """Return the probability of getting 'i' blue balls when 'num_picks' balls are
        randomly taken from 'self.bag' without replacement. The return value is
        a list whose ith element is the probability of getting 'i' blue balls.
        """
        counts = [0] * (num_picks + 1)

        for i in range(num_trials):
            counts[ self.count_blues(num_picks) ] += 1

        return [ x/num_trials for x in counts ]


    def count_blues(self, num_picks):
        """Returns the number of blue balls observed when 'num_picks' balls are taken
        from 'self.bag' without replacement."""
        balls = rand.sample(self.bag, num_picks)
        return balls.count(Color.BLUE)

The sample function in Python’s random library is used to get a random sample sample from the input population, without replacement. The count_blues function gets a sample, and then counts the number of blue balls it contains. The prob_blues function repeatedly calls count_balls to estimate the probability of getting each possible number of blue balls.

My initial idea was to compute the probabilities using the prob_blues function, and then to use the formulas for and to calculate the Expectation and Variance.

It then stuck me that it was an unnecessarily roundabout way. Expectation is ultimately just the mean of the random variable. So we could as well take the average of the values that count_blues returns to get the Expectation. can also be computed similarly.

So I ended up not using the prob_blues function. The rest of the script is given below:

rb = RedBlue(5, 4)

# Compute the expectation and variance
s = 0                       # For E[X]
s2 = 0                      # For E[X^2]
num_trials = int(1e5)
num_picks = 3
for i in range(num_trials):
    num_blues = rb.count_blues(num_picks)
    s += num_blues
    s2 += num_blues * num_blues
mean = s/num_trials
mean_x2 = s2/num_trials
var = mean_x2 - mean * mean
print("Mean = {0}, Variance = {1}".format(mean, var))

The program outputs the estimated Expectation as , and Variance as . Both are close to the analytical quantities.

Advantages

Is this method more efficient than computing the probabilities first and then finding the weighted average? The answer is no. Whether we compute the probabilities or not, we end up doing essentially the same work. If we do trials and if there are distinct values for the random variable, to compute the probabilities we need additions. Then to calculate the expectation and variance, we need a further multiplications and additions.

In the function given above, all of the computation happens in a loop which gets executed times. Therefore we have multiplications and additions. If additions and multiplications are equally costly (as they are for small-enough numbers), both methods are equivalent in complexity. For bigger numbers (those which can’t fit in one machine word), multiplication is costlier than addition, and our function would perform worse.

There is, however, one important advantage: for computing the probabilities, we need to spend amount of space. We save that memory when we do the direct computation of Expectations. For our particular problem, that does not make any difference, as is only . But in general, this might matter a lot.

The Geometric Random Variable

The geometric random variable is a case in point. It stands for the number of trials we need to perform for being successful. If we encounter failures before the first success, the value of the random variable is . We denote the probability of success in a single trial as .

The geometric random variable can have any positive integer value. Therefore, to estimate its expectation and variance using the probability method, we need amount of memory in the (unlikely) worst case, where each trial gives a different value. The direct method has no such headache.

Even more importantly, it also works for continuous random variables.

The Uniform Random Variable

Possibly the simplest continuous random variable is the Uniform random variable. Python’s random module’s random() function is a uniform random variable with range . To compute its expectation and variance, we need to use integration. We can of course approximate the integration as a sum and use something very similar to the prob_blues function to evaluate it.

But it is much clearer to use the direct method:

s, s2 = 0, 0
num_trials = int(1e5)

for i in range(num_trials):
    x = rand.random()
    s += x
    s2 += x*x

mean = s/num_trials
mean_x2 = s2/num_trials

var = mean_x2 - mean * mean

print("Estimated mean = {0:.4f}, variance = {1:.4f}".format(mean, var))

When you run the code, you should get the estimated mean to be close to , which is the Expectation of the Uniform random variable. The variance would be around the theoretical value .

If you want to play around with the code, you can get the files from the Expectation and Variance sub-directory of this git repository.

Why do I keep 42 in the denominator? Because it is the answer! ↩
I assume ↩

The Simulation

Advantages

The Geometric Random Variable

The Uniform Random Variable

Footnotes