---
title: "Reservoir Sampling"
date: 2024-08-02T18:30:56+01:00
draft: false
description: "Elegantly sampling a stream"
tags:
  - algorithms
  - python
categories:
  - programming
series:
  - Cool algorithms
favorite: false
disable_feed: false
mathjax: true
---

Reservoir Sampling is an online, probabilistic algorithm for uniformly sampling $k$ random elements out of a stream of values.

It's a particularly elegant and small algorithm, requiring only $\Theta(k)$ space and a single pass through the stream.

## Sampling one element

As an introduction, we'll first focus on fairly sampling one element from the stream.

```python
import random
from collections.abc import Iterable


def sample_one[T](stream: Iterable[T]) -> T:
    stream_iter = iter(stream)
    # Sample the first element
    res = next(stream_iter)
    for i, val in enumerate(stream_iter, start=1):
        j = random.randint(0, i)
        # Replace the sampled element with probability 1/(i + 1)
        if j == 0:
            res = val
    # Return the randomly sampled element
    return res
```
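For instance, here is a minimal usage sketch (the generator below is purely illustrative):

```python
# Stream of squares; sample_one never materializes the full sequence.
squares = (n * n for n in range(1_000_000))
print(sample_one(squares))  # prints one uniformly chosen square
```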

### Proof

Let's now prove that this algorithm leads to a fair sampling of the stream.

We'll be doing a proof by induction.

#### Hypothesis $H_N$

After iterating through the first $N$ items in the stream, each of them has had an equal $\frac{1}{N}$ probability of being selected as `res`.

#### Base Case $H_1$

We can trivially observe that the first element is always assigned to `res`. Since $\frac{1}{1} = 1$, the hypothesis is verified.

#### Inductive Case

For a given $N$, let us assume that $H_N$ holds. Let us now look at the events of the loop iteration where $i = N$ (i.e.: observation of the $(N + 1)$-th item in the stream).

`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$, a.k.a $[0, N]$. We then have two cases:

- `j == 0`, with probability $\frac{1}{N + 1}$: we select `val` as the new reservoir element `res`.

- `j != 0`, with probability $\frac{N}{N + 1}$: we keep the previous value of `res`. By $H_N$, any of the first $N$ elements had a $\frac{1}{N}$ probability of being `res` at the start of the loop, so each of them now has a probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being `res`.

And thus, we have proven $H_{N + 1}$ at the end of the loop.
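We can also check this empirically with a quick Monte Carlo sanity sketch: by $H_5$, each element of a 5-element stream should be selected roughly 20% of the time.

```python
from collections import Counter

# Run sample_one many times over the same 5-element stream and tally wins.
trials = 100_000
counts = Counter(sample_one(range(5)) for _ in range(trials))
for value in range(5):
    print(value, counts[value] / trials)  # each ratio should be close to 0.2
```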

## Sampling $k$ elements

The code for sampling k elements is very similar to the one-element case.

```python
import itertools
import random
from collections.abc import Iterable


def sample[T](stream: Iterable[T], k: int = 1) -> list[T]:
    stream_iter = iter(stream)
    # Retain the first 'k' elements in the reservoir
    res = list(itertools.islice(stream_iter, k))
    for i, val in enumerate(stream_iter, start=k):
        j = random.randint(0, i)
        # Replace one element at random with probability k/(i + 1)
        if j < k:
            res[j] = val
    # Return 'k' randomly sampled elements
    return res
```
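A minimal usage sketch (the stream of strings below is only illustrative):

```python
# Sample 3 lines from a generated stream without holding it all in memory;
# only the k-element reservoir is ever kept around.
lines = (f"line-{n}" for n in range(10_000))
print(sample(lines, k=3))  # e.g. ['line-4821', 'line-193', 'line-7730']
```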

### Proof

Let us once again do a proof by induction, assuming the stream contains at least $k$ items.

#### Hypothesis $H_N$

After iterating through the first $N$ items in the stream, each of them has had an equal $\frac{k}{N}$ probability of being sampled from the stream.

#### Base Case $H_k$

We can trivially observe that the first $k$ elements are sampled at the start of the algorithm. Since $\frac{k}{k} = 1$, the hypothesis is verified.

#### Inductive Case

For a given $N$, let us assume that $H_N$ holds. Let us now look at the events of the loop iteration where $i = N$, in order to prove $H_{N + 1}$.

`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$, a.k.a $[0, N]$. We then have three cases:

- `j >= k`, with probability $1 - \frac{k}{N + 1}$: we do not modify the sampled reservoir at all.

- `j < k`, with probability $\frac{k}{N + 1}$: we sample the new element to replace the $j$-th element of the reservoir. Therefore, for any reservoir index $e \in [0, k[$ we can either have:

  - $j = e$: the element is replaced, with probability $\frac{1}{k}$.
  - $j \neq e$: the element is not replaced, with probability $\frac{k - 1}{k}$.

We can now compute the probability that a previously sampled element is kept in the reservoir:

$$1 - \frac{k}{N + 1} + \frac{k}{N + 1} \cdot \frac{k - 1}{k} = \frac{N}{N + 1}$$
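As a concrete check of this identity (the numbers are chosen purely for illustration), take $k = 3$ and $N = 9$:

$$1 - \frac{3}{10} + \frac{3}{10} \cdot \frac{2}{3} = \frac{7}{10} + \frac{2}{10} = \frac{9}{10} = \frac{N}{N + 1}$$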

By $H_N$, any of the first $N$ elements had a $\frac{k}{N}$ probability of being sampled at the start of the loop, so each of them now has a probability $\frac{k}{N} \cdot \frac{N}{N + 1} = \frac{k}{N + 1}$ of being in the reservoir.

We have now proven that all elements have a probability $\frac{k}{N + 1}$ of being sampled at the end of the loop; therefore $H_{N + 1}$ has been verified.
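Finally, here is a quick Monte Carlo sanity sketch of the $k$-element case: with a 10-element stream and $k = 3$, each element should end up in the reservoir with probability $\frac{k}{N} = \frac{3}{10}$.

```python
from collections import Counter

# Count how often each of 10 elements ends up in a reservoir of size k = 3.
trials = 100_000
counts = Counter()
for _ in range(trials):
    counts.update(sample(range(10), k=3))
for value in range(10):
    print(value, counts[value] / trials)  # each ratio should be close to 0.3
```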