blog/content/posts/2024-08-02-reservoir-sampling/index.md

84 lines
2.4 KiB
Markdown
Raw Normal View History

2024-08-02 22:04:30 +02:00
---
title: "Reservoir Sampling"
date: 2024-08-02T18:30:56+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Elegantly sampling a stream"
tags:
- algorithms
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
mathjax: true
---
[_Reservoir Sampling_][reservoir] is an [online][online], probabilistic
algorithm to uniformly sample $k$ random elements out of a stream of values.
It's a particularly elegant and small algorithm, only requiring $\Theta(k)$
amount of space and a single pass through the stream.
[reservoir]: https://en.wikipedia.org/wiki/Reservoir_sampling
[online]: https://en.wikipedia.org/wiki/Online_algorithm
<!--more-->
## Sampling one element
As an introduction, we'll first focus on fairly sampling one element from the
stream.
```python
def sample_one[T](stream: Iterable[T]) -> T:
stream_iter = iter(stream)
# Sample the first element
res = next(stream_iter)
for i, val in enumerate(stream_iter, start=1):
j = random.randint(0, i)
# Replace the sampled element with probability 1/(i + 1)
if j == 0:
res = val
# Return the randomly sampled element
return res
```
### Proof
Let's now prove that this algorithm leads to a fair sampling of the stream.
We'll be doing proof by induction.
#### Hypothesis $H_N$
After iterating through the first $N$ items in the stream,
each of them has had an equal $\frac{1}{N}$ probability of being selected as
`res`.
#### Base Case $H_1$
We can trivially observe that the first element is always assigned to `res`,
$\frac{1}{1} = 1$, the hypothesis has been verified.
#### Inductive Case
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of loop iteration where `i = N` (i.e: observation of the $N + 1$-th item in the
stream).
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have two cases:
* `j == 0`, with probability $\frac{1}{N + 1}$: we select `val` as the new
reservoir element `res`.
* `j != 0`, with probability $\frac{N}{N + 1}$: we keep the previous value of
`res`. By $H_N$, any of the first $N$ elements had a $\frac{1}{N}$ probability
of being `res` before at the start of the loop, each element now has a
probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being the
element.
And thus, we have proven $H_{N + 1}$ at the end of the loop.