From b5d3086542d89ed3feac3451b2ea171b2f4c241f Mon Sep 17 00:00:00 2001 From: Bruno BELANYI Date: Fri, 2 Aug 2024 21:04:49 +0100 Subject: [PATCH] posts: reservoir-sampling: add one-element sample --- .../2024-08-02-reservoir-sampling/index.md | 56 +++++++++++++++++++ 1 file changed, 56 insertions(+) diff --git a/content/posts/2024-08-02-reservoir-sampling/index.md b/content/posts/2024-08-02-reservoir-sampling/index.md index 795c6c1..2196d0e 100644 --- a/content/posts/2024-08-02-reservoir-sampling/index.md +++ b/content/posts/2024-08-02-reservoir-sampling/index.md @@ -25,3 +25,59 @@ amount of space and a single pass through the stream. [online]: https://en.wikipedia.org/wiki/Online_algorithm + +## Sampling one element + +As an introduction, we'll first focus on fairly sampling one element from the +stream. + +```python +def sample_one[T](stream: Iterable[T]) -> T: + stream_iter = iter(stream) + # Sample the first element + res = next(stream_iter) + for i, val in enumerate(stream_iter, start=1): + j = random.randint(0, i) + # Replace the sampled element with probability 1/(i + 1) + if j == 0: + res = val + # Return the randomly sampled element + return res +``` + +### Proof + +Let's now prove that this algorithm leads to a fair sampling of the stream. + +We'll be doing proof by induction. + +#### Hypothesis $H_N$ + +After iterating through the first $N$ items in the stream, +each of them has had an equal $\frac{1}{N}$ probability of being selected as +`res`. + +#### Base Case $H_1$ + +We can trivially observe that the first element is always assigned to `res`, +$\frac{1}{1} = 1$, the hypothesis has been verified. + +#### Inductive Case + +For a given $N$, let us assume that $H_N$ holds. Let us now look at the events +of loop iteration where `i = N` (i.e: observation of the $N + 1$-th item in the +stream). + +`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$, +a.k.a $[0, N]$. We then have two cases: + +* `j == 0`, with probability $\frac{1}{N + 1}$: we select `val` as the new +reservoir element `res`. + +* `j != 0`, with probability $\frac{N}{N + 1}$: we keep the previous value of +`res`. By $H_N$, any of the first $N$ elements had a $\frac{1}{N}$ probability +of being `res` before at the start of the loop, each element now has a +probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being the +element. + +And thus, we have proven $H_{N + 1}$ at the end of the loop.