Compare commits

...

40 commits

Author SHA1 Message Date
Bruno BELANYI 2dc2ab4659 nix: bump inputs
Some checks failed
ci/woodpecker/push/deploy/2 Pipeline failed
ci/woodpecker/cron/deploy/2 Pipeline failed
2024-08-02 21:16:33 +01:00
Bruno BELANYI 013cfb1af0 Add Reservoir Sampling post
Some checks failed
ci/woodpecker/push/deploy/2 Pipeline failed
2024-08-02 21:12:05 +01:00
Bruno BELANYI d9daa68902 posts: reservoir-sampling: add k-element sampling 2024-08-02 21:11:51 +01:00
Bruno BELANYI b5d3086542 posts: reservoir-sampling: add one-element sample 2024-08-02 21:11:26 +01:00
Bruno BELANYI 2dfbe3f0aa posts: add 'reservoir-sampling' 2024-08-02 21:11:26 +01:00
Bruno BELANYI accd30a9d8 markdownlint: relax duplicate header check 2024-08-02 21:11:26 +01:00
Bruno BELANYI 13f014af90 config: enable MathJax 2024-08-02 21:03:47 +01:00
Bruno BELANYI 691cf704f4 Add Treap Revisited post
Some checks failed
ci/woodpecker/push/deploy/2 Pipeline failed
ci/woodpecker/cron/deploy/2 Pipeline failed
2024-07-27 18:41:07 +01:00
Bruno BELANYI aa72b269c2 posts: treap: add removal 2024-07-27 18:41:07 +01:00
Bruno BELANYI eaaa9c9ae2 posts: treap-revisited: add insertion 2024-07-27 18:41:07 +01:00
Bruno BELANYI b3ed39eef3 posts: treap: add merge 2024-07-27 18:41:07 +01:00
Bruno BELANYI aee1fa55c8 posts: treap-revisited: add split 2024-07-27 18:41:07 +01:00
Bruno BELANYI 021c40f6ac posts: treap-revisited: add implementation 2024-07-27 18:31:47 +01:00
Bruno BELANYI 6f813eb787 posts: add 'treap-revisited' 2024-07-27 18:31:24 +01:00
Bruno BELANYI a868474177 posts: gap-buffer: fix typo
Some checks failed
ci/woodpecker/push/deploy/2 Pipeline was successful
ci/woodpecker/cron/deploy/2 Pipeline failed
2024-07-20 19:29:59 +01:00
Bruno BELANYI 9728aa537e posts: union-find: fix typo 2024-07-20 19:29:59 +01:00
Bruno BELANYI 4c12ebb16c Add Treap post 2024-07-20 19:29:59 +01:00
Bruno BELANYI ca89605db5 posts: treap: add insertion 2024-07-20 19:29:59 +01:00
Bruno BELANYI 1539ed5ce9 posts: treap: add search 2024-07-20 19:29:59 +01:00
Bruno BELANYI f981ee9903 posts: treap: add construction 2024-07-20 19:29:59 +01:00
Bruno BELANYI a4b02f3644 posts: treap: add presentation 2024-07-20 19:29:59 +01:00
Bruno BELANYI 6d7b2cce11 posts: add treap 2024-07-20 19:28:16 +01:00
Bruno BELANYI 2118ba170f layouts: add Mermaid support
All checks were successful
ci/woodpecker/push/deploy/2 Pipeline was successful
Similar to Graphviz and TikZ support.
2024-07-20 17:24:54 +01:00
Bruno BELANYI fc0479a1dc layouts: add Graphviz support
Similar to TikZ support.
2024-07-20 17:24:54 +01:00
Bruno BELANYI 491041df09 layouts: tikz: allow using file input
Makes it easier to handle big diagrams.
2024-07-20 17:21:42 +01:00
Bruno BELANYI 189cdcf05a Add Bloom Filter post
All checks were successful
ci/woodpecker/push/deploy/2 Pipeline was successful
2024-07-20 14:16:47 +01:00
Bruno BELANYI de48eb9e94 Add Gap Buffer post 2024-07-20 14:16:47 +01:00
Bruno BELANYI 27152689ea posts: bloom-filter: add lookup 2024-07-20 14:16:47 +01:00
Bruno BELANYI d1a67510ef posts: gap-buffer: add movement 2024-07-20 14:16:47 +01:00
Bruno BELANYI 2c31c1aff2 posts: bloom-filter: add insertion 2024-07-20 14:16:47 +01:00
Bruno BELANYI e05ed1cc4a posts: gap-buffer: add deletion 2024-07-20 14:16:47 +01:00
Bruno BELANYI 798116716f posts: bloom-filter: add construction 2024-07-20 14:16:47 +01:00
Bruno BELANYI 72057a3224 posts: gap-buffer: add insertion 2024-07-20 14:16:47 +01:00
Bruno BELANYI 3992996a89 posts: bloom-filter: add presentation 2024-07-20 14:16:47 +01:00
Bruno BELANYI 0084c8717a posts: gap-buffer: add growth 2024-07-20 14:16:47 +01:00
Bruno BELANYI f4a64b2a37 posts: add bloom-filter 2024-07-20 14:16:47 +01:00
Bruno BELANYI 4d69be0633 posts: gap-buffer: add accessors 2024-07-20 14:16:47 +01:00
Bruno BELANYI 091e8527e3 posts: gap-buffer: add construction 2024-07-20 14:16:47 +01:00
Bruno BELANYI a4976aeefb posts: gap-buffer: add presentation 2024-07-20 14:16:47 +01:00
Bruno BELANYI 239d5c3dbd posts: add gap-buffer 2024-07-20 14:16:47 +01:00
16 changed files with 1857 additions and 21 deletions

.markdownlint.yaml Normal file
View file

@@ -0,0 +1,3 @@
# MD024/no-duplicate-heading/no-duplicate-header
MD024:
siblings_only: true

View file

@@ -67,6 +67,7 @@ params:
webmentions:
login: belanyi.fr
pingback: true
mathjax: true
taxonomies:
category: "categories"

View file

@ -8,6 +8,8 @@ tags:
categories:
favorite: false
tikz: true
graphviz: true
mermaid: true
---
## Test post please ignore
@@ -40,6 +42,29 @@ echo hello world | cut -d' ' -f 1
\end{tikzpicture}
{{% /tikz %}}
### Graphviz support
{{% graphviz %}}
graph {
a -- b
b -- c
c -- a
}
{{% /graphviz %}}
### Mermaid support
{{% mermaid %}}
graph TD
A[Enter Chart Definition] --> B(Preview)
B --> C{decide}
C --> D[Keep]
C --> E[Edit Definition]
E --> B
D --> F[Save Image and Code]
F --> B
{{% /mermaid %}}
### Spoilers
{{% spoiler "Don't open me" %}}

View file

@@ -15,7 +15,7 @@ favorite: false
disable_feed: false
---
To kickoff the [series]({{< ref "/series/cool-algorithms/">}}) of posts about
To kickoff the [series]({{< ref "/series/cool-algorithms/" >}}) of posts about
algorithms and data structures I find interesting, I will be talking about my
favorite one: the [_Disjoint Set_][wiki]. Also known as the _Union-Find_ data
structure, so named because of its two main operations: `ds.union(lhs, rhs)` and

View file

@@ -0,0 +1,191 @@
---
title: "Gap Buffer"
date: 2024-07-06T21:27:19+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "As featured in GNU Emacs"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
The [_Gap Buffer_][wiki] is a popular data structure for text editors to
represent files and editable buffers, the most famous of them probably being
[GNU Emacs][emacs].
[wiki]: https://en.wikipedia.org/wiki/Gap_buffer
[emacs]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html
<!--more-->
## What does it do?
A _Gap Buffer_ is simply a list of characters, similar to a normal string, with
the added twist of splitting it into two sides: the prefix and suffix, on either
side of the cursor. In between them, a gap is left to allow for quick
insertion at the cursor.
Moving the cursor moves the gap around the buffer, the prefix and suffix getting
shorter/longer as required.
## Implementation
I'll be writing a sample implementation in Python, as with the rest of the
[series]({{< ref "/series/cool-algorithms/" >}}). I don't think it showcases the
elegance of the _Gap Buffer_ in action like a C implementation full of
`memmove`s would, but it does make it short and sweet.
### Representation
We'll be representing the gap buffer as an actual list of characters.
Given that Python doesn't _have_ characters, let's settle for a list of strings,
each representing a single character...
```python
Char = str
class GapBuffer:
# List of characters, contains prefix and suffix of string with gap in the middle
_buf: list[Char]
# The gap is contained between [start, end) (i.e: buf[start:end])
_gap_start: int
_gap_end: int
# Visual representation of the gap buffer:
# This is a very [ ]long string.
# |<----------------------------------------------->| capacity
# |<------------>| |<-------->| string
# |<------------------->| gap
# |<------------>| prefix
# |<-------->| suffix
def __init__(self, initial_capacity: int = 16) -> None:
assert initial_capacity > 0
# Initialize an empty gap buffer
self._buf = [""] * initial_capacity
self._gap_start = 0
self._gap_end = initial_capacity
```
### Accessors
I'm mostly adding these for exposition, and making it easier to write `assert`s
later.
```python
@property
def capacity(self) -> int:
return len(self._buf)
@property
def gap_length(self) -> int:
return self._gap_end - self._gap_start
@property
def string_length(self) -> int:
return self.capacity - self.gap_length
@property
def prefix_length(self) -> int:
return self._gap_start
@property
def suffix_length(self) -> int:
return self.capacity - self._gap_end
```
### Growing the buffer
I've written this method in a somewhat non-idiomatic manner, to make it closer
to how it would look in C using `realloc` instead.
It would be more efficient to use slicing to insert the needed extra capacity
directly, instead of making a new buffer and copying characters over.
```python
def grow(self, capacity: int) -> None:
assert capacity >= self.capacity
# Create a new buffer with the new capacity
new_buf = [""] * capacity
# Move the prefix/suffix to their place in the new buffer
added_capacity = capacity - len(self._buf)
new_buf[: self._gap_start] = self._buf[: self._gap_start]
new_buf[self._gap_end + added_capacity :] = self._buf[self._gap_end :]
# Use the new buffer, account for added capacity
self._buf = new_buf
self._gap_end += added_capacity
```
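For comparison, a minimal sketch of that slicing approach could look like the
following (`grow_with_slicing` is a hypothetical name for illustration, not part
of the buffer's API above):

```python
def grow_with_slicing(self, capacity: int) -> None:
    assert capacity >= self.capacity
    # Splice the extra slots directly into the gap, just before the suffix
    added_capacity = capacity - self.capacity
    self._buf[self._gap_end : self._gap_end] = [""] * added_capacity
    # The gap now extends further to the right
    self._gap_end += added_capacity
```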
### Insertion
Inserting text at the cursor's position means filling up the gap in the middle
of the buffer. To do so we must first make sure that the gap is big enough, or
grow the buffer accordingly.
Then inserting the text is simply a matter of copying its characters in place,
and moving the start of the gap further right.
```python
def insert(self, val: str) -> None:
# Ensure we have enough space to insert the whole string
if len(val) > self.gap_length:
self.grow(max(self.capacity * 2, self.string_length + len(val)))
# Fill the gap with the given string
self._buf[self._gap_start : self._gap_start + len(val)] = val
self._gap_start += len(val)
```
### Deletion
Removing text from the buffer simply expands the gap in the corresponding
direction, shortening the string's prefix/suffix. This makes it very cheap.
The methods are named after the `backspace` and `delete` keys on the keyboard.
```python
def backspace(self, dist: int = 1) -> None:
assert dist <= self.prefix_length
# Extend gap to the left
self._gap_start -= dist
def delete(self, dist: int = 1) -> None:
assert dist <= self.suffix_length
# Extend gap to the right
self._gap_end += dist
```
### Moving the cursor
Moving the cursor along the buffer will shift letters from one side of the gap
to the other, moving them across from prefix to suffix and back.
I find Python's list slicing not quite as elegant to read as a `memmove`, though
it does make for a very small and efficient implementation.
```python
def left(self, dist: int = 1) -> None:
assert dist <= self.prefix_length
# Shift the needed number of characters from end of prefix to start of suffix
self._buf[self._gap_end - dist : self._gap_end] = self._buf[
self._gap_start - dist : self._gap_start
]
# Adjust indices accordingly
self._gap_start -= dist
self._gap_end -= dist
def right(self, dist: int = 1) -> None:
assert dist <= self.suffix_length
# Shift the needed number of characters from start of suffix to end of prefix
self._buf[self._gap_start : self._gap_start + dist] = self._buf[
self._gap_end : self._gap_end + dist
]
# Adjust indices accordingly
self._gap_start += dist
self._gap_end += dist
```
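To tie the pieces together, here is a short usage sketch (not part of the
original post) that types "hello" with a stray character and then fixes it; the
visible string is just the prefix plus the suffix:

```python
buf = GapBuffer()
buf.insert("hxello")
buf.left(4)      # Move the cursor to just after the stray "x"
buf.backspace()  # Remove the "x"
buf.right(4)     # Move back to the end of the buffer

# Rebuild the visible string from prefix + suffix (helper not defined in the post)
text = "".join(buf._buf[: buf._gap_start] + buf._buf[buf._gap_end :])
assert text == "hello"
```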

View file

@@ -0,0 +1,97 @@
---
title: "Bloom Filter"
date: 2024-07-14T17:46:40+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Probably cool"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
The filter can be used as an inexpensive first step when querying the actual
data is quite costly (e.g: as a first check for expensive cache lookups or large
data seeks).
[wiki]: https://en.wikipedia.org/wiki/Bloom_filter
<!--more-->
## What does it do?
A _Bloom Filter_ can be understood as a hash-set which can either tell you:
* An element is _not_ part of the set.
* An element _may be_ part of the set.
More specifically, one can tweak the filter's parameters to keep the _false
positive_ rate of membership acceptably low.
I won't go into those calculations here; they are simple enough to compute, and
one can also just look up appropriate values for a given use case.
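For reference (this is not derived in the post), the standard approximation with
`m` bits, `k` hash functions, and `n` inserted elements puts the false positive
rate at roughly `(1 - e^(-k * n / m)) ** k`, which is minimized by choosing
`k ≈ (m / n) * ln(2)`.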
## Implementation
I'll be using Python, whose built-in big integers make it easy to represent
bitsets.
We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
tweaked to use a different number, or even change it at construction time.
### Representation
A `BloomFilter` is just a set of bits and a list of hash functions.
```python
BIT_COUNT = 64
class BloomFilter[T]:
_bits: int
_hash_functions: list[Callable[[T], int]]
def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
# Filter is initially empty
self._bits = 0
self._hash_functions = hash_functions
```
### Inserting a key
To add an element to the filter, we take the output from each hash function and
use that to set a bit in the filter. This combination of bits will identify the
element, which we can use for lookup later.
```python
def insert(self, val: T) -> None:
# Iterate over each hash
for f in self._hash_functions:
n = f(val) % BIT_COUNT
# Set the corresponding bit
self._bits |= 1 << n
```
### Querying a key
Because the _Bloom Filter_ does not actually store its elements, but some
derived data from hashing them, it can only definitely say if an element _does
not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
against the actual underlying store.
```python
def may_contain(self, val: T) -> bool:
for f in self._hash_functions:
n = f(val) % BIT_COUNT
# If one of the bits is unset, the value is definitely not present
if not (self._bits & (1 << n)):
return False
# All bits were matched, `val` is likely to be part of the set
return True
```
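As a quick usage sketch (mine, not the post's), here is the filter with two
ad-hoc salted hash functions; a real use case would pick independent hash
functions and a larger `BIT_COUNT`:

```python
def h1(val: str) -> int:
    # Salted variant of Python's built-in hash, purely for illustration
    return hash(("salt-one", val))

def h2(val: str) -> int:
    return hash(("salt-two", val))

bloom = BloomFilter[str]([h1, h2])
bloom.insert("hello")

assert bloom.may_contain("hello")  # An inserted key is always reported
print(bloom.may_contain("world"))  # Most likely False, but could be a false positive
```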

View file

@@ -0,0 +1,159 @@
---
title: "Treap"
date: 2024-07-20T14:12:27+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "A simpler BST"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
graphviz: true
---
The [_Treap_][wiki] is a mix between a _Binary Search Tree_ and a _Heap_.
Like a _Binary Search Tree_, it keeps an ordered set of keys in the shape of a
tree, allowing for binary search traversal.
Like a _Heap_, it associates each node with a priority, making sure that a
parent's priority is always higher than any of its children.
[wiki]: https://en.wikipedia.org/wiki/Treap
<!--more-->
## What does it do?
By randomizing the priority value of each key at insertion time, we ensure a
high likelihood that the tree stays _roughly_ balanced, avoiding degeneration
into an unbalanced O(N) height.
Here's a sample tree created by inserting the integers from 0 to 250:
{{< graphviz file="treap.gv" />}}
## Implementation
I'll be keeping with the theme of this [series] by using Python to implement the
_Treap_. This leads to somewhat annoying code to handle the rotation process,
which is easier to do in C using pointers.
[series]: {{< ref "/series/cool-algorithms/" >}}
### Representation
Creating a new `Treap` is easy: the tree starts off empty, waiting for new nodes
to insert.
Each `Node` must keep track of the `key`, the mapped `value`, and the node's
`priority` (which is assigned randomly). Finally it must also allow for storing
two children (`left` and `right`).
```python
class Node[K, V]:
key: K
value: V
priority: float
left: Node[K, V] | None
right: Node[K, V] | None
def __init__(self, key: K, value: V):
# Store key and value, like a normal BST node
self.key = key
self.value = value
# Priority is derived randomly
self.priority = random()
self.left = None
self.right = None
class Treap[K, V]:
_root: Node[K, V] | None
def __init__(self):
# The tree starts out empty
self._root = None
```
### Search
Searching the tree is the same as in any other _Binary Search Tree_.
```python
def get(self, key: K) -> V | None:
node = self._root
# The usual BST traversal
while node is not None:
if node.key == key:
return node.value
elif node.key < key:
node = node.right
else:
node = node.left
return None
```
### Insertion
To insert a new `key` into the tree, we identify which leaf position it should
be inserted at. We then generate the node's priority, insert it at this
position, and rotate the node upwards until the heap property is respected.
```python
type ChildField = Literal["left", "right"]
def insert(self, key: K, value: V) -> bool:
# Empty treap base-case
if self._root is None:
self._root = Node(key, value)
# Signal that we're not overwriting the value
return False
# Keep track of the parent chain for rotation after insertion
parents = []
node = self._root
while node is not None:
# Update the value of a pre-existing key
if node.key == key:
node.value = value
return True
# Go down the tree, keep track of the path through the tree
field = "left" if key < node.key else "right"
parents.append((node, field))
node = getattr(node, field)
# Key wasn't found, we're inserting a new node
child = Node(key, value)
parent, field = parents[-1]
setattr(parent, field, child)
# Rotate the new node up until we respect the decreasing priority property
self._rotate_up(child, parents)
# Key wasn't found, signal that we inserted a new node
return False
def _rotate_up(
self,
node: Node[K, V],
parents: list[tuple[Node[K, V], ChildField]],
) -> None:
while parents:
parent, field = parents.pop()
# If the parent has higher priority, we're done rotating
if parent.priority >= node.priority:
break
# Check for grand-parent/root of tree edge-case
if parents:
# Update grand-parent to point to the new rotated node
grand_parent, grand_field = parents[-1]
setattr(grand_parent, grand_field, node)
else:
# Point the root to the new rotated node
self._root = node
other_field = "left" if field == "right" else "right"
# Rotate the node up
setattr(parent, field, getattr(node, other_field))
setattr(node, other_field, parent)
```
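Here is a small usage sketch (not from the post): even when keys are inserted in
sorted order, the randomized priorities keep the tree balanced with high
probability:

```python
treap = Treap[int, str]()
for key in range(100):
    treap.insert(key, f"value-{key}")

assert treap.get(42) == "value-42"
assert treap.get(1000) is None

# Inserting an existing key overwrites its value and returns True
assert treap.insert(42, "new-value") is True
assert treap.get(42) == "new-value"
```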

File diff suppressed because it is too large

View file

@@ -0,0 +1,146 @@
---
title: "Treap, revisited"
date: 2024-07-27T14:12:27+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "An even simpler BST"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
My [last post]({{< relref "../2024-07-20-treap/index.md" >}}) about the _Treap_
showed an implementation using tree rotations, as is commonly done with [AVL
Trees][avl] and [Red Black Trees][rb].
But the _Treap_ lends itself well to a simple and elegant implementation with no
tree rotations. This makes it especially easy to implement the removal of a key,
rather than the fiddly process of deletion using tree rotations.
[avl]: https://en.wikipedia.org/wiki/AVL_tree
[rb]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
<!--more-->
## Implementation
All operations on the tree will be implemented in terms of two fundamental
operations: `split` and `merge`.
We'll be reusing the same structures as in the last post, so let's skip straight
to implementing those fundamental operations, and building on them for `insert`
and `remove`.
### Split
Splitting a tree means taking a key, and getting the following output:
* a `left` node, root of the tree of all keys lower than the input.
* an extracted `node` which corresponds to the input `key`.
* a `right` node, root of the tree of all keys higher than the input.
```python
type OptionalNode[K, V] = Node[K, V] | None
class SplitResult(NamedTuple):
left: OptionalNode
node: OptionalNode
right: OptionalNode
def split(root: OptionalNode[K, V], key: K) -> SplitResult:
# Base case, empty tree
if root is None:
return SplitResult(None, None, None)
# If we found the key, simply extract left and right
if root.key == key:
left, right = root.left, root.right
root.left, root.right = None, None
return SplitResult(left, root, right)
# Otherwise, recurse on the corresponding side of the tree
if root.key < key:
left, node, right = split(root.right, key)
root.right = left
return SplitResult(root, node, right)
if key < root.key:
left, node, right = split(root.left, key)
root.left = right
return SplitResult(left, node, root)
raise RuntimeError("Unreachable")
```
### Merge
Merging a `left` and `right` tree means (cheaply) building a new tree containing
both of them. A pre-condition for merging is that the `left` tree is composed
entirely of nodes that are lower than any key in `right` (i.e: as in `left` and
`right` after a `split`).
```python
def merge(
left: OptionalNode[K, V],
right: OptionalNode[K, V],
) -> OptionalNode[K, V]:
# Base cases, left or right being empty
if left is None:
return right
if right is None:
return left
# Left has higher priority, it must become the root node
if left.priority >= right.priority:
# We recursively reconstruct its right sub-tree
left.right = merge(left.right, right)
return left
# Right has higher priority, it must become the root node
if left.priority < right.priority:
# We recursively reconstruct its left sub-tree
right.left = merge(left, right.left)
return right
raise RuntimeError("Unreachable")
```
### Insertion
Inserting a node into the tree is done in two steps:
1. `split` the tree to isolate the middle insertion point
2. `merge` it back up to form a full tree with the inserted key
```python
def insert(self, key: K, value: V) -> bool:
# `left` and `right` come before/after the key
left, node, right = split(self._root, key)
was_updated: bool
# Create the node, or update its value, if the key was already in the tree
if node is None:
node = Node(key, value)
was_updated = False
else:
node.value = value
was_updated = True
# Rebuild the tree with a couple of merge operations
self._root = merge(left, merge(node, right))
# Signal whether the key was already in the tree
return was_updated
```
### Removal
Removing a key from the tree is similar to inserting a new key, and forgetting
to insert it back: simply `split` the tree and `merge` it back without the
extracted middle node.
```python
def remove(self, key: K) -> bool:
# `node` contains the key, or `None` if the key wasn't in the tree
left, node, right = split(self._root, key)
# Put the tree back together, without the extracted node
self._root = merge(left, right)
# Signal whether `key` was mapped in the tree
return node is not None
```
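A short sketch of the rotation-free treap in action (mine, not the post's),
assuming the same `Treap` wrapper class as last time, with `insert` and `remove`
built on top of `split` and `merge`:

```python
treap = Treap[int, str]()
for key in (3, 1, 4, 1, 5):
    treap.insert(key, f"value-{key}")

assert treap.remove(4) is True   # 4 was in the tree
assert treap.remove(9) is False  # 9 never was
assert treap.remove(4) is False  # Removing the same key twice does nothing
```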

View file

@@ -0,0 +1,145 @@
---
title: "Reservoir Sampling"
date: 2024-08-02T18:30:56+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Elegantly sampling a stream"
tags:
- algorithms
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
mathjax: true
---
[_Reservoir Sampling_][reservoir] is an [online][online], probabilistic
algorithm to uniformly sample $k$ random elements out of a stream of values.
It's a particularly elegant and small algorithm, only requiring $\Theta(k)$
space and a single pass through the stream.
[reservoir]: https://en.wikipedia.org/wiki/Reservoir_sampling
[online]: https://en.wikipedia.org/wiki/Online_algorithm
<!--more-->
## Sampling one element
As an introduction, we'll first focus on fairly sampling one element from the
stream.
```python
def sample_one[T](stream: Iterable[T]) -> T:
stream_iter = iter(stream)
# Sample the first element
res = next(stream_iter)
for i, val in enumerate(stream_iter, start=1):
j = random.randint(0, i)
# Replace the sampled element with probability 1/(i + 1)
if j == 0:
res = val
# Return the randomly sampled element
return res
```
### Proof
Let's now prove that this algorithm leads to a fair sampling of the stream.
We'll be doing a proof by induction.
#### Hypothesis $H_N$
After iterating through the first $N$ items in the stream,
each of them has had an equal $\frac{1}{N}$ probability of being selected as
`res`.
#### Base Case $H_1$
We can trivially observe that the first element is always assigned to `res`,
and $\frac{1}{1} = 1$, so the hypothesis is verified.
#### Inductive Case
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of the loop iteration where `i = N` (i.e: the observation of the $N + 1$-th item
in the stream).
stream).
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have two cases:
* `j == 0`, with probability $\frac{1}{N + 1}$: we select `val` as the new
reservoir element `res`.
* `j != 0`, with probability $\frac{N}{N + 1}$: we keep the previous value of
`res`. By $H_N$, any of the first $N$ elements had a $\frac{1}{N}$ probability
of being `res` at the start of the loop, so each element now has a
probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being
`res`.
And thus, we have proven $H_{N + 1}$ at the end of the loop.
## Sampling $k$ elements
The code for sampling $k$ elements is very similar to the one-element case.
```python
def sample[T](stream: Iterable[T], k: int = 1) -> list[T]:
stream_iter = iter(stream)
# Retain the first 'k' elements in the reservoir
res = list(itertools.islice(stream_iter, k))
for i, val in enumerate(stream_iter, start=k):
j = random.randint(0, i)
# Replace one element at random with probability k/(i + 1)
if j < k:
res[j] = val
# Return 'k' randomly sampled elements
return res
```
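A quick usage sketch (not part of the post), sampling from a stream that is only
iterated once:

```python
# Sample 5 values out of a million-element stream, in a single pass
picks = sample(range(1_000_000), k=5)

assert len(picks) == 5
assert all(0 <= x < 1_000_000 for x in picks)
```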
### Proof
Let us once again do a proof by induction, assuming the stream contains at least
$k$ items.
#### Hypothesis $H_N$
After iterating through the first $N$ items in the stream, each of them has had
an equal $\frac{k}{N}$ probability of being sampled from the stream.
#### Base Case $H_k$
We can trivially observe that the first $k$ elements are sampled at the start of
the algorithm, and $\frac{k}{k} = 1$, so the hypothesis is verified.
#### Inductive Case
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of the loop iteration where `i = N`, in order to prove $H_{N + 1}$.
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have three cases:
* `j >= k`, with probability $1 - \frac{k}{N + 1}$: we do not modify the
sampled reservoir at all.
* `j < k`, with probability $\frac{k}{N + 1}$: we sample the new element to
replace the `j`-th element of the reservoir. Therefore for any element
$e \in [0, k[$ we can either have:
* $j = e$: the element _is_ replaced, probability $\frac{1}{k}$.
* $j \neq e$: the element is _not_ replaced, probability $\frac{k - 1}{k}$.
We can now compute the probability that a previously sampled element is kept in
the reservoir:
$1 - \frac{k}{N + 1} + \frac{k}{N + 1} \cdot \frac{k - 1}{k} = \frac{N}{N + 1}$.
By $H_N$, any of the first $N$ elements had a $\frac{k}{N}$ probability
of being sampled at the start of the loop, so each element now has a
probability $\frac{k}{N} \cdot \frac{N}{N + 1} = \frac{k}{N + 1}$ of remaining
in the sample.
We have now proven that all elements have a probability $\frac{k}{N + 1}$ of
being sampled at the end of the loop, therefore $H_{N + 1}$ has been verified.

View file

@@ -3,11 +3,11 @@
"flake-compat": {
"flake": false,
"locked": {
"lastModified": 1673956053,
"narHash": "sha256-4gtG9iQuiKITOjNQQeQIpoIB6b16fm+504Ch3sNKLd8=",
"lastModified": 1696426674,
"narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
"owner": "edolstra",
"repo": "flake-compat",
"rev": "35bb57c0c8d8b62bbfd284272c928ceb64ddbde9",
"rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
"type": "github"
},
"original": {
@@ -21,11 +21,11 @@
"systems": "systems"
},
"locked": {
"lastModified": 1689068808,
"narHash": "sha256-6ixXo3wt24N/melDWjq70UuHQLxGV8jZvooRanIHXw0=",
"lastModified": 1710146030,
"narHash": "sha256-SZ5L6eA7HJ/nmkzGG7/ISclqe6oZdOZTNoesiInkXPQ=",
"owner": "numtide",
"repo": "flake-utils",
"rev": "919d646de7be200f3bf08cb76ae1f09402b6f9b4",
"rev": "b1d9ab70662946ef0850d488da1c9019f3a9752a",
"type": "github"
},
"original": {
@@ -43,11 +43,11 @@
]
},
"locked": {
"lastModified": 1660459072,
"narHash": "sha256-8DFJjXG8zqoONA1vXtgeKXy68KdJL5UaXR8NtVMUbx8=",
"lastModified": 1709087332,
"narHash": "sha256-HG2cCnktfHsKV0s4XW83gU3F57gaTljL9KNSuG6bnQs=",
"owner": "hercules-ci",
"repo": "gitignore.nix",
"rev": "a20de23b925fd8264fd7fad6454652e142fd7f73",
"rev": "637db329424fd7e46cf4185293b9cc8c88c95394",
"type": "github"
},
"original": {
@@ -58,11 +58,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1691155369,
"narHash": "sha256-CIuJO5pgwCMsZM8flIU2OiZ79QfDCesXPsAiokCzlNM=",
"lastModified": 1722415718,
"narHash": "sha256-5US0/pgxbMksF92k1+eOa8arJTJiPvsdZj9Dl+vJkM4=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "7d050b98e51cdbdd88ad960152d398d41c7ff5b4",
"rev": "c3392ad349a5227f4a3464dce87bcc5046692fce",
"type": "github"
},
"original": {
@@ -75,9 +75,6 @@
"pre-commit-hooks": {
"inputs": {
"flake-compat": "flake-compat",
"flake-utils": [
"futils"
],
"gitignore": "gitignore",
"nixpkgs": [
"nixpkgs"
@@ -87,11 +84,11 @@
]
},
"locked": {
"lastModified": 1691093055,
"narHash": "sha256-sjNWYpDHc6vx+/M0WbBZKltR0Avh2S43UiDbmYtfHt0=",
"lastModified": 1721042469,
"narHash": "sha256-6FPUl7HVtvRHCCBQne7Ylp4p+dpP3P/OYuzjztZ4s70=",
"owner": "cachix",
"repo": "pre-commit-hooks.nix",
"rev": "ebb43bdacd1af8954d04869c77bc3b61fde515e4",
"rev": "f451c19376071a90d8c58ab1a953c6e9840527fd",
"type": "github"
},
"original": {

View file

@@ -22,7 +22,6 @@
repo = "pre-commit-hooks.nix";
ref = "master";
inputs = {
flake-utils.follows = "futils";
nixpkgs.follows = "nixpkgs";
nixpkgs-stable.follows = "nixpkgs";
};

View file

@@ -3,6 +3,30 @@
<link rel="stylesheet" type="text/css" href="https://tikzjax.com/v1/fonts.css">
<script async src="https://tikzjax.com/v1/tikzjax.js"></script>
{{ end }}
<!-- Graphviz support -->
{{ if (.Params.graphviz) }}
<script src="https://cdn.jsdelivr.net/npm/@viz-js/viz@3.7.0/lib/viz-standalone.min.js"></script>
<script type="text/javascript">
(function() {
Viz.instance().then(function(viz) {
Array.prototype.forEach.call(document.querySelectorAll("pre.graphviz"), function(x) {
var svg = viz.renderSVGElement(x.innerText);
// Let CSS take care of the SVG size
svg.removeAttribute("width")
svg.setAttribute("height", "auto")
x.replaceChildren(svg)
})
})
})();
</script>
{{ end }}
<!-- Mermaid support -->
{{ if (.Params.mermaid) }}
<script type="module" async>
import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@latest/dist/mermaid.esm.min.mjs";
mermaid.initialize({ startOnLoad: true });
</script>
{{ end }}
{{ with .OutputFormats.Get "atom" -}}
{{ printf `<link rel="%s" type="%s" href="%s" title="%s" />` .Rel .MediaType.Type .Permalink $.Site.Title | safeHTML }}
{{ end -}}

View file

@@ -0,0 +1,16 @@
<pre class="graphviz">
{{ with .Get "file" }}
{{ if eq (. | printf "%.1s") "/" }}
{{/* Absolute paths are from the root of the site. */}}
{{ $.Scratch.Set "filepath" . }}
{{ else }}
{{/* Relative paths are from page directory. */}}
{{ $.Scratch.Set "filepath" $.Page.File.Dir }}
{{ $.Scratch.Add "filepath" . }}
{{ end }}
{{ $.Scratch.Get "filepath" | readFile }}
{{ else }}
{{.Inner}}
{{ end }}
</pre>

View file

@@ -0,0 +1,16 @@
<pre class="mermaid">
{{ with .Get "file" }}
{{ if eq (. | printf "%.1s") "/" }}
{{/* Absolute paths are from the root of the site. */}}
{{ $.Scratch.Set "filepath" . }}
{{ else }}
{{/* Relative paths are from page directory. */}}
{{ $.Scratch.Set "filepath" $.Page.File.Dir }}
{{ $.Scratch.Add "filepath" . }}
{{ end }}
{{ $.Scratch.Get "filepath" | readFile }}
{{ else }}
{{.Inner}}
{{ end }}
</pre>

View file

@@ -1,3 +1,16 @@
<script type="text/tikz">
{{.Inner}}
{{ with .Get "file" }}
{{ if eq (. | printf "%.1s") "/" }}
{{/* Absolute paths are from the root of the site. */}}
{{ $.Scratch.Set "filepath" . }}
{{ else }}
{{/* Relative paths are from page directory. */}}
{{ $.Scratch.Set "filepath" $.Page.File.Dir }}
{{ $.Scratch.Add "filepath" . }}
{{ end }}
{{ $.Scratch.Get "filepath" | readFile }}
{{ else }}
{{.Inner}}
{{ end }}
</script>