Compare commits

...

40 commits

Author SHA1 Message Date
Bruno BELANYI 9208b4b874 nix: bump inputs
Some checks failed
ci/woodpecker/push/deploy/2 Pipeline failed
2024-08-10 11:56:04 +01:00
Bruno BELANYI 11db5a27b9 Add Reservoir Sampling post 2024-08-10 11:56:04 +01:00
Bruno BELANYI cc4440c946 Add Treap Revisited post 2024-08-10 11:56:04 +01:00
Bruno BELANYI 7bc3d5c18f posts: reservoir-sampling: add k-element sampling 2024-08-10 11:56:04 +01:00
Bruno BELANYI 806772d883 posts: gap-buffer: fix typo 2024-08-10 11:56:04 +01:00
Bruno BELANYI 883f0e7e9b posts: treap: add removal 2024-08-10 11:56:04 +01:00
Bruno BELANYI 9ff4a07c9b posts: reservoir-sampling: add one-element sample 2024-08-10 11:56:04 +01:00
Bruno BELANYI eff8152307 posts: union-find: fix typo 2024-08-10 11:56:04 +01:00
Bruno BELANYI 652fe81c41 posts: treap-revisited: add insertion 2024-08-10 11:56:04 +01:00
Bruno BELANYI 3605445bcf posts: add 'reservoir-sampling' 2024-08-10 11:56:04 +01:00
Bruno BELANYI 476322a627 Add Treap post 2024-08-10 11:56:04 +01:00
Bruno BELANYI 0798812f86 posts: treap: add merge 2024-08-10 11:56:04 +01:00
Bruno BELANYI cd24e9692a markdownlint: relax duplicate header check 2024-08-10 11:56:04 +01:00
Bruno BELANYI 5a233e7384 layouts: add Mermaid support
Similar to Graphviz and TikZ support.
2024-08-10 11:56:04 +01:00
Bruno BELANYI dea81f1859 posts: treap: add insertion 2024-08-10 11:56:04 +01:00
Bruno BELANYI d33247b786 posts: treap-revisited: add split 2024-08-10 11:56:04 +01:00
Bruno BELANYI 62cd0759cf config: enable MathJax 2024-08-10 11:56:04 +01:00
Bruno BELANYI 87ef9dd38c layouts: add Graphviz support
Similar to TikZ support.
2024-08-10 11:56:04 +01:00
Bruno BELANYI 2eaa9c4329 posts: treap: add search 2024-08-10 11:56:04 +01:00
Bruno BELANYI 19b535ce49 posts: treap-revisited: add implementation 2024-08-10 11:56:04 +01:00
Bruno BELANYI a6bbb10098 layouts: tikz: allow using file input
Makes it easier to handle big diagrams.
2024-08-10 11:56:04 +01:00
Bruno BELANYI e842737cb6 posts: treap: add construction 2024-08-10 11:56:04 +01:00
Bruno BELANYI 21fbc24e02 posts: add 'treap-revisited' 2024-08-10 11:56:04 +01:00
Bruno BELANYI 879b671332 Add Bloom Filter post 2024-08-10 11:56:04 +01:00
Bruno BELANYI 9ff51fe82e posts: treap: add presentation 2024-08-10 11:56:04 +01:00
Bruno BELANYI 9ef33b7ff8 Add Gap Buffer post 2024-08-10 11:56:04 +01:00
Bruno BELANYI c97d83d883 posts: bloom-filter: add lookup 2024-08-10 11:56:04 +01:00
Bruno BELANYI 768acac4ae posts: add treap 2024-08-10 11:56:04 +01:00
Bruno BELANYI e8acb49b53 posts: gap-buffer: add movement 2024-08-10 11:56:04 +01:00
Bruno BELANYI 114ca1de50 posts: bloom-filter: add insertion 2024-08-10 11:56:04 +01:00
Bruno BELANYI 11138dafd1 posts: gap-buffer: add deletion 2024-08-10 11:56:04 +01:00
Bruno BELANYI 84ce6ea494 posts: bloom-filter: add construction 2024-08-10 11:56:04 +01:00
Bruno BELANYI dbbcd528c3 posts: gap-buffer: add insertion 2024-08-10 11:56:04 +01:00
Bruno BELANYI 4abcd27ee7 posts: bloom-filter: add presentation 2024-08-10 11:56:04 +01:00
Bruno BELANYI 06c4a03a42 posts: gap-buffer: add growth 2024-08-10 11:56:04 +01:00
Bruno BELANYI 4da83c9716 posts: add bloom-filter 2024-08-10 11:56:04 +01:00
Bruno BELANYI 408b74daf7 posts: gap-buffer: add accessors 2024-08-10 11:56:04 +01:00
Bruno BELANYI a9f003f4ee posts: gap-buffer: add construction 2024-08-10 11:56:04 +01:00
Bruno BELANYI 51a1bd01cd posts: gap-buffer: add presentation 2024-08-10 11:56:04 +01:00
Bruno BELANYI f2fa93ad8b posts: add gap-buffer 2024-08-10 11:56:04 +01:00
16 changed files with 1857 additions and 21 deletions

3
.markdownlint.yaml Normal file
View file

@ -0,0 +1,3 @@
# MD024/no-duplicate-heading/no-duplicate-header
MD024:
siblings_only: true

View file

@ -67,6 +67,7 @@ params:
webmentions:
login: belanyi.fr
pingback: true
mathjax: true
taxonomies:
category: "categories"

View file

@ -8,6 +8,8 @@ tags:
categories:
favorite: false
tikz: true
graphviz: true
mermaid: true
---
## Test post please ignore
@ -40,6 +42,29 @@ echo hello world | cut -d' ' -f 1
\end{tikzpicture}
{{% /tikz %}}
### Graphviz support
{{% graphviz %}}
graph {
a -- b
b -- c
c -- a
}
{{% /graphviz %}}
### Mermaid support
{{% mermaid %}}
graph TD
A[Enter Chart Definition] --> B(Preview)
B --> C{decide}
C --> D[Keep]
C --> E[Edit Definition]
E --> B
D --> F[Save Image and Code]
F --> B
{{% /graphviz %}}
### Spoilers
{{% spoiler "Don't open me" %}}

View file

@ -15,7 +15,7 @@ favorite: false
disable_feed: false
---
To kickoff the [series]({{< ref "/series/cool-algorithms/">}}) of posts about
To kickoff the [series]({{< ref "/series/cool-algorithms/" >}}) of posts about
algorithms and data structures I find interesting, I will be talking about my
favorite one: the [_Disjoint Set_][wiki]. Also known as the _Union-Find_ data
structure, so named because of its two main operations: `ds.union(lhs, rhs)` and

View file

@ -0,0 +1,191 @@
---
title: "Gap Buffer"
date: 2024-07-06T21:27:19+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "As featured in GNU Emacs"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
The [_Gap Buffer_][wiki] is a popular data structure for text editors to
represent files and editable buffers. The most famous of them probably being
[GNU Emacs][emacs].
[wiki]: https://en.wikipedia.org/wiki/Gap_buffer
[emacs]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html
<!--more-->
## What does it do?
A _Gap Buffer_ is simply a list of characters, similar to a normal string, with
the added twist of splitting it into two side: the prefix and suffix, on either
side of the cursor. In between them, a gap is left to allow for quick
insertion at the cursor.
Moving the cursor moves the gap around the buffer, the prefix and suffix getting
shorter/longer as required.
## Implementation
I'll be writing a sample implementation in Python, as with the rest of the
[series]({{< ref "/series/cool-algorithms/" >}}). I don't think it showcases the
elegance of the _Gap Buffer_ in action like a C implementation full of
`memmove`s would, but it does makes it short and sweet.
### Representation
We'll be representing the gap buffer as an actual list of characters.
Given that Python doesn't _have_ characters, let's settle for a list of strings,
each representing a single character...
```python
Char = str
class GapBuffer:
# List of characters, contains prefix and suffix of string with gap in the middle
_buf: list[Char]
# The gap is contained between [start, end) (i.e: buf[start:end])
_gap_start: int
_gap_end: int
# Visual representation of the gap buffer:
# This is a very [ ]long string.
# |<----------------------------------------------->| capacity
# |<------------>| |<-------->| string
# |<------------------->| gap
# |<------------>| prefix
# |<-------->| suffix
def __init__(self, initial_capacity: int = 16) -> None:
assert initial_capacity > 0
# Initialize an empty gap buffer
self._buf = [""] * initial_capacity
self._gap_start = 0
self._gap_end = initial_capacity
```
### Accessors
I'm mostly adding these for exposition, and making it easier to write `assert`s
later.
```python
@property
def capacity(self) -> int:
return len(self._buf)
@property
def gap_length(self) -> int:
return self._gap_end - self._gap_start
@property
def string_length(self) -> int:
return self.capacity - self.gap_length
@property
def prefix_length(self) -> int:
return self._gap_start
@property
def suffix_length(self) -> int:
return self.capacity - self._gap_end
```
### Growing the buffer
I've written this method in a somewhat non-idiomatic manner, to make it closer
to how it would look in C using `realloc` instead.
It would be more efficient to use slicing to insert the needed extra capacity
directly, instead of making a new buffer and copying characters over.
```python
def grow(self, capacity: int) -> None:
assert capacity >= self.capacity
# Create a new buffer with the new capacity
new_buf = [""] * capacity
# Move the prefix/suffix to their place in the new buffer
added_capacity = capacity - len(self._buf)
new_buf[: self._gap_start] = self._buf[: self._gap_start]
new_buf[self._gap_end + added_capacity :] = self._buf[self._gap_end :]
# Use the new buffer, account for added capacity
self._buf = new_buf
self._gap_end += added_capacity
```
### Insertion
Inserting text at the cursor's position means filling up the gap in the middle
of the buffer. To do so we must first make sure that the gap is big enough, or
grow the buffer accordingly.
Then inserting the text is simply a matter of copying its characters in place,
and moving the start of the gap further right.
```python
def insert(self, val: str) -> None:
# Ensure we have enouh space to insert the whole string
if len(val) > self.gap_length:
self.grow(max(self.capacity * 2, self.string_length + len(val)))
# Fill the gap with the given string
self._buf[self._gap_start : self._gap_start + len(val)] = val
self._gap_start += len(val)
```
### Deletion
Removing text from the buffer simply expands the gap in the corresponding
direction, shortening the string's prefix/suffix. This makes it very cheap.
The methods are named after the `backspace` and `delete` keys on the keyboard.
```python
def backspace(self, dist: int = 1) -> None:
assert dist <= self.prefix_length
# Extend gap to the left
self._gap_start -= dist
def delete(self, dist: int = 1) -> None:
assert dist <= self.suffix_length
# Extend gap to the right
self._gap_end += dist
```
### Moving the cursor
Moving the cursor along the buffer will shift letters from one side of the gap
to the other, moving them accross from prefix to suffix and back.
I find Python's list slicing not quite as elegant to read as a `memmove`, though
it does make for a very small and efficient implementation.
```python
def left(self, dist: int = 1) -> None:
assert dist <= self.prefix_length
# Shift the needed number of characters from end of prefix to start of suffix
self._buf[self._gap_end - dist : self._gap_end] = self._buf[
self._gap_start - dist : self._gap_start
]
# Adjust indices accordingly
self._gap_start -= dist
self._gap_end -= dist
def right(self, dist: int = 1) -> None:
assert dist <= self.suffix_length
# Shift the needed number of characters from start of suffix to end of prefix
self._buf[self._gap_start : self._gap_start + dist] = self._buf[
self._gap_end : self._gap_end + dist
]
# Adjust indices accordingly
self._gap_start += dist
self._gap_end += dist
```

View file

@ -0,0 +1,97 @@
---
title: "Bloom Filter"
date: 2024-07-14T17:46:40+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Probably cool"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
The filter can be used as an inexpensive first step when querying the actual
data is quite costly (e.g: as a first check for expensive cache lookups or large
data seeks).
[wiki]: https://en.wikipedia.org/wiki/Bloom_filter
<!--more-->
## What does it do?
A _Bloom Filter_ can be understood as a hash-set which can either tell you:
* An element is _not_ part of the set.
* An element _may be_ part of the set.
More specifically, one can tweak the parameters of the filter to make it so that
the _false positive_ rate of membership is quite low.
I won't be going into those calculations here, but they are quite trivial to
compute, or one can just look up appropriate values for their use case.
## Implementation
I'll be using Python, which has the nifty ability of representing bitsets
through its built-in big integers quite easily.
We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
tweaked to use a different number, or even change it at construction time.
### Representation
A `BloomFilter` is just a set of bits and a list of hash functions.
```python
BIT_COUNT = 64
class BloomFilter[T]:
_bits: int
_hash_functions: list[Callable[[T], int]]
def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
# Filter is initially empty
self._bits = 0
self._hash_functions = hash_functions
```
### Inserting a key
To add an element to the filter, we take the output from each hash function and
use that to set a bit in the filter. This combination of bit will identify the
element, which we can use for lookup later.
```python
def insert(self, val: T) -> None:
# Iterate over each hash
for f in self._hash_functions:
n = f(val) % BIT_COUNT
# Set the corresponding bit
self._bit |= 1 << n
```
### Querying a key
Because the _Bloom Filter_ does not actually store its elements, but some
derived data from hashing them, it can only definitely say if an element _does
not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
against the actual underlying store.
```python
def may_contain(self, val: T) -> bool:
for f in self._hash_functions:
n = f(val) % BIT_COUNT
# If one of the bits is unset, the value is definitely not present
if not (self._bit & (1 << n)):
return False
# All bits were matched, `val` is likely to be part of the set
return True
```

View file

@ -0,0 +1,159 @@
---
title: "Treap"
date: 2024-07-20T14:12:27+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "A simpler BST"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
graphviz: true
---
The [_Treap_][wiki] is a mix between a _Binary Search Tree_ and a _Heap_.
Like a _Binary Search Tree_, it keeps an ordered set of keys in the shape of a
tree, allowing for binary search traversal.
Like a _Heap_, it associates each node with a priority, making sure that a
parent's priority is always higher than any of its children.
[wiki]: https://en.wikipedia.org/wiki/Treap
<!--more-->
## What does it do?
By randomizing the priority value of each key at insertion time, we ensure a
high likelihood that the tree stays _roughly_ balanced, avoiding degenerating to
unbalanced O(N) height.
Here's a sample tree created by inserting integers from 0 to 250 into the tree:
{{< graphviz file="treap.gv" />}}
## Implementation
I'll be keeping the theme for this [series] by using Python to implement the
_Treap_. This leads to somewhat annoying code to handle the rotation process,
which is easier to do in C using pointers.
[series]: {{< ref "/series/cool-algorithms/" >}}
### Representation
Creating a new `Treap` is easy: the tree starts off empty, waiting for new nodes
to insert.
Each `Node` must keep track of the `key`, the mapped `value`, and the node's
`priority` (which is assigned randomly). Finally it must also allow for storing
two children (`left` and `right`).
```python
class Node[K, V]:
key: K
value: V
priority: float
left: Node[K, V] | None
righg: Node[K, V] | None
def __init__(self, key: K, value: V):
# Store key and value, like a normal BST node
self.key = key
self.value = value
# Priority is derived randomly
self.priority = random()
self.left = None
self.right = None
class Treap[K, V]:
_root: Node[K, V] | None
def __init__(self):
# The tree starts out empty
self._root = None
```
### Search
Searching the tree is the same as in any other _Binary Search Tree_.
```python
def get(self, key: K) -> T | None:
node = self._root
# The usual BST traversal
while node is not None:
if node.key == key:
return node.value
elif node.key < key:
node = node.right
else:
node = node.left
return None
```
### Insertion
To insert a new `key` into the tree, we identify which leaf position it should
be inserted at. We then generate the node's priority, insert it at this
position, and rotate the node upwards until the heap property is respected.
```python
type ChildField = Literal["left, right"]
def insert(self, key: K, value: V) -> bool:
# Empty treap base-case
if self._root is None:
self._root = Node(key, value)
# Signal that we're not overwriting the value
return False
# Keep track of the parent chain for rotation after insertion
parents = []
node = self._root
while node is not None:
# Insert a pre-existing key
if node.key == key:
node.value = value
return True
# Go down the tree, keep track of the path through the tree
field = "left" if key < node.key else "right"
parents.append((node, field))
node = getattr(node, field)
# Key wasn't found, we're inserting a new node
child = Node(key, value)
parent, field = parents[-1]
setattr(parent, field, child)
# Rotate the new node up until we respect the decreasing priority property
self._rotate_up(child, parents)
# Key wasn't found, signal that we inserted a new node
return False
def _rotate_up(
self,
node: Node[K, V],
parents: list[tuple[Node[K, V], ChildField]],
) -> None:
while parents:
parent, field = parents.pop()
# If the parent has higher priority, we're done rotating
if parent.priority >= node.priority:
break
# Check for grand-parent/root of tree edge-case
if parents:
# Update grand-parent to point to the new rotated node
grand_parent, field = parents[-1]
setattr(grand_parent, field, node)
else:
# Point the root to the new rotated node
self._root = node
other_field = "left" if field == "right" else "right"
# Rotate the node up
setattr(parent, field, getattr(node, other_field))
setattr(node, other_field, parent)
```

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,146 @@
---
title: "Treap, revisited"
date: 2024-07-27T14:12:27+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "An even simpler BST"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
My [last post]({{< relref "../2024-07-20-treap/index.md" >}}) about the _Treap_
showed an implementation using tree rotations, as is commonly done with [AVL
Trees][avl] and [Red Black Trees][rb].
But the _Treap_ lends itself well to a simple and elegant implementation with no
tree rotations. This makes it especially easy to implement the removal of a key,
rather than the fiddly process of deletion using tree rotations.
[avl]: https://en.wikipedia.org/wiki/AVL_tree
[rb]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
<!--more-->
## Implementation
All operations on the tree will be implemented in terms of two fundamental
operations: `split` and `merge`.
We'll be reusing the same structures as in the last post, so let's skip straight
to implementing those fundaments, and building on them for `insert` and
`delete`.
### Split
Splitting a tree means taking a key, and getting the following output:
* a `left` node, root of the tree of all keys lower than the input.
* an extracted `node` which corresponds to the input `key`.
* a `right` node, root of the tree of all keys higher than the input.
```python
type OptionalNode[K, V] = Node[K, V] | None
class SplitResult(NamedTuple):
left: OptionalNode
node: OptionalNode
right: OptionalNode
def split(root: OptionalNode[K, V], key: K) -> SplitResult:
# Base case, empty tree
if root is None:
return SplitResult(None, None, None)
# If we found the key, simply extract left and right
if root.key == key:
left, right = root.left, root.right
root.left, root.right = None, None
return SplitResult(left, root, right)
# Otherwise, recurse on the corresponding side of the tree
if root.key < key:
left, node, right = split(root.right, key)
root.right = left
return SplitResult(root, node, right)
if key < root.key:
left, node, right = split(root.left, key)
root.left = right
return SplitResult(left, node, root)
raise RuntimeError("Unreachable")
```
### Merge
Merging a `left` and `right` tree means (cheaply) building a new tree containing
both of them. A pre-condition for merging is that the `left` tree is composed
entirely of nodes that are lower than any key in `right` (i.e: as in `left` and
`right` after a `split`).
```python
def merge(
left: OptionalNode[K, V],
right: OptionalNode[K, V],
) -> OptionalNode[K, V]:
# Base cases, left or right being empty
if left is None:
return right
if right is None:
return left
# Left has higher priority, it must become the root node
if left.priority >= right.priority:
# We recursively reconstruct its right sub-tree
left.right = merge(left.right, right)
return left
# Right has higher priority, it must become the root node
if left.priority < right.priority:
# We recursively reconstruct its left sub-tree
right.left = merge(left, right.left)
return right
raise RuntimeError("Unreachable")
```
### Insertion
Inserting a node into the tree is done in two steps:
1. `split` the tree to isolate the middle insertion point
2. `merge` it back up to form a full tree with the inserted key
```python
def insert(self, key: K, value: V) -> bool:
# `left` and `right` come before/after the key
left, node, right = split(self._root, key)
was_updated: bool
# Create the node, or update its value, if the key was already in the tree
if node is None:
node = Node(key, value)
was_updated = False
else:
node.value = value
was_updated = True
# Rebuild the tree with a couple of merge operations
self._root = merge(left, merge(node, right))
# Signal whether the key was already in the key
return was_updated
```
### Removal
Removing a key from the tree is similar to inserting a new key, and forgetting
to insert it back: simply `split` the tree and `merge` it back without the
extracted middle node.
```python
def remove(self, key: K) -> bool:
# `node` contains the key, or `None` if the key wasn't in the tree
left, node, right = split(self._root, key)
# Put the tree back together, without the extract node
self._root = merge(left, right)
# Signal whether `key` was mapped in the tree
return node is not None
```

View file

@ -0,0 +1,145 @@
---
title: "Reservoir Sampling"
date: 2024-08-02T18:30:56+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Elegantly sampling a stream"
tags:
- algorithms
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
mathjax: true
---
[_Reservoir Sampling_][reservoir] is an [online][online], probabilistic
algorithm to uniformly sample $k$ random elements out of a stream of values.
It's a particularly elegant and small algorithm, only requiring $\Theta(k)$
amount of space and a single pass through the stream.
[reservoir]: https://en.wikipedia.org/wiki/Reservoir_sampling
[online]: https://en.wikipedia.org/wiki/Online_algorithm
<!--more-->
## Sampling one element
As an introduction, we'll first focus on fairly sampling one element from the
stream.
```python
def sample_one[T](stream: Iterable[T]) -> T:
stream_iter = iter(stream)
# Sample the first element
res = next(stream_iter)
for i, val in enumerate(stream_iter, start=1):
j = random.randint(0, i)
# Replace the sampled element with probability 1/(i + 1)
if j == 0:
res = val
# Return the randomly sampled element
return res
```
### Proof
Let's now prove that this algorithm leads to a fair sampling of the stream.
We'll be doing proof by induction.
#### Hypothesis $H_N$
After iterating through the first $N$ items in the stream,
each of them has had an equal $\frac{1}{N}$ probability of being selected as
`res`.
#### Base Case $H_1$
We can trivially observe that the first element is always assigned to `res`,
$\frac{1}{1} = 1$, the hypothesis has been verified.
#### Inductive Case
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of loop iteration where `i = N` (i.e: observation of the $N + 1$-th item in the
stream).
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have two cases:
* `j == 0`, with probability $\frac{1}{N + 1}$: we select `val` as the new
reservoir element `res`.
* `j != 0`, with probability $\frac{N}{N + 1}$: we keep the previous value of
`res`. By $H_N$, any of the first $N$ elements had a $\frac{1}{N}$ probability
of being `res` before at the start of the loop, each element now has a
probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being the
element.
And thus, we have proven $H_{N + 1}$ at the end of the loop.
## Sampling $k$ element
The code for sampling $k$ elements is very similar to the one-element case.
```python
def sample[T](stream: Iterable[T], k: int = 1) -> list[T]:
stream_iter = iter(stream)
# Retain the first 'k' elements in the reservoir
res = list(itertools.islice(stream_iter, k))
for i, val in enumerate(stream_iter, start=k):
j = random.randint(0, i)
# Replace one element at random with probability k/(i + 1)
if j < k:
res[j] = val
# Return 'k' randomly sampled elements
return res
```
### Proof
Let us once again do a proof by induction, assuming the stream contains at least
$k$ items.
#### Hypothesis $H_N$
After iterating through the first $N$ items in the stream, each of them has had
an equal $\frac{k}{N}$ probability of being sampled from the stream.
#### Base Case $H_k$
We can trivially observe that the first $k$ element are sampled at the start of
the algorithm, $\frac{k}{k} = 1$, the hypothesis has been verified.
#### Inductive Case
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of the loop iteration where `i = N`, in order to prove $H_{N + 1}$.
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have three cases:
* `j >= k`, with probability $1 - \frac{k}{N + 1}$: we do not modify the
sampled reservoir at all.
* `j < k`, with probability $\frac{k}{N + 1}$: we sample the new element to
replace the `j`-th element of the reservoir. Therefore for any element
$e \in [0, k[$ we can either have:
* $j = e$: the element _is_ replaced, probability $\frac{1}{k}$.
* $j \neq e$: the element is _not_ replaced, probability $\frac{k - 1}{k}$.
We can now compute the probability that a previously sampled element is kept in
the reservoir:
$1 - \frac{k}{N + 1} + \frac{k}{N + 1} \cdot \frac{k - 1}{k} = \frac{N}{N + 1}$.
By $H_N$, any of the first $N$ elements had a $\frac{k}{N}$ probability
of being sampled before at the start of the loop, each element now has a
probability $\frac{k}{N} \cdot \frac{N}{N + 1} = \frac{k}{N + 1}$ of being the
element.
We have now proven that all elements have a probability $\frac{k}{N + 1}$ of
being sampled at the end of the loop, therefore $H_{N + 1}$ has been verified.

View file

@ -3,11 +3,11 @@
"flake-compat": {
"flake": false,
"locked": {
"lastModified": 1673956053,
"narHash": "sha256-4gtG9iQuiKITOjNQQeQIpoIB6b16fm+504Ch3sNKLd8=",
"lastModified": 1696426674,
"narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
"owner": "edolstra",
"repo": "flake-compat",
"rev": "35bb57c0c8d8b62bbfd284272c928ceb64ddbde9",
"rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
"type": "github"
},
"original": {
@ -21,11 +21,11 @@
"systems": "systems"
},
"locked": {
"lastModified": 1689068808,
"narHash": "sha256-6ixXo3wt24N/melDWjq70UuHQLxGV8jZvooRanIHXw0=",
"lastModified": 1710146030,
"narHash": "sha256-SZ5L6eA7HJ/nmkzGG7/ISclqe6oZdOZTNoesiInkXPQ=",
"owner": "numtide",
"repo": "flake-utils",
"rev": "919d646de7be200f3bf08cb76ae1f09402b6f9b4",
"rev": "b1d9ab70662946ef0850d488da1c9019f3a9752a",
"type": "github"
},
"original": {
@ -43,11 +43,11 @@
]
},
"locked": {
"lastModified": 1660459072,
"narHash": "sha256-8DFJjXG8zqoONA1vXtgeKXy68KdJL5UaXR8NtVMUbx8=",
"lastModified": 1709087332,
"narHash": "sha256-HG2cCnktfHsKV0s4XW83gU3F57gaTljL9KNSuG6bnQs=",
"owner": "hercules-ci",
"repo": "gitignore.nix",
"rev": "a20de23b925fd8264fd7fad6454652e142fd7f73",
"rev": "637db329424fd7e46cf4185293b9cc8c88c95394",
"type": "github"
},
"original": {
@ -58,11 +58,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1691155369,
"narHash": "sha256-CIuJO5pgwCMsZM8flIU2OiZ79QfDCesXPsAiokCzlNM=",
"lastModified": 1722415718,
"narHash": "sha256-5US0/pgxbMksF92k1+eOa8arJTJiPvsdZj9Dl+vJkM4=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "7d050b98e51cdbdd88ad960152d398d41c7ff5b4",
"rev": "c3392ad349a5227f4a3464dce87bcc5046692fce",
"type": "github"
},
"original": {
@ -75,9 +75,6 @@
"pre-commit-hooks": {
"inputs": {
"flake-compat": "flake-compat",
"flake-utils": [
"futils"
],
"gitignore": "gitignore",
"nixpkgs": [
"nixpkgs"
@ -87,11 +84,11 @@
]
},
"locked": {
"lastModified": 1691093055,
"narHash": "sha256-sjNWYpDHc6vx+/M0WbBZKltR0Avh2S43UiDbmYtfHt0=",
"lastModified": 1721042469,
"narHash": "sha256-6FPUl7HVtvRHCCBQne7Ylp4p+dpP3P/OYuzjztZ4s70=",
"owner": "cachix",
"repo": "pre-commit-hooks.nix",
"rev": "ebb43bdacd1af8954d04869c77bc3b61fde515e4",
"rev": "f451c19376071a90d8c58ab1a953c6e9840527fd",
"type": "github"
},
"original": {

View file

@ -22,7 +22,6 @@
repo = "pre-commit-hooks.nix";
ref = "master";
inputs = {
flake-utils.follows = "futils";
nixpkgs.follows = "nixpkgs";
nixpkgs-stable.follows = "nixpkgs";
};

View file

@ -3,6 +3,30 @@
<link rel="stylesheet" type="text/css" href="https://tikzjax.com/v1/fonts.css">
<script async src="https://tikzjax.com/v1/tikzjax.js"></script>
{{ end }}
<!-- Graphviz support -->
{{ if (.Params.graphviz) }}
<script src="https://cdn.jsdelivr.net/npm/@viz-js/viz@3.7.0/lib/viz-standalone.min.js"></script>
<script type="text/javascript">
(function() {
Viz.instance().then(function(viz) {
Array.prototype.forEach.call(document.querySelectorAll("pre.graphviz"), function(x) {
var svg = viz.renderSVGElement(x.innerText);
// Let CSS take care of the SVG size
svg.removeAttribute("width")
svg.setAttribute("height", "auto")
x.replaceChildren(svg)
})
})
})();
</script>
{{ end }}
<!-- Mermaid support -->
{{ if (.Params.mermaid) }}
<script type="module" async>
import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@latest/dist/mermaid.esm.min.mjs";
mermaid.initialize({ startOnLoad: true });
</script>
{{ end }}
{{ with .OutputFormats.Get "atom" -}}
{{ printf `<link rel="%s" type="%s" href="%s" title="%s" />` .Rel .MediaType.Type .Permalink $.Site.Title | safeHTML }}
{{ end -}}

View file

@ -0,0 +1,16 @@
<pre class="graphviz">
{{ with .Get "file" }}
{{ if eq (. | printf "%.1s") "/" }}
{{/* Absolute path are from root of site. */}}
{{ $.Scratch.Set "filepath" . }}
{{ else }}
{{/* Relative paths are from page directory. */}}
{{ $.Scratch.Set "filepath" $.Page.File.Dir }}
{{ $.Scratch.Add "filepath" . }}
{{ end }}
{{ $.Scratch.Get "filepath" | readFile }}
{{ else }}
{{.Inner}}
{{ end }}
</pre>

View file

@ -0,0 +1,16 @@
<pre class="mermaid">
{{ with .Get "file" }}
{{ if eq (. | printf "%.1s") "/" }}
{{/* Absolute path are from root of site. */}}
{{ $.Scratch.Set "filepath" . }}
{{ else }}
{{/* Relative paths are from page directory. */}}
{{ $.Scratch.Set "filepath" $.Page.File.Dir }}
{{ $.Scratch.Add "filepath" . }}
{{ end }}
{{ $.Scratch.Get "filepath" | readFile }}
{{ else }}
{{.Inner}}
{{ end }}
</pre>

View file

@ -1,3 +1,16 @@
<script type="text/tikz">
{{.Inner}}
{{ with .Get "file" }}
{{ if eq (. | printf "%.1s") "/" }}
{{/* Absolute path are from root of site. */}}
{{ $.Scratch.Set "filepath" . }}
{{ else }}
{{/* Relative paths are from page directory. */}}
{{ $.Scratch.Set "filepath" $.Page.File.Dir }}
{{ $.Scratch.Add "filepath" . }}
{{ end }}
{{ $.Scratch.Get "filepath" | readFile }}
{{ else }}
{{.Inner}}
{{ end }}
</script>