Compare commits
40 commits
2dc2ab4659
...
9208b4b874
Author | SHA1 | Date | |
---|---|---|---|
Bruno BELANYI | 9208b4b874 | ||
Bruno BELANYI | 11db5a27b9 | ||
Bruno BELANYI | cc4440c946 | ||
Bruno BELANYI | 7bc3d5c18f | ||
Bruno BELANYI | 806772d883 | ||
Bruno BELANYI | 883f0e7e9b | ||
Bruno BELANYI | 9ff4a07c9b | ||
Bruno BELANYI | eff8152307 | ||
Bruno BELANYI | 652fe81c41 | ||
Bruno BELANYI | 3605445bcf | ||
Bruno BELANYI | 476322a627 | ||
Bruno BELANYI | 0798812f86 | ||
Bruno BELANYI | cd24e9692a | ||
Bruno BELANYI | 5a233e7384 | ||
Bruno BELANYI | dea81f1859 | ||
Bruno BELANYI | d33247b786 | ||
Bruno BELANYI | 62cd0759cf | ||
Bruno BELANYI | 87ef9dd38c | ||
Bruno BELANYI | 2eaa9c4329 | ||
Bruno BELANYI | 19b535ce49 | ||
Bruno BELANYI | a6bbb10098 | ||
Bruno BELANYI | e842737cb6 | ||
Bruno BELANYI | 21fbc24e02 | ||
Bruno BELANYI | 879b671332 | ||
Bruno BELANYI | 9ff51fe82e | ||
Bruno BELANYI | 9ef33b7ff8 | ||
Bruno BELANYI | c97d83d883 | ||
Bruno BELANYI | 768acac4ae | ||
Bruno BELANYI | e8acb49b53 | ||
Bruno BELANYI | 114ca1de50 | ||
Bruno BELANYI | 11138dafd1 | ||
Bruno BELANYI | 84ce6ea494 | ||
Bruno BELANYI | dbbcd528c3 | ||
Bruno BELANYI | 4abcd27ee7 | ||
Bruno BELANYI | 06c4a03a42 | ||
Bruno BELANYI | 4da83c9716 | ||
Bruno BELANYI | 408b74daf7 | ||
Bruno BELANYI | a9f003f4ee | ||
Bruno BELANYI | 51a1bd01cd | ||
Bruno BELANYI | f2fa93ad8b |
3
.markdownlint.yaml
Normal file
|
@ -0,0 +1,3 @@
|
|||
# MD024/no-duplicate-heading/no-duplicate-header
|
||||
MD024:
|
||||
siblings_only: true
|
|
@ -67,6 +67,7 @@ params:
|
|||
webmentions:
|
||||
login: belanyi.fr
|
||||
pingback: true
|
||||
mathjax: true
|
||||
|
||||
taxonomies:
|
||||
category: "categories"
|
||||
|
|
|
@ -8,6 +8,8 @@ tags:
|
|||
categories:
|
||||
favorite: false
|
||||
tikz: true
|
||||
graphviz: true
|
||||
mermaid: true
|
||||
---
|
||||
|
||||
## Test post please ignore
|
||||
|
@ -40,6 +42,29 @@ echo hello world | cut -d' ' -f 1
|
|||
\end{tikzpicture}
|
||||
{{% /tikz %}}
|
||||
|
||||
### Graphviz support
|
||||
|
||||
{{% graphviz %}}
|
||||
graph {
|
||||
a -- b
|
||||
b -- c
|
||||
c -- a
|
||||
}
|
||||
{{% /graphviz %}}
|
||||
|
||||
### Mermaid support
|
||||
|
||||
{{% mermaid %}}
|
||||
graph TD
|
||||
A[Enter Chart Definition] --> B(Preview)
|
||||
B --> C{decide}
|
||||
C --> D[Keep]
|
||||
C --> E[Edit Definition]
|
||||
E --> B
|
||||
D --> F[Save Image and Code]
|
||||
F --> B
|
||||
{{% /mermaid %}}
|
||||
|
||||
### Spoilers
|
||||
|
||||
{{% spoiler "Don't open me" %}}
|
||||
|
|
|
@ -15,7 +15,7 @@ favorite: false
|
|||
disable_feed: false
|
||||
---
|
||||
|
||||
To kickoff the [series]({{< ref "/series/cool-algorithms/">}}) of posts about
|
||||
To kickoff the [series]({{< ref "/series/cool-algorithms/" >}}) of posts about
|
||||
algorithms and data structures I find interesting, I will be talking about my
|
||||
favorite one: the [_Disjoint Set_][wiki]. Also known as the _Union-Find_ data
|
||||
structure, so named because of its two main operations: `ds.union(lhs, rhs)` and
|
||||
|
|
191
content/posts/2024-07-06-gap-buffer/index.md
Normal file
|
@ -0,0 +1,191 @@
|
|||
---
|
||||
title: "Gap Buffer"
|
||||
date: 2024-07-06T21:27:19+01:00
|
||||
draft: false # I don't care for draft mode, git has branches for that
|
||||
description: "As featured in GNU Emacs"
|
||||
tags:
|
||||
- algorithms
|
||||
- data structures
|
||||
- python
|
||||
categories:
|
||||
- programming
|
||||
series:
|
||||
- Cool algorithms
|
||||
favorite: false
|
||||
disable_feed: false
|
||||
---
|
||||
|
||||
The [_Gap Buffer_][wiki] is a popular data structure for text editors to
|
||||
represent files and editable buffers, the most famous of them probably being
|
||||
[GNU Emacs][emacs].
|
||||
|
||||
[wiki]: https://en.wikipedia.org/wiki/Gap_buffer
|
||||
[emacs]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html
|
||||
|
||||
<!--more-->
|
||||
|
||||
## What does it do?
|
||||
|
||||
A _Gap Buffer_ is simply a list of characters, similar to a normal string, with
|
||||
the added twist of splitting it into two sides: the prefix and suffix, on either
|
||||
side of the cursor. In between them, a gap is left to allow for quick
|
||||
insertion at the cursor.
|
||||
|
||||
Moving the cursor moves the gap around the buffer, the prefix and suffix getting
|
||||
shorter/longer as required.
|
||||
|
||||
## Implementation
|
||||
|
||||
I'll be writing a sample implementation in Python, as with the rest of the
|
||||
[series]({{< ref "/series/cool-algorithms/" >}}). I don't think it showcases the
|
||||
elegance of the _Gap Buffer_ in action like a C implementation full of
|
||||
`memmove`s would, but it does make it short and sweet.
|
||||
|
||||
### Representation
|
||||
|
||||
We'll be representing the gap buffer as an actual list of characters.
|
||||
|
||||
Given that Python doesn't _have_ characters, let's settle for a list of strings,
|
||||
each representing a single character...
|
||||
|
||||
```python
|
||||
Char = str
|
||||
|
||||
class GapBuffer:
|
||||
# List of characters, contains prefix and suffix of string with gap in the middle
|
||||
_buf: list[Char]
|
||||
# The gap is contained between [start, end) (i.e: buf[start:end])
|
||||
_gap_start: int
|
||||
_gap_end: int
|
||||
|
||||
# Visual representation of the gap buffer:
|
||||
# This is a very [ ]long string.
|
||||
# |<----------------------------------------------->| capacity
|
||||
# |<------------>| |<-------->| string
|
||||
# |<------------------->| gap
|
||||
# |<------------>| prefix
|
||||
# |<-------->| suffix
|
||||
def __init__(self, initial_capacity: int = 16) -> None:
|
||||
assert initial_capacity > 0
|
||||
# Initialize an empty gap buffer
|
||||
self._buf = [""] * initial_capacity
|
||||
self._gap_start = 0
|
||||
self._gap_end = initial_capacity
|
||||
```
|
||||
|
||||
### Accessors
|
||||
|
||||
I'm mostly adding these for exposition, and making it easier to write `assert`s
|
||||
later.
|
||||
|
||||
```python
|
||||
@property
|
||||
def capacity(self) -> int:
|
||||
return len(self._buf)
|
||||
|
||||
@property
|
||||
def gap_length(self) -> int:
|
||||
return self._gap_end - self._gap_start
|
||||
|
||||
@property
|
||||
def string_length(self) -> int:
|
||||
return self.capacity - self.gap_length
|
||||
|
||||
@property
|
||||
def prefix_length(self) -> int:
|
||||
return self._gap_start
|
||||
|
||||
@property
|
||||
def suffix_length(self) -> int:
|
||||
return self.capacity - self._gap_end
|
||||
```
|
||||
|
||||
### Growing the buffer
|
||||
|
||||
I've written this method in a somewhat non-idiomatic manner, to make it closer
|
||||
to how it would look in C using `realloc` instead.
|
||||
|
||||
It would be more efficient to use slicing to insert the needed extra capacity
|
||||
directly, instead of making a new buffer and copying characters over.
|
||||
|
||||
```python
|
||||
def grow(self, capacity: int) -> None:
|
||||
assert capacity >= self.capacity
|
||||
# Create a new buffer with the new capacity
|
||||
new_buf = [""] * capacity
|
||||
# Move the prefix/suffix to their place in the new buffer
|
||||
added_capacity = capacity - len(self._buf)
|
||||
new_buf[: self._gap_start] = self._buf[: self._gap_start]
|
||||
new_buf[self._gap_end + added_capacity :] = self._buf[self._gap_end :]
|
||||
# Use the new buffer, account for added capacity
|
||||
self._buf = new_buf
|
||||
self._gap_end += added_capacity
|
||||
```
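As a rough sketch of the slicing alternative mentioned above, the extra capacity can be spliced straight into the gap instead of copying into a fresh buffer. The free function `grow_in_place` and its parameters are hypothetical names used for illustration only; they are not part of the class shown in this post.

```python
def grow_in_place(buf: list[str], gap_end: int, capacity: int) -> int:
    """Widen the gap of `buf` in place and return the new end of the gap."""
    added_capacity = capacity - len(buf)
    assert added_capacity >= 0
    # Splice empty slots in at the end of the gap, the suffix shifts right
    buf[gap_end:gap_end] = [""] * added_capacity
    return gap_end + added_capacity


# "ab", then a gap of two slots, then "cd"
buf = ["a", "b", "", "", "c", "d"]
gap_end = grow_in_place(buf, gap_end=4, capacity=8)
```

After the call the gap spans four slots, and the prefix and suffix are untouched.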
|
||||
|
||||
### Insertion
|
||||
|
||||
Inserting text at the cursor's position means filling up the gap in the middle
|
||||
of the buffer. To do so we must first make sure that the gap is big enough, or
|
||||
grow the buffer accordingly.
|
||||
|
||||
Then inserting the text is simply a matter of copying its characters in place,
|
||||
and moving the start of the gap further right.
|
||||
|
||||
```python
|
||||
def insert(self, val: str) -> None:
|
||||
# Ensure we have enough space to insert the whole string
|
||||
if len(val) > self.gap_length:
|
||||
self.grow(max(self.capacity * 2, self.string_length + len(val)))
|
||||
# Fill the gap with the given string
|
||||
self._buf[self._gap_start : self._gap_start + len(val)] = val
|
||||
self._gap_start += len(val)
|
||||
```
|
||||
|
||||
### Deletion
|
||||
|
||||
Removing text from the buffer simply expands the gap in the corresponding
|
||||
direction, shortening the string's prefix/suffix. This makes it very cheap.
|
||||
|
||||
The methods are named after the `backspace` and `delete` keys on the keyboard.
|
||||
|
||||
```python
|
||||
def backspace(self, dist: int = 1) -> None:
|
||||
assert dist <= self.prefix_length
|
||||
# Extend gap to the left
|
||||
self._gap_start -= dist
|
||||
|
||||
def delete(self, dist: int = 1) -> None:
|
||||
assert dist <= self.suffix_length
|
||||
# Extend gap to the right
|
||||
self._gap_end += dist
|
||||
```
|
||||
|
||||
### Moving the cursor
|
||||
|
||||
Moving the cursor along the buffer will shift letters from one side of the gap
|
||||
to the other, moving them across from prefix to suffix and back.
|
||||
|
||||
I find Python's list slicing not quite as elegant to read as a `memmove`, though
|
||||
it does make for a very small and efficient implementation.
|
||||
|
||||
```python
|
||||
def left(self, dist: int = 1) -> None:
|
||||
assert dist <= self.prefix_length
|
||||
# Shift the needed number of characters from end of prefix to start of suffix
|
||||
self._buf[self._gap_end - dist : self._gap_end] = self._buf[
|
||||
self._gap_start - dist : self._gap_start
|
||||
]
|
||||
# Adjust indices accordingly
|
||||
self._gap_start -= dist
|
||||
self._gap_end -= dist
|
||||
|
||||
def right(self, dist: int = 1) -> None:
|
||||
assert dist <= self.suffix_length
|
||||
# Shift the needed number of characters from start of suffix to end of prefix
|
||||
self._buf[self._gap_start : self._gap_start + dist] = self._buf[
|
||||
self._gap_end : self._gap_end + dist
|
||||
]
|
||||
# Adjust indices accordingly
|
||||
self._gap_start += dist
|
||||
self._gap_end += dist
|
||||
```
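To see the pieces working together, here is a condensed, self-contained sketch of the same idea with a short editing session. The `text` helper is an assumption of mine (it is not part of the class above), and growth is done by splicing into the gap rather than through a separate `grow` method.

```python
class GapBuffer:
    def __init__(self, initial_capacity: int = 16) -> None:
        assert initial_capacity > 0
        self._buf = [""] * initial_capacity
        self._gap_start = 0
        self._gap_end = initial_capacity

    def text(self) -> str:
        # Hypothetical helper: the string is the prefix followed by the suffix
        return "".join(self._buf[: self._gap_start] + self._buf[self._gap_end :])

    def insert(self, val: str) -> None:
        gap_length = self._gap_end - self._gap_start
        if len(val) > gap_length:
            # Splice extra slots into the gap, at least doubling the capacity
            added = max(len(self._buf), len(val) - gap_length)
            self._buf[self._gap_end : self._gap_end] = [""] * added
            self._gap_end += added
        # Fill the gap with the given string
        self._buf[self._gap_start : self._gap_start + len(val)] = val
        self._gap_start += len(val)

    def backspace(self, dist: int = 1) -> None:
        assert dist <= self._gap_start
        self._gap_start -= dist

    def left(self, dist: int = 1) -> None:
        assert dist <= self._gap_start
        # Move the end of the prefix to the start of the suffix
        self._buf[self._gap_end - dist : self._gap_end] = self._buf[
            self._gap_start - dist : self._gap_start
        ]
        self._gap_start -= dist
        self._gap_end -= dist

    def right(self, dist: int = 1) -> None:
        assert dist <= len(self._buf) - self._gap_end
        # Move the start of the suffix to the end of the prefix
        self._buf[self._gap_start : self._gap_start + dist] = self._buf[
            self._gap_end : self._gap_end + dist
        ]
        self._gap_start += dist
        self._gap_end += dist


buf = GapBuffer(initial_capacity=4)
buf.insert("hello world")  # Larger than the gap, forces the buffer to grow
buf.left(5)                # Cursor is now right before "world"
buf.insert("big ")
buf.right(5)               # Cursor is back at the end of the text
buf.backspace(5)           # Removes "world"
buf.insert("gap")
result = buf.text()
```

The session ends with `result` holding `"hello big gap"`: every edit only touched the characters next to the cursor.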
|
97
content/posts/2024-07-14-bloom-filter/index.md
Normal file
|
@ -0,0 +1,97 @@
|
|||
---
|
||||
title: "Bloom Filter"
|
||||
date: 2024-07-14T17:46:40+01:00
|
||||
draft: false # I don't care for draft mode, git has branches for that
|
||||
description: "Probably cool"
|
||||
tags:
|
||||
- algorithms
|
||||
- data structures
|
||||
- python
|
||||
categories:
|
||||
- programming
|
||||
series:
|
||||
- Cool algorithms
|
||||
favorite: false
|
||||
disable_feed: false
|
||||
---
|
||||
|
||||
The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
|
||||
|
||||
The filter can be used as an inexpensive first step when querying the actual
|
||||
data is quite costly (e.g: as a first check for expensive cache lookups or large
|
||||
data seeks).
|
||||
|
||||
[wiki]: https://en.wikipedia.org/wiki/Bloom_filter
|
||||
|
||||
<!--more-->
|
||||
|
||||
## What does it do?
|
||||
|
||||
A _Bloom Filter_ can be understood as a hash-set which can either tell you:
|
||||
|
||||
* An element is _not_ part of the set.
|
||||
* An element _may be_ part of the set.
|
||||
|
||||
More specifically, one can tweak the parameters of the filter to make it so that
|
||||
the _false positive_ rate of membership is quite low.
|
||||
|
||||
I won't be going into those calculations here, but they are quite trivial to
|
||||
compute, or one can just look up appropriate values for their use case.
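For reference, here is a small sketch of those calculations, using the classic sizing formulas for a filter holding $n$ items with a target false positive rate $p$: a bit count of $m = -\frac{n \ln p}{(\ln 2)^2}$ and $k = \frac{m}{n} \ln 2$ hash functions. The function name `bloom_parameters` and its parameters are hypothetical, and rounding the results is a choice made for this sketch.

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bit_count, hash_count) from the classic Bloom filter sizing formulas."""
    assert expected_items > 0
    assert 0.0 < false_positive_rate < 1.0
    # m = -n * ln(p) / (ln 2)^2
    bit_count = math.ceil(
        -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
    )
    # k = (m / n) * ln 2, using at least one hash function
    hash_count = max(1, round(bit_count / expected_items * math.log(2)))
    return bit_count, hash_count


bit_count, hash_count = bloom_parameters(1_000, 0.01)
```

For 1,000 items and a 1% false positive rate this comes out at roughly 9.6 bits per item and 7 hash functions, far more than the 64 bits assumed in the rest of this post.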
|
||||
|
||||
## Implementation
|
||||
|
||||
I'll be using Python, which has the nifty ability of representing bitsets
|
||||
through its built-in big integers quite easily.
|
||||
|
||||
We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
|
||||
tweaked to use a different number, or even change it at construction time.
|
||||
|
||||
### Representation
|
||||
|
||||
A `BloomFilter` is just a set of bits and a list of hash functions.
|
||||
|
||||
```python
|
||||
BIT_COUNT = 64
|
||||
|
||||
class BloomFilter[T]:
|
||||
_bits: int
|
||||
_hash_functions: list[Callable[[T], int]]
|
||||
|
||||
def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
|
||||
# Filter is initially empty
|
||||
self._bits = 0
|
||||
self._hash_functions = hash_functions
|
||||
```
|
||||
|
||||
### Inserting a key
|
||||
|
||||
To add an element to the filter, we take the output from each hash function and
|
||||
use that to set a bit in the filter. This combination of bits will identify the
|
||||
element, which we can use for lookup later.
|
||||
|
||||
```python
|
||||
def insert(self, val: T) -> None:
|
||||
# Iterate over each hash
|
||||
for f in self._hash_functions:
|
||||
n = f(val) % BIT_COUNT
|
||||
# Set the corresponding bit
|
||||
self._bits |= 1 << n
|
||||
```
|
||||
|
||||
### Querying a key
|
||||
|
||||
Because the _Bloom Filter_ does not actually store its elements, but some
|
||||
derived data from hashing them, it can only definitely say if an element _does
|
||||
not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
|
||||
against the actual underlying store.
|
||||
|
||||
```python
|
||||
def may_contain(self, val: T) -> bool:
|
||||
for f in self._hash_functions:
|
||||
n = f(val) % BIT_COUNT
|
||||
# If one of the bits is unset, the value is definitely not present
|
||||
if not (self._bits & (1 << n)):
|
||||
return False
|
||||
# All bits were matched, `val` is likely to be part of the set
|
||||
return True
|
||||
```
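Putting it together, here is a self-contained usage sketch. The `salted_hash` helper is an assumption of mine, one of many ways to derive several hash functions (here by salting SHA-256); the post itself doesn't prescribe any. I've also dropped the generic parameter to keep the sketch short.

```python
import hashlib
from collections.abc import Callable

BIT_COUNT = 64


def salted_hash(salt: str) -> Callable[[str], int]:
    # Hypothetical helper: derive independent-looking hashes by salting SHA-256
    def hash_function(val: str) -> int:
        digest = hashlib.sha256((salt + val).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return hash_function


class BloomFilter:
    def __init__(self, hash_functions: list[Callable[[str], int]]) -> None:
        self._bits = 0
        self._hash_functions = hash_functions

    def insert(self, val: str) -> None:
        for f in self._hash_functions:
            # Set the bit corresponding to each hash
            self._bits |= 1 << (f(val) % BIT_COUNT)

    def may_contain(self, val: str) -> bool:
        # Every corresponding bit must be set for a (possible) match
        return all(
            self._bits & (1 << (f(val) % BIT_COUNT)) for f in self._hash_functions
        )


hashes = [salted_hash(salt) for salt in ("a", "b", "c")]
bloom = BloomFilter(hashes)
for word in ("gap buffer", "treap"):
    bloom.insert(word)

# Inserted values are always reported as possibly present: no false negatives
present = [bloom.may_contain(word) for word in ("gap buffer", "treap")]
# An empty filter can never report a match
empty_match = BloomFilter(hashes).may_contain("treap")
```

Any other query may return `True` by accident, which is exactly the false positive case the underlying store has to resolve.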
|
159
content/posts/2024-07-20-treap/index.md
Normal file
|
@ -0,0 +1,159 @@
|
|||
---
|
||||
title: "Treap"
|
||||
date: 2024-07-20T14:12:27+01:00
|
||||
draft: false # I don't care for draft mode, git has branches for that
|
||||
description: "A simpler BST"
|
||||
tags:
|
||||
- algorithms
|
||||
- data structures
|
||||
- python
|
||||
categories:
|
||||
- programming
|
||||
series:
|
||||
- Cool algorithms
|
||||
favorite: false
|
||||
disable_feed: false
|
||||
graphviz: true
|
||||
---
|
||||
|
||||
The [_Treap_][wiki] is a mix between a _Binary Search Tree_ and a _Heap_.
|
||||
|
||||
Like a _Binary Search Tree_, it keeps an ordered set of keys in the shape of a
|
||||
tree, allowing for binary search traversal.
|
||||
|
||||
Like a _Heap_, it associates each node with a priority, making sure that a
|
||||
parent's priority is always higher than any of its children.
|
||||
|
||||
[wiki]: https://en.wikipedia.org/wiki/Treap
|
||||
|
||||
<!--more-->
|
||||
|
||||
## What does it do?
|
||||
|
||||
By randomizing the priority value of each key at insertion time, we ensure a
|
||||
high likelihood that the tree stays _roughly_ balanced, avoiding degenerating to
|
||||
unbalanced O(N) height.
|
||||
|
||||
Here's a sample tree created by inserting integers from 0 to 250 into the tree:
|
||||
|
||||
{{< graphviz file="treap.gv" />}}
|
||||
|
||||
## Implementation
|
||||
|
||||
I'll be keeping the theme for this [series] by using Python to implement the
|
||||
_Treap_. This leads to somewhat annoying code to handle the rotation process,
|
||||
which is easier to do in C using pointers.
|
||||
|
||||
[series]: {{< ref "/series/cool-algorithms/" >}}
|
||||
|
||||
### Representation
|
||||
|
||||
Creating a new `Treap` is easy: the tree starts off empty, waiting for new nodes
|
||||
to insert.
|
||||
|
||||
Each `Node` must keep track of the `key`, the mapped `value`, and the node's
|
||||
`priority` (which is assigned randomly). Finally it must also allow for storing
|
||||
two children (`left` and `right`).
|
||||
|
||||
```python
|
||||
class Node[K, V]:
|
||||
key: K
|
||||
value: V
|
||||
priority: float
|
||||
left: Node[K, V] | None
|
||||
right: Node[K, V] | None
|
||||
|
||||
def __init__(self, key: K, value: V):
|
||||
# Store key and value, like a normal BST node
|
||||
self.key = key
|
||||
self.value = value
|
||||
# Priority is derived randomly
|
||||
self.priority = random()
|
||||
self.left = None
|
||||
self.right = None
|
||||
|
||||
class Treap[K, V]:
|
||||
_root: Node[K, V] | None
|
||||
|
||||
def __init__(self):
|
||||
# The tree starts out empty
|
||||
self._root = None
|
||||
```
|
||||
|
||||
### Search
|
||||
|
||||
Searching the tree is the same as in any other _Binary Search Tree_.
|
||||
|
||||
```python
|
||||
def get(self, key: K) -> V | None:
|
||||
node = self._root
|
||||
# The usual BST traversal
|
||||
while node is not None:
|
||||
if node.key == key:
|
||||
return node.value
|
||||
elif node.key < key:
|
||||
node = node.right
|
||||
else:
|
||||
node = node.left
|
||||
return None
|
||||
```
|
||||
|
||||
### Insertion
|
||||
|
||||
To insert a new `key` into the tree, we identify which leaf position it should
|
||||
be inserted at. We then generate the node's priority, insert it at this
|
||||
position, and rotate the node upwards until the heap property is respected.
|
||||
|
||||
```python
|
||||
type ChildField = Literal["left", "right"]
|
||||
|
||||
def insert(self, key: K, value: V) -> bool:
|
||||
# Empty treap base-case
|
||||
if self._root is None:
|
||||
self._root = Node(key, value)
|
||||
# Signal that we're not overwriting the value
|
||||
return False
|
||||
# Keep track of the parent chain for rotation after insertion
|
||||
parents = []
|
||||
node = self._root
|
||||
while node is not None:
|
||||
# Insert a pre-existing key
|
||||
if node.key == key:
|
||||
node.value = value
|
||||
return True
|
||||
# Go down the tree, keep track of the path through the tree
|
||||
field = "left" if key < node.key else "right"
|
||||
parents.append((node, field))
|
||||
node = getattr(node, field)
|
||||
# Key wasn't found, we're inserting a new node
|
||||
child = Node(key, value)
|
||||
parent, field = parents[-1]
|
||||
setattr(parent, field, child)
|
||||
# Rotate the new node up until we respect the decreasing priority property
|
||||
self._rotate_up(child, parents)
|
||||
# Key wasn't found, signal that we inserted a new node
|
||||
return False
|
||||
|
||||
def _rotate_up(
|
||||
self,
|
||||
node: Node[K, V],
|
||||
parents: list[tuple[Node[K, V], ChildField]],
|
||||
) -> None:
|
||||
while parents:
|
||||
parent, field = parents.pop()
|
||||
# If the parent has higher priority, we're done rotating
|
||||
if parent.priority >= node.priority:
|
||||
break
|
||||
# Check for grand-parent/root of tree edge-case
|
||||
if parents:
|
||||
# Update grand-parent to point to the new rotated node
|
||||
grand_parent, grand_field = parents[-1]
|
||||
setattr(grand_parent, grand_field, node)
|
||||
else:
|
||||
# Point the root to the new rotated node
|
||||
self._root = node
|
||||
other_field = "left" if field == "right" else "right"
|
||||
# Rotate the node up
|
||||
setattr(parent, field, getattr(node, other_field))
|
||||
setattr(node, other_field, parent)
|
||||
```
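Rotations are easy to get subtly wrong, so a checker for the two invariants is handy when testing. This is only a sketch: the `Node` below is a stripped-down stand-in (integer keys, no value) and `is_valid_treap` is a hypothetical helper, neither being part of the implementation above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    priority: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def is_valid_treap(
    node: Optional[Node],
    low: Optional[int] = None,
    high: Optional[int] = None,
) -> bool:
    if node is None:
        return True
    # BST property: the key must lie strictly between the inherited bounds
    if low is not None and node.key <= low:
        return False
    if high is not None and node.key >= high:
        return False
    # Heap property: no child may have a higher priority than its parent
    for child in (node.left, node.right):
        if child is not None and child.priority > node.priority:
            return False
    return is_valid_treap(node.left, low, node.key) and is_valid_treap(
        node.right, node.key, high
    )


valid = Node(2, 0.9, Node(1, 0.5), Node(3, 0.1))
bad_heap = Node(2, 0.2, Node(1, 0.5), Node(3, 0.1))
bad_order = Node(2, 0.9, Node(3, 0.5))
```

Only `valid` passes the check: `bad_heap` has a child outranking its parent, and `bad_order` stores a larger key in its left sub-tree.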
|
1004
content/posts/2024-07-20-treap/treap.gv
Normal file
File diff suppressed because it is too large
146
content/posts/2024-07-27-treap-revisited/index.md
Normal file
|
@ -0,0 +1,146 @@
|
|||
---
|
||||
title: "Treap, revisited"
|
||||
date: 2024-07-27T14:12:27+01:00
|
||||
draft: false # I don't care for draft mode, git has branches for that
|
||||
description: "An even simpler BST"
|
||||
tags:
|
||||
- algorithms
|
||||
- data structures
|
||||
- python
|
||||
categories:
|
||||
- programming
|
||||
series:
|
||||
- Cool algorithms
|
||||
favorite: false
|
||||
disable_feed: false
|
||||
---
|
||||
|
||||
My [last post]({{< relref "../2024-07-20-treap/index.md" >}}) about the _Treap_
|
||||
showed an implementation using tree rotations, as is commonly done with [AVL
|
||||
Trees][avl] and [Red Black Trees][rb].
|
||||
|
||||
But the _Treap_ lends itself well to a simple and elegant implementation with no
|
||||
tree rotations. This makes it especially easy to implement the removal of a key,
|
||||
rather than the fiddly process of deletion using tree rotations.
|
||||
|
||||
[avl]: https://en.wikipedia.org/wiki/AVL_tree
|
||||
[rb]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
|
||||
|
||||
<!--more-->
|
||||
|
||||
## Implementation
|
||||
|
||||
All operations on the tree will be implemented in terms of two fundamental
|
||||
operations: `split` and `merge`.
|
||||
|
||||
We'll be reusing the same structures as in the last post, so let's skip straight
|
||||
to implementing those fundamentals, and building on them for `insert` and
|
||||
`delete`.
|
||||
|
||||
### Split
|
||||
|
||||
Splitting a tree means taking a key, and getting the following output:
|
||||
|
||||
* a `left` node, root of the tree of all keys lower than the input.
|
||||
* an extracted `node` which corresponds to the input `key`.
|
||||
* a `right` node, root of the tree of all keys higher than the input.
|
||||
|
||||
```python
|
||||
type OptionalNode[K, V] = Node[K, V] | None
|
||||
|
||||
class SplitResult(NamedTuple):
|
||||
left: OptionalNode
|
||||
node: OptionalNode
|
||||
right: OptionalNode
|
||||
|
||||
def split[K, V](root: OptionalNode[K, V], key: K) -> SplitResult:
|
||||
# Base case, empty tree
|
||||
if root is None:
|
||||
return SplitResult(None, None, None)
|
||||
# If we found the key, simply extract left and right
|
||||
if root.key == key:
|
||||
left, right = root.left, root.right
|
||||
root.left, root.right = None, None
|
||||
return SplitResult(left, root, right)
|
||||
# Otherwise, recurse on the corresponding side of the tree
|
||||
if root.key < key:
|
||||
left, node, right = split(root.right, key)
|
||||
root.right = left
|
||||
return SplitResult(root, node, right)
|
||||
if key < root.key:
|
||||
left, node, right = split(root.left, key)
|
||||
root.left = right
|
||||
return SplitResult(left, node, root)
|
||||
raise RuntimeError("Unreachable")
|
||||
```
|
||||
|
||||
### Merge
|
||||
|
||||
Merging a `left` and `right` tree means (cheaply) building a new tree containing
|
||||
both of them. A pre-condition for merging is that the `left` tree is composed
|
||||
entirely of nodes that are lower than any key in `right` (i.e: as in `left` and
|
||||
`right` after a `split`).
|
||||
|
||||
```python
|
||||
def merge[K, V](
|
||||
left: OptionalNode[K, V],
|
||||
right: OptionalNode[K, V],
|
||||
) -> OptionalNode[K, V]:
|
||||
# Base cases, left or right being empty
|
||||
if left is None:
|
||||
return right
|
||||
if right is None:
|
||||
return left
|
||||
# Left has higher priority, it must become the root node
|
||||
if left.priority >= right.priority:
|
||||
# We recursively reconstruct its right sub-tree
|
||||
left.right = merge(left.right, right)
|
||||
return left
|
||||
# Right has higher priority, it must become the root node
|
||||
if left.priority < right.priority:
|
||||
# We recursively reconstruct its left sub-tree
|
||||
right.left = merge(left, right.left)
|
||||
return right
|
||||
raise RuntimeError("Unreachable")
|
||||
```
|
||||
|
||||
### Insertion
|
||||
|
||||
Inserting a node into the tree is done in two steps:
|
||||
|
||||
1. `split` the tree to isolate the middle insertion point
|
||||
2. `merge` it back up to form a full tree with the inserted key
|
||||
|
||||
```python
|
||||
def insert(self, key: K, value: V) -> bool:
|
||||
# `left` and `right` come before/after the key
|
||||
left, node, right = split(self._root, key)
|
||||
was_updated: bool
|
||||
# Create the node, or update its value, if the key was already in the tree
|
||||
if node is None:
|
||||
node = Node(key, value)
|
||||
was_updated = False
|
||||
else:
|
||||
node.value = value
|
||||
was_updated = True
|
||||
# Rebuild the tree with a couple of merge operations
|
||||
self._root = merge(left, merge(node, right))
|
||||
# Signal whether the key was already in the tree
|
||||
return was_updated
|
||||
```
|
||||
|
||||
### Removal
|
||||
|
||||
Removing a key from the tree is similar to inserting a new key, and forgetting
|
||||
to insert it back: simply `split` the tree and `merge` it back without the
|
||||
extracted middle node.
|
||||
|
||||
```python
|
||||
def remove(self, key: K) -> bool:
|
||||
# `node` contains the key, or `None` if the key wasn't in the tree
|
||||
left, node, right = split(self._root, key)
|
||||
# Put the tree back together, without the extracted node
|
||||
self._root = merge(left, right)
|
||||
# Signal whether `key` was mapped in the tree
|
||||
return node is not None
|
||||
```
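As a sanity check, the four operations can be condensed into a self-contained sketch and exercised end to end. I've written them as free functions over a plain dataclass to keep it short, and `in_order` is a hypothetical helper added only to inspect the result; none of this changes the approach described above.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    key: int
    value: str
    priority: float = field(default_factory=random.random)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def split(root, key):
    if root is None:
        return None, None, None
    if root.key == key:
        left, right = root.left, root.right
        root.left = root.right = None
        return left, root, right
    if root.key < key:
        left, node, right = split(root.right, key)
        root.right = left
        return root, node, right
    left, node, right = split(root.left, key)
    root.left = right
    return left, node, root


def merge(left, right):
    if left is None:
        return right
    if right is None:
        return left
    # The node with the higher priority becomes the root
    if left.priority >= right.priority:
        left.right = merge(left.right, right)
        return left
    right.left = merge(left, right.left)
    return right


def insert(root, key, value):
    left, node, right = split(root, key)
    if node is None:
        node = Node(key, value)
    else:
        node.value = value
    return merge(left, merge(node, right))


def remove(root, key):
    left, _, right = split(root, key)
    return merge(left, right)


def in_order(root):
    # Hypothetical helper: list the keys in sorted order
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


root = None
for key in [5, 1, 4, 2, 3]:
    root = insert(root, key, str(key))
root = remove(root, 4)
keys = in_order(root)
```

Whatever priorities get drawn, `keys` ends up as `[1, 2, 3, 5]`: the random priorities only affect the shape of the tree, never its contents.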
|
145
content/posts/2024-08-02-reservoir-sampling/index.md
Normal file
|
@ -0,0 +1,145 @@
|
|||
---
|
||||
title: "Reservoir Sampling"
|
||||
date: 2024-08-02T18:30:56+01:00
|
||||
draft: false # I don't care for draft mode, git has branches for that
|
||||
description: "Elegantly sampling a stream"
|
||||
tags:
|
||||
- algorithms
|
||||
- python
|
||||
categories:
|
||||
- programming
|
||||
series:
|
||||
- Cool algorithms
|
||||
favorite: false
|
||||
disable_feed: false
|
||||
mathjax: true
|
||||
---
|
||||
|
||||
[_Reservoir Sampling_][reservoir] is an [online][online], probabilistic
|
||||
algorithm to uniformly sample $k$ random elements out of a stream of values.
|
||||
|
||||
It's a particularly elegant and small algorithm, only requiring $\Theta(k)$
|
||||
amount of space and a single pass through the stream.
|
||||
|
||||
[reservoir]: https://en.wikipedia.org/wiki/Reservoir_sampling
|
||||
[online]: https://en.wikipedia.org/wiki/Online_algorithm
|
||||
|
||||
<!--more-->
|
||||
|
||||
## Sampling one element
|
||||
|
||||
As an introduction, we'll first focus on fairly sampling one element from the
|
||||
stream.
|
||||
|
||||
```python
|
||||
def sample_one[T](stream: Iterable[T]) -> T:
|
||||
stream_iter = iter(stream)
|
||||
# Sample the first element
|
||||
res = next(stream_iter)
|
||||
for i, val in enumerate(stream_iter, start=1):
|
||||
j = random.randint(0, i)
|
||||
# Replace the sampled element with probability 1/(i + 1)
|
||||
if j == 0:
|
||||
res = val
|
||||
# Return the randomly sampled element
|
||||
return res
|
||||
```
|
||||
|
||||
### Proof
|
||||
|
||||
Let's now prove that this algorithm leads to a fair sampling of the stream.
|
||||
|
||||
We'll be doing proof by induction.
|
||||
|
||||
#### Hypothesis $H_N$
|
||||
|
||||
After iterating through the first $N$ items in the stream,
|
||||
each of them has had an equal $\frac{1}{N}$ probability of being selected as
|
||||
`res`.
|
||||
|
||||
#### Base Case $H_1$
|
||||
|
||||
We can trivially observe that the first element is always assigned to `res`,
|
||||
i.e. with probability $\frac{1}{1} = 1$: the hypothesis has been verified.
|
||||
|
||||
#### Inductive Case
|
||||
|
||||
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
|
||||
of the loop iteration where `i = N` (i.e: observation of the $(N + 1)$-th item in the
|
||||
stream).
|
||||
|
||||
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
|
||||
a.k.a $[0, N]$. We then have two cases:
|
||||
|
||||
* `j == 0`, with probability $\frac{1}{N + 1}$: we select `val` as the new
|
||||
reservoir element `res`.
|
||||
|
||||
* `j != 0`, with probability $\frac{N}{N + 1}$: we keep the previous value of
|
||||
`res`. By $H_N$, each of the first $N$ elements had a $\frac{1}{N}$ probability
|
||||
of being `res` at the start of the loop, so each of them now has a
|
||||
probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being the
|
||||
selected element.
|
||||
|
||||
And thus, we have proven $H_{N + 1}$ at the end of the loop.
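The proof can also be sanity-checked empirically. The sketch below repeats the one-element sampler many times over a three-item stream and counts how often each item is picked. The `rng` parameter, the fixed seed and the number of trials are choices made for this sketch, not something the algorithm requires.

```python
import random
from collections import Counter
from collections.abc import Iterable


def sample_one(stream: Iterable[int], rng: random.Random) -> int:
    stream_iter = iter(stream)
    # Sample the first element
    res = next(stream_iter)
    for i, val in enumerate(stream_iter, start=1):
        # Replace the sampled element with probability 1/(i + 1)
        if rng.randint(0, i) == 0:
            res = val
    return res


rng = random.Random(42)  # Fixed seed so that the run is repeatable
trials = 60_000
counts = Counter(sample_one(range(3), rng) for _ in range(trials))
for value in range(3):
    # Each ratio should be close to 1/3
    print(value, counts[value] / trials)
```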
|
||||
|
||||
## Sampling $k$ elements
|
||||
|
||||
The code for sampling $k$ elements is very similar to the one-element case.
|
||||
|
||||
```python
|
||||
def sample[T](stream: Iterable[T], k: int = 1) -> list[T]:
|
||||
stream_iter = iter(stream)
|
||||
# Retain the first 'k' elements in the reservoir
|
||||
res = list(itertools.islice(stream_iter, k))
|
||||
for i, val in enumerate(stream_iter, start=k):
|
||||
j = random.randint(0, i)
|
||||
# Replace one element at random with probability k/(i + 1)
|
||||
if j < k:
|
||||
res[j] = val
|
||||
# Return 'k' randomly sampled elements
|
||||
return res
|
||||
```
|
||||
|
||||
### Proof
|
||||
|
||||
Let us once again do a proof by induction, assuming the stream contains at least
|
||||
$k$ items.
|
||||
|
||||
#### Hypothesis $H_N$
|
||||
|
||||
After iterating through the first $N$ items in the stream, each of them has had
|
||||
an equal $\frac{k}{N}$ probability of being sampled from the stream.
|
||||
|
||||
#### Base Case $H_k$
|
||||
|
||||
We can trivially observe that the first $k$ elements are sampled at the start of
|
||||
the algorithm, i.e. with probability $\frac{k}{k} = 1$: the hypothesis has been verified.
|
||||
|
||||
#### Inductive Case
|
||||
|
||||
For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
|
||||
of the loop iteration where `i = N`, in order to prove $H_{N + 1}$.
|
||||
|
||||
`j = random.randint(0, i)` uniformly selects a value in the range $[0, i]$,
|
||||
a.k.a $[0, N]$. We then have two cases:
|
||||
|
||||
* `j >= k`, with probability $1 - \frac{k}{N + 1}$: we do not modify the
|
||||
sampled reservoir at all.
|
||||
|
||||
* `j < k`, with probability $\frac{k}{N + 1}$: we sample the new element to
|
||||
replace the `j`-th element of the reservoir. Therefore for any element
|
||||
$e \in [0, k[$ we can either have:
|
||||
* $j = e$: the element _is_ replaced, probability $\frac{1}{k}$.
|
||||
* $j \neq e$: the element is _not_ replaced, probability $\frac{k - 1}{k}$.
|
||||
|
||||
We can now compute the probability that a previously sampled element is kept in
|
||||
the reservoir:
|
||||
$1 - \frac{k}{N + 1} + \frac{k}{N + 1} \cdot \frac{k - 1}{k} = \frac{N}{N + 1}$.
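Spelling out the algebra behind that result, by putting both terms over the common denominator $N + 1$:

$1 - \frac{k}{N + 1} + \frac{k}{N + 1} \cdot \frac{k - 1}{k} = \frac{N + 1 - k}{N + 1} + \frac{k - 1}{N + 1} = \frac{N}{N + 1}$.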
|
||||
|
||||
By $H_N$, each of the first $N$ elements had a $\frac{k}{N}$ probability
|
||||
of being sampled at the start of the loop, so each of them now has a
|
||||
probability $\frac{k}{N} \cdot \frac{N}{N + 1} = \frac{k}{N + 1}$ of being in the
|
||||
reservoir.
|
||||
|
||||
We have now proven that all elements have a probability $\frac{k}{N + 1}$ of
|
||||
being sampled at the end of the loop, therefore $H_{N + 1}$ has been verified.
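As with the one-element case, an empirical check makes for a nice complement to the proof. This sketch samples $k = 2$ items out of a five-item stream many times over, and counts how often each item ends up in the reservoir. The `rng` parameter, the seed and the number of trials are again assumptions made for this sketch.

```python
import itertools
import random
from collections import Counter
from collections.abc import Iterable


def sample(stream: Iterable[int], k: int, rng: random.Random) -> list[int]:
    stream_iter = iter(stream)
    # Retain the first 'k' elements in the reservoir
    res = list(itertools.islice(stream_iter, k))
    for i, val in enumerate(stream_iter, start=k):
        j = rng.randint(0, i)
        # Replace one element at random with probability k/(i + 1)
        if j < k:
            res[j] = val
    return res


rng = random.Random(42)  # Fixed seed so that the run is repeatable
trials, n, k = 30_000, 5, 2
counts = Counter()
for _ in range(trials):
    counts.update(sample(range(n), k, rng))
for value in range(n):
    # Each ratio should be close to k/n = 0.4
    print(value, counts[value] / trials)
```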
|
33
flake.lock
|
@ -3,11 +3,11 @@
|
|||
"flake-compat": {
|
||||
"flake": false,
|
||||
"locked": {
|
||||
"lastModified": 1673956053,
|
||||
"narHash": "sha256-4gtG9iQuiKITOjNQQeQIpoIB6b16fm+504Ch3sNKLd8=",
|
||||
"lastModified": 1696426674,
|
||||
"narHash": "sha256-kvjfFW7WAETZlt09AgDn1MrtKzP7t90Vf7vypd3OL1U=",
|
||||
"owner": "edolstra",
|
||||
"repo": "flake-compat",
|
||||
"rev": "35bb57c0c8d8b62bbfd284272c928ceb64ddbde9",
|
||||
"rev": "0f9255e01c2351cc7d116c072cb317785dd33b33",
|
||||
"type": "github"
|
||||
},
|
||||
"original": {
|
||||
|
@ -21,11 +21,11 @@
|
|||
"systems": "systems"
|
||||
},
|
||||
"locked": {
|
||||
"lastModified": 1689068808,
|
||||
"narHash": "sha256-6ixXo3wt24N/melDWjq70UuHQLxGV8jZvooRanIHXw0=",
|
||||
"lastModified": 1710146030,
|
||||
"narHash": "sha256-SZ5L6eA7HJ/nmkzGG7/ISclqe6oZdOZTNoesiInkXPQ=",
|
||||
"owner": "numtide",
|
||||
"repo": "flake-utils",
|
||||
"rev": "919d646de7be200f3bf08cb76ae1f09402b6f9b4",
|
||||
"rev": "b1d9ab70662946ef0850d488da1c9019f3a9752a",
|
||||
"type": "github"
|
||||
},
|
||||
"original": {
|
||||
|
@@ -43,11 +43,11 @@
         ]
       },
       "locked": {
-        "lastModified": 1660459072,
-        "narHash": "sha256-8DFJjXG8zqoONA1vXtgeKXy68KdJL5UaXR8NtVMUbx8=",
+        "lastModified": 1709087332,
+        "narHash": "sha256-HG2cCnktfHsKV0s4XW83gU3F57gaTljL9KNSuG6bnQs=",
         "owner": "hercules-ci",
         "repo": "gitignore.nix",
-        "rev": "a20de23b925fd8264fd7fad6454652e142fd7f73",
+        "rev": "637db329424fd7e46cf4185293b9cc8c88c95394",
         "type": "github"
       },
       "original": {
@@ -58,11 +58,11 @@
     },
     "nixpkgs": {
       "locked": {
-        "lastModified": 1691155369,
-        "narHash": "sha256-CIuJO5pgwCMsZM8flIU2OiZ79QfDCesXPsAiokCzlNM=",
+        "lastModified": 1722415718,
+        "narHash": "sha256-5US0/pgxbMksF92k1+eOa8arJTJiPvsdZj9Dl+vJkM4=",
         "owner": "NixOS",
         "repo": "nixpkgs",
-        "rev": "7d050b98e51cdbdd88ad960152d398d41c7ff5b4",
+        "rev": "c3392ad349a5227f4a3464dce87bcc5046692fce",
         "type": "github"
       },
       "original": {
@@ -75,9 +75,6 @@
     "pre-commit-hooks": {
       "inputs": {
         "flake-compat": "flake-compat",
-        "flake-utils": [
-          "futils"
-        ],
         "gitignore": "gitignore",
         "nixpkgs": [
           "nixpkgs"
@@ -87,11 +84,11 @@
         ]
       },
       "locked": {
-        "lastModified": 1691093055,
-        "narHash": "sha256-sjNWYpDHc6vx+/M0WbBZKltR0Avh2S43UiDbmYtfHt0=",
+        "lastModified": 1721042469,
+        "narHash": "sha256-6FPUl7HVtvRHCCBQne7Ylp4p+dpP3P/OYuzjztZ4s70=",
         "owner": "cachix",
         "repo": "pre-commit-hooks.nix",
-        "rev": "ebb43bdacd1af8954d04869c77bc3b61fde515e4",
+        "rev": "f451c19376071a90d8c58ab1a953c6e9840527fd",
         "type": "github"
       },
       "original": {
@@ -22,7 +22,6 @@
       repo = "pre-commit-hooks.nix";
       ref = "master";
       inputs = {
-        flake-utils.follows = "futils";
         nixpkgs.follows = "nixpkgs";
         nixpkgs-stable.follows = "nixpkgs";
       };
@@ -3,6 +3,30 @@
   <link rel="stylesheet" type="text/css" href="https://tikzjax.com/v1/fonts.css">
   <script async src="https://tikzjax.com/v1/tikzjax.js"></script>
 {{ end }}
+<!-- Graphviz support -->
+{{ if (.Params.graphviz) }}
+  <script src="https://cdn.jsdelivr.net/npm/@viz-js/viz@3.7.0/lib/viz-standalone.min.js"></script>
+  <script type="text/javascript">
+    (function() {
+      Viz.instance().then(function(viz) {
+        Array.prototype.forEach.call(document.querySelectorAll("pre.graphviz"), function(x) {
+          var svg = viz.renderSVGElement(x.innerText);
+          // Let CSS take care of the SVG size
+          svg.removeAttribute("width")
+          svg.setAttribute("height", "auto")
+          x.replaceChildren(svg)
+        })
+      })
+    })();
+  </script>
+{{ end }}
+<!-- Mermaid support -->
+{{ if (.Params.mermaid) }}
+  <script type="module" async>
+    import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@latest/dist/mermaid.esm.min.mjs";
+    mermaid.initialize({ startOnLoad: true });
+  </script>
+{{ end }}
 {{ with .OutputFormats.Get "atom" -}}
 {{ printf `<link rel="%s" type="%s" href="%s" title="%s" />` .Rel .MediaType.Type .Permalink $.Site.Title | safeHTML }}
 {{ end -}}
16
layouts/shortcodes/graphviz.html
Normal file
@@ -0,0 +1,16 @@
+<pre class="graphviz">
+  {{ with .Get "file" }}
+    {{ if eq (. | printf "%.1s") "/" }}
+      {{/* Absolute paths are from the root of the site. */}}
+      {{ $.Scratch.Set "filepath" . }}
+    {{ else }}
+      {{/* Relative paths are from the page directory. */}}
+      {{ $.Scratch.Set "filepath" $.Page.File.Dir }}
+      {{ $.Scratch.Add "filepath" . }}
+    {{ end }}
+
+    {{ $.Scratch.Get "filepath" | readFile }}
+  {{ else }}
+    {{.Inner}}
+  {{ end }}
+</pre>
16
layouts/shortcodes/mermaid.html
Normal file
@@ -0,0 +1,16 @@
+<pre class="mermaid">
+  {{ with .Get "file" }}
+    {{ if eq (. | printf "%.1s") "/" }}
+      {{/* Absolute paths are from the root of the site. */}}
+      {{ $.Scratch.Set "filepath" . }}
+    {{ else }}
+      {{/* Relative paths are from the page directory. */}}
+      {{ $.Scratch.Set "filepath" $.Page.File.Dir }}
+      {{ $.Scratch.Add "filepath" . }}
+    {{ end }}
+
+    {{ $.Scratch.Get "filepath" | readFile }}
+  {{ else }}
+    {{.Inner}}
+  {{ end }}
+</pre>
@@ -1,3 +1,16 @@
 <script type="text/tikz">
-  {{.Inner}}
+  {{ with .Get "file" }}
+    {{ if eq (. | printf "%.1s") "/" }}
+      {{/* Absolute paths are from the root of the site. */}}
+      {{ $.Scratch.Set "filepath" . }}
+    {{ else }}
+      {{/* Relative paths are from the page directory. */}}
+      {{ $.Scratch.Set "filepath" $.Page.File.Dir }}
+      {{ $.Scratch.Add "filepath" . }}
+    {{ end }}
+
+    {{ $.Scratch.Get "filepath" | readFile }}
+  {{ else }}
+    {{.Inner}}
+  {{ end }}
 </script>