Compare commits

...

10 commits

Author SHA1 Message Date
Bruno BELANYI 763ee444d4 Add Bloom Filter post 2024-07-14 17:57:36 +01:00
Bruno BELANYI 5e3ba4fb04 posts: bloom-filter: add lookup 2024-07-14 17:57:04 +01:00
Bruno BELANYI 0030310952 posts: bloom-filter: add insertion 2024-07-14 17:56:19 +01:00
Bruno BELANYI dda444bdc0 posts: bloom-filter: add construction 2024-07-14 17:55:15 +01:00
Bruno BELANYI aea5587742 posts: bloom-filter: add presentation 2024-07-14 17:54:59 +01:00
Bruno BELANYI c13abdc134 Add Gap Buffer post 2024-07-14 17:54:27 +01:00
Bruno BELANYI 987078068f posts: add bloom-filter 2024-07-14 17:54:27 +01:00
Bruno BELANYI f0b3c77862 posts: gap-buffer: add movement 2024-07-14 17:54:27 +01:00
Bruno BELANYI 6a1c074e32 posts: gap-buffer: add deletion 2024-07-14 17:54:27 +01:00
Bruno BELANYI c413bb82a4 posts: gap-buffer: add insertion 2024-07-14 17:54:27 +01:00
2 changed files with 165 additions and 0 deletions


@ -121,3 +121,71 @@ def grow(self, capacity: int) -> None:
    self._buf = new_buf
    self._gap_end += added_capacity
```
### Insertion
Inserting text at the cursor's position means filling up the gap in the middle
of the buffer. To do so we must first make sure that the gap is big enough, or
grow the buffer accordingly.
Then inserting the text is simply a matter of copying its characters in place,
and moving the start of the gap further right.
```python
def insert(self, val: str) -> None:
    # Ensure we have enough space to insert the whole string
    if len(val) > self.gap_length:
        self.grow(max(self.capacity * 2, self.string_length + len(val)))
    # Fill the gap with the given string
    self._buf[self._gap_start : self._gap_start + len(val)] = val
    self._gap_start += len(val)
```
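To make the slice assignment concrete, here is a tiny standalone trace on a plain list. It assumes the buffer is a `list[str]` whose gap cells hold a placeholder; the variable names are illustrative and not part of the class.

```python
# Hypothetical 8-cell buffer holding "ab" + a 4-cell gap + "yz"
buf = ["a", "b", "", "", "", "", "y", "z"]
gap_start, gap_end = 2, 6

val = "cde"
# Assigning a string to a slice of the same length writes one character per
# cell, so the list keeps its size and the suffix does not move
buf[gap_start : gap_start + len(val)] = val
gap_start += len(val)

# The visible text is the prefix followed by the suffix; the gap is skipped
text = "".join(buf[:gap_start] + buf[gap_end:])
```

Only the gap's start index moves: the suffix `"yz"` stays where it was, which is why inserting at the cursor costs no more than copying the new characters.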
### Deletion
Removing text from the buffer simply expands the gap in the corresponding
direction, shortening the string's prefix/suffix. This makes it very cheap.
The methods are named after the `backspace` and `delete` keys on the keyboard.
```python
def backspace(self, dist: int = 1) -> None:
    assert dist <= self.prefix_length
    # Extend gap to the left
    self._gap_start -= dist

def delete(self, dist: int = 1) -> None:
    assert dist <= self.suffix_length
    # Extend gap to the right
    self._gap_end += dist
```
### Moving the cursor
Moving the cursor along the buffer will shift letters from one side of the gap
to the other, moving them across from prefix to suffix and back.
I find Python's list slicing not quite as elegant to read as a `memmove`, though
it does make for a very small and efficient implementation.
```python
def left(self, dist: int = 1) -> None:
    assert dist <= self.prefix_length
    # Shift the needed number of characters from end of prefix to start of suffix
    self._buf[self._gap_end - dist : self._gap_end] = self._buf[
        self._gap_start - dist : self._gap_start
    ]
    # Adjust indices accordingly
    self._gap_start -= dist
    self._gap_end -= dist

def right(self, dist: int = 1) -> None:
    assert dist <= self.suffix_length
    # Shift the needed number of characters from start of suffix to end of prefix
    self._buf[self._gap_start : self._gap_start + dist] = self._buf[
        self._gap_end : self._gap_end + dist
    ]
    # Adjust indices accordingly
    self._gap_start += dist
    self._gap_end += dist
```
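Putting the operations together, the following is a minimal end-to-end sketch. The constructor, the length properties, `__str__`, and the body of `grow` are assumptions (they sit outside the excerpt shown here); the editing methods mirror the snippets above.

```python
class GapBuffer:
    def __init__(self, capacity: int = 8) -> None:
        # Assumption: a list of single characters, entirely gap at first
        self._buf: list[str] = [""] * capacity
        self._gap_start = 0
        self._gap_end = capacity

    @property
    def capacity(self) -> int:
        return len(self._buf)

    @property
    def gap_length(self) -> int:
        return self._gap_end - self._gap_start

    @property
    def prefix_length(self) -> int:
        return self._gap_start

    @property
    def suffix_length(self) -> int:
        return self.capacity - self._gap_end

    @property
    def string_length(self) -> int:
        return self.prefix_length + self.suffix_length

    def __str__(self) -> str:
        return "".join(self._buf[: self._gap_start] + self._buf[self._gap_end :])

    def grow(self, capacity: int) -> None:
        added_capacity = capacity - self.capacity
        assert added_capacity > 0
        # Keep prefix and suffix, widen the gap between them
        self._buf = (
            self._buf[: self._gap_start]
            + [""] * (self.gap_length + added_capacity)
            + self._buf[self._gap_end :]
        )
        self._gap_end += added_capacity

    def insert(self, val: str) -> None:
        if len(val) > self.gap_length:
            self.grow(max(self.capacity * 2, self.string_length + len(val)))
        self._buf[self._gap_start : self._gap_start + len(val)] = val
        self._gap_start += len(val)

    def backspace(self, dist: int = 1) -> None:
        assert dist <= self.prefix_length
        self._gap_start -= dist

    def delete(self, dist: int = 1) -> None:
        assert dist <= self.suffix_length
        self._gap_end += dist

    def left(self, dist: int = 1) -> None:
        assert dist <= self.prefix_length
        # The right-hand slice is copied first, so overlapping ranges are safe
        self._buf[self._gap_end - dist : self._gap_end] = self._buf[
            self._gap_start - dist : self._gap_start
        ]
        self._gap_start -= dist
        self._gap_end -= dist

    def right(self, dist: int = 1) -> None:
        assert dist <= self.suffix_length
        self._buf[self._gap_start : self._gap_start + dist] = self._buf[
            self._gap_end : self._gap_end + dist
        ]
        self._gap_start += dist
        self._gap_end += dist


buf = GapBuffer(capacity=4)
buf.insert("hello world")  # larger than the gap, so the buffer grows first
buf.left(5)                # cursor now sits just before "world"
buf.backspace()            # drop the space
buf.insert(", ")
buf.right(5)               # back to the end of the text
buf.insert("!")
result = str(buf)
```

Note that deleted characters are never erased: they simply end up inside the gap and get overwritten by later insertions.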


@ -0,0 +1,97 @@
---
title: "Bloom Filter"
date: 2024-07-14T17:46:40+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Probably cool"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
The filter can be used as an inexpensive first step when querying the actual
data is quite costly (e.g. as a first check for expensive cache lookups or large
data seeks).

[wiki]: https://en.wikipedia.org/wiki/Bloom_filter

<!--more-->
## What does it do?
A _Bloom Filter_ can be understood as a hash-set which can tell you one of two
things:

* An element is _not_ part of the set.
* An element _may be_ part of the set.

More specifically, one can tweak the parameters of the filter to make the
_false positive_ rate of membership queries quite low.

I won't be going into those calculations here, but they are quite trivial to
compute, or one can just look up appropriate values for their use case.
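For readers who want a starting point, the usual textbook estimates can be sketched as below. They assume independent, uniformly distributed hash functions, and the function names are hypothetical: with `m` bits, `k` hash functions and `n` inserted keys, the false positive rate is approximately `(1 - e^(-k*n/m))^k`.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive rate for m bits, k hashes and n keys."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_parameters(n: int, p: float) -> tuple[int, int]:
    """Bit count and hash count targeting a false positive rate p for n keys."""
    # m = -n * ln(p) / ln(2)^2, rounded up to a whole number of bits
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    # k = (m / n) * ln(2), rounded to the nearest usable integer
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = optimal_parameters(n=1000, p=0.01)
estimated = false_positive_rate(m, k, n=1000)
```

Because `k` has to be rounded to an integer, the estimated rate lands close to the target rather than exactly on it.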
## Implementation
I'll be using Python, which has the nifty ability of representing bitsets
through its built-in big integers quite easily.
We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
tweaked to use a different number, or even change it at construction time.
### Representation
A `BloomFilter` is just a set of bits and a list of hash functions.
```python
from typing import Callable

BIT_COUNT = 64

# The `class BloomFilter[T]` syntax requires Python 3.12 or later
class BloomFilter[T]:
    _bits: int
    _hash_functions: list[Callable[[T], int]]

    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        # Filter is initially empty
        self._bits = 0
        self._hash_functions = hash_functions
```
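The constructor leaves the choice of hash functions to the caller. One possible way to build a family of them, sketched here with a hypothetical `make_hash` helper, is to salt a cryptographic hash with a seed. The built-in `hash` is less convenient for this: it provides a single function, and for strings its value changes between interpreter runs unless `PYTHONHASHSEED` is fixed.

```python
import hashlib
from typing import Callable


def make_hash(seed: int) -> Callable[[str], int]:
    """Hypothetical helper: derive a distinct hash function from a seed."""

    def hash_function(val: str) -> int:
        # Prefix the value with the seed so each function hashes differently
        digest = hashlib.sha256(f"{seed}:{val}".encode()).digest()
        # Interpret the first 8 bytes as an unsigned integer
        return int.from_bytes(digest[:8], "big")

    return hash_function


hash_functions = [make_hash(seed) for seed in range(3)]
```

The resulting list can be passed straight to `BloomFilter(hash_functions)`. SHA-256 is heavier than a filter needs; a faster non-cryptographic hash would be a reasonable substitute.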
### Inserting a key
To add an element to the filter, we take the output from each hash function and
use that to set a bit in the filter. This combination of bits identifies the
element, which we can use for lookup later.
```python
def insert(self, val: T) -> None:
    # Iterate over each hash
    for f in self._hash_functions:
        n = f(val) % BIT_COUNT
        # Set the corresponding bit
        self._bits |= 1 << n
```
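As a small worked example of the bit manipulation, suppose three hash functions returned the made-up values 3, 70 and 129 for some key:

```python
BIT_COUNT = 64

bits = 0
# Hypothetical hash outputs for a single key
for h in (3, 70, 129):
    n = h % BIT_COUNT  # 3, 6 and 1 respectively
    bits |= 1 << n

# Bits 1, 3 and 6 are now set
assert bits == 0b1001010
```

Inserting another key whose hashes map to some of the same positions leaves those bits unchanged, which is where false positives come from.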
### Querying a key
Because the _Bloom Filter_ does not actually store its elements, but some
derived data from hashing them, it can only definitely say if an element _does
not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
against the actual underlying store.
```python
def may_contain(self, val: T) -> bool:
    for f in self._hash_functions:
        n = f(val) % BIT_COUNT
        # If one of the bits is unset, the value is definitely not present
        if not (self._bits & (1 << n)):
            return False
    # All bits were matched, `val` is likely to be part of the set
    return True
```
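To show how the filter sits in front of a real store, here is a self-contained sketch. The `seeded_hash` helper, the `store` dictionary and the `lookup` function are all hypothetical, and the class uses `typing.Generic` instead of the `class BloomFilter[T]` syntax so that it also runs on Python versions before 3.12.

```python
import hashlib
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
BIT_COUNT = 64


class BloomFilter(Generic[T]):
    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        self._bits = 0
        self._hash_functions = hash_functions

    def insert(self, val: T) -> None:
        for f in self._hash_functions:
            self._bits |= 1 << (f(val) % BIT_COUNT)

    def may_contain(self, val: T) -> bool:
        # Every bit derived from `val` must be set for a possible match
        return all(
            self._bits & (1 << (f(val) % BIT_COUNT)) for f in self._hash_functions
        )


def seeded_hash(seed: int) -> Callable[[str], int]:
    # Hypothetical helper: salt SHA-256 with a seed to get distinct functions
    def hash_function(val: str) -> int:
        digest = hashlib.sha256(f"{seed}:{val}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return hash_function


# Stand-in for an expensive backing store
store = {"apple": 1, "banana": 2, "cherry": 3}

bloom: BloomFilter[str] = BloomFilter([seeded_hash(seed) for seed in range(3)])
for key in store:
    bloom.insert(key)


def lookup(key: str) -> Optional[int]:
    # Skip the costly lookup when the filter rules the key out
    if not bloom.may_contain(key):
        return None
    # Otherwise the key may be present: confirm against the real store
    return store.get(key)
```

Inserted keys always pass the filter, so `lookup` never misses data that exists. A key that was never inserted either gets rejected early or, on a false positive, falls through to the store and still yields `None`.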