Add Bloom Filter post

2 changed files with 165 additions and 0 deletions
--- a/content/posts/2024-07-06-gap-buffer/index.md
+++ b/content/posts/2024-07-06-gap-buffer/index.md
@@ -121,3 +121,71 @@ def grow(self, capacity: int) -> None:
    self._buf = new_buf
    self._gap_end += added_capacity
 ```
+
+### Insertion
+
+Inserting text at the cursor's position means filling up the gap in the middle
+of the buffer. To do so we must first make sure that the gap is big enough, or
+grow the buffer accordingly.
+
+Then inserting the text is simply a matter of copying its characters in place,
+and moving the start of the gap further right.
+
+```python
+def insert(self, val: str) -> None:
+    # Ensure we have enough space to insert the whole string
+    if len(val) > self.gap_length:
+        self.grow(max(self.capacity * 2, self.string_length + len(val)))
+    # Fill the gap with the given string
+    self._buf[self._gap_start : self._gap_start + len(val)] = val
+    self._gap_start += len(val)
+```
+
+### Deletion
+
+Removing text from the buffer simply expands the gap in the corresponding
+direction, shortening the string's prefix/suffix. This makes it very cheap.
+
+The methods are named after the `backspace` and `delete` keys on the keyboard.
+
+```python
+def backspace(self, dist: int = 1) -> None:
+    assert dist <= self.prefix_length
+    # Extend gap to the left
+    self._gap_start -= dist
+
+def delete(self, dist: int = 1) -> None:
+    assert dist <= self.suffix_length
+    # Extend gap to the right
+    self._gap_end += dist
+```
+
+### Moving the cursor
+
+Moving the cursor along the buffer will shift letters from one side of the gap
+to the other, moving them across from prefix to suffix and back.
+
+I find Python's list slicing not quite as elegant to read as a `memmove`, though
+it does make for a very small and efficient implementation.
+
+```python
+def left(self, dist: int = 1) -> None:
+    assert dist <= self.prefix_length
+    # Shift the needed number of characters from end of prefix to start of suffix
+    self._buf[self._gap_end - dist : self._gap_end] = self._buf[
+        self._gap_start - dist : self._gap_start
+    ]
+    # Adjust indices accordingly
+    self._gap_start -= dist
+    self._gap_end -= dist
+
+def right(self, dist: int = 1) -> None:
+    assert dist <= self.suffix_length
+    # Shift the needed number of characters from start of suffix to end of prefix
+    self._buf[self._gap_start : self._gap_start + dist] = self._buf[
+        self._gap_end : self._gap_end + dist
+    ]
+    # Adjust indices accordingly
+    self._gap_start += dist
+    self._gap_end += dist
+```
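The gap buffer methods added in the hunk above can be exercised end to end with a sketch like the one below. Only `insert`, `backspace`, `delete`, `left`, `right`, and the last two lines of `grow` come from the post; the class name `GapBuffer`, the constructor, the length properties, the rest of `grow`, and the `text` helper are assumptions filled in so that the example runs on its own.

```python
class GapBuffer:
    """Sketch of a gap buffer laid out as: prefix | gap | suffix."""

    def __init__(self, capacity: int = 8) -> None:
        # Assumed constructor: the gap initially spans the whole buffer
        self._buf: list[str] = [""] * capacity
        self._gap_start = 0
        self._gap_end = capacity

    # Assumed properties, named after their uses in the post's methods
    @property
    def capacity(self) -> int:
        return len(self._buf)

    @property
    def gap_length(self) -> int:
        return self._gap_end - self._gap_start

    @property
    def prefix_length(self) -> int:
        return self._gap_start

    @property
    def suffix_length(self) -> int:
        return self.capacity - self._gap_end

    @property
    def string_length(self) -> int:
        return self.prefix_length + self.suffix_length

    def grow(self, capacity: int) -> None:
        # Assumed body: keep the prefix, widen the gap, push the suffix back
        assert capacity >= self.capacity
        added_capacity = capacity - self.capacity
        new_buf = (
            self._buf[: self._gap_start]
            + [""] * (self.gap_length + added_capacity)
            + self._buf[self._gap_end :]
        )
        self._buf = new_buf
        self._gap_end += added_capacity

    def insert(self, val: str) -> None:
        if len(val) > self.gap_length:
            self.grow(max(self.capacity * 2, self.string_length + len(val)))
        self._buf[self._gap_start : self._gap_start + len(val)] = val
        self._gap_start += len(val)

    def backspace(self, dist: int = 1) -> None:
        assert dist <= self.prefix_length
        self._gap_start -= dist

    def delete(self, dist: int = 1) -> None:
        assert dist <= self.suffix_length
        self._gap_end += dist

    def left(self, dist: int = 1) -> None:
        assert dist <= self.prefix_length
        self._buf[self._gap_end - dist : self._gap_end] = self._buf[
            self._gap_start - dist : self._gap_start
        ]
        self._gap_start -= dist
        self._gap_end -= dist

    def right(self, dist: int = 1) -> None:
        assert dist <= self.suffix_length
        self._buf[self._gap_start : self._gap_start + dist] = self._buf[
            self._gap_end : self._gap_end + dist
        ]
        self._gap_start += dist
        self._gap_end += dist

    def text(self) -> str:
        # Hypothetical helper: the string is the prefix followed by the suffix
        return "".join(self._buf[: self._gap_start] + self._buf[self._gap_end :])


buf = GapBuffer(capacity=4)
buf.insert("helo")   # fills the buffer exactly, leaving an empty gap
buf.left(1)          # cursor now sits between "hel" and "o"
buf.insert("l")      # gap is empty, so the buffer grows to 8 first
print(buf.text())    # hello
```

Stale characters left inside the gap after `backspace`, `delete`, or a cursor move are never cleared. They are simply overwritten by later insertions, which is what keeps those operations cheap.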
--- a/content/posts/2024-07-14-bloom-filter/index.md
+++ b/content/posts/2024-07-14-bloom-filter/index.md
@@ -0,0 +1,97 @@
+---
+title: "Bloom Filter"
+date: 2024-07-14T17:46:40+01:00
+draft: false # I don't care for draft mode, git has branches for that
+description: "Probably cool"
+tags:
+  - algorithms
+  - data structures
+  - python
+categories:
+  - programming
+series:
+- Cool algorithms
+favorite: false
+disable_feed: false
+---
+
+The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
+
+The filter can be used as an inexpensive first step when querying the actual
+data is quite costly (e.g. as a first check for expensive cache lookups or large
+data seeks).
+
+[wiki]: https://en.wikipedia.org/wiki/Bloom_filter
+
+<!--more-->
+
+## What does it do?
+
+A _Bloom Filter_ can be understood as a hash-set with only two possible answers:
+
+* An element is _not_ part of the set.
+* An element _may be_ part of the set.
+
+More specifically, one can tune the filter's parameters (its number of bits and
+of hash functions) to keep the _false positive_ rate of membership queries low.
+
+I won't be going into those calculations here, but they are quite trivial to
+compute, or one can just look up appropriate values for their use case.
+
+## Implementation
+
+I'll be using Python, which has the nifty ability of representing bitsets
+through its built-in big integers quite easily.
+
+We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
+tweaked to use a different number, or even change it at construction time.
+
+### Representation
+
+A `BloomFilter` is just a set of bits and a list of hash functions.
+
+```python
+BIT_COUNT = 64
+
+class BloomFilter[T]:
+    _bits: int
+    _hash_functions: list[Callable[[T], int]]
+
+    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
+        # Filter is initially empty
+        self._bits = 0
+        self._hash_functions = hash_functions
+```
+
+### Inserting a key
+
+To add an element to the filter, we take the output from each hash function and
+use that to set a bit in the filter. This combination of bits will identify the
+element, which we can use for lookup later.
+
+```python
+def insert(self, val: T) -> None:
+    # Iterate over each hash
+    for f in self._hash_functions:
+        n = f(val) % BIT_COUNT
+        # Set the corresponding bit
+        self._bits |= 1 << n
+```
+
+### Querying a key
+
+Because the _Bloom Filter_ does not actually store its elements, but some
+derived data from hashing them, it can only definitely say if an element _does
+not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
+against the actual underlying store.
+
+```python
+def may_contain(self, val: T) -> bool:
+    for f in self._hash_functions:
+        n = f(val) % BIT_COUNT
+        # If one of the bits is unset, the value is definitely not present
+        if not (self._bits & (1 << n)):
+            return False
+    # All bits were set: `val` may be part of the set (or is a false positive)
+    return True
+```
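The pieces of the Bloom filter post above can be put together into one runnable sketch. A few things here are assumptions rather than part of the post: the class uses `typing.Generic` instead of the post's `class BloomFilter[T]` syntax so that it also runs on Python versions before 3.12, `make_hash` is a hypothetical helper that derives seeded hash functions from BLAKE2b, and `false_positive_rate` applies the usual approximation `(1 - e^(-k*n/m))^k` for `n` inserted keys, `m` bits, and `k` hash functions, which the post mentions but does not spell out.

```python
import hashlib
import math
from collections.abc import Callable
from typing import Generic, TypeVar

T = TypeVar("T")
BIT_COUNT = 64


class BloomFilter(Generic[T]):
    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        # Filter is initially empty
        self._bits = 0
        self._hash_functions = hash_functions

    def insert(self, val: T) -> None:
        for f in self._hash_functions:
            # Set the bit selected by each hash function
            self._bits |= 1 << (f(val) % BIT_COUNT)

    def may_contain(self, val: T) -> bool:
        for f in self._hash_functions:
            # A single unset bit proves the value was never inserted
            if not (self._bits & (1 << (f(val) % BIT_COUNT))):
                return False
        return True


def make_hash(seed: int) -> Callable[[str], int]:
    """Hypothetical helper: one BLAKE2b-based hash function per seed."""

    def h(val: str) -> int:
        digest = hashlib.blake2b(
            val.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little")

    return h


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false positive rate for n keys, m bits, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


bloom = BloomFilter([make_hash(seed) for seed in range(3)])
bloom.insert("apple")
bloom.insert("banana")
print(bloom.may_contain("apple"))   # True: inserted keys are never missed
print(bloom.may_contain("cherry"))  # usually False, but may be a false positive
print(false_positive_rate(n=10, m=BIT_COUNT, k=3))  # a little over 5%
```

With only 64 bits the filter saturates quickly, so a real use case would likely want a larger `BIT_COUNT` chosen from the expected number of keys.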
Author	SHA1	Message	Date
Bruno BELANYI	763ee444d4	Add Bloom Filter post	2024-07-14 17:57:36 +01:00
Bruno BELANYI	5e3ba4fb04	posts: bloom-filter: add lookup	2024-07-14 17:57:04 +01:00
Bruno BELANYI	0030310952	posts: bloom-filter: add insertion	2024-07-14 17:56:19 +01:00
Bruno BELANYI	dda444bdc0	posts: bloom-filter: add construction	2024-07-14 17:55:15 +01:00
Bruno BELANYI	aea5587742	posts: bloom-filter: add presentation	2024-07-14 17:54:59 +01:00
Bruno BELANYI	c13abdc134	Add Gap Buffer post	2024-07-14 17:54:27 +01:00
Bruno BELANYI	987078068f	posts: add bloom-filter	2024-07-14 17:54:27 +01:00
Bruno BELANYI	f0b3c77862	posts: gap-buffer: add movement	2024-07-14 17:54:27 +01:00
Bruno BELANYI	6a1c074e32	posts: gap-buffer: add deletion	2024-07-14 17:54:27 +01:00
Bruno BELANYI	c413bb82a4	posts: gap-buffer: add insertion	2024-07-14 17:54:27 +01:00