Compare commits
10 commits
937cd8e730
...
763ee444d4
Author | SHA1 | Date | |
---|---|---|---|
Bruno BELANYI | 763ee444d4 | ||
Bruno BELANYI | 5e3ba4fb04 | ||
Bruno BELANYI | 0030310952 | ||
Bruno BELANYI | dda444bdc0 | ||
Bruno BELANYI | aea5587742 | ||
Bruno BELANYI | c13abdc134 | ||
Bruno BELANYI | 987078068f | ||
Bruno BELANYI | f0b3c77862 | ||
Bruno BELANYI | 6a1c074e32 | ||
Bruno BELANYI | c413bb82a4 |
@@ -121,3 +121,71 @@ def grow(self, capacity: int) -> None:
    self._buf = new_buf
    self._gap_end += added_capacity
```

### Insertion

Inserting text at the cursor's position means filling up the gap in the middle
of the buffer. To do so we must first make sure that the gap is big enough, and
grow the buffer if it is not.

Then inserting the text is simply a matter of copying its characters in place,
and moving the start of the gap further right.

```python
def insert(self, val: str) -> None:
    # Ensure we have enough space to insert the whole string
    if len(val) > self.gap_length:
        self.grow(max(self.capacity * 2, self.string_length + len(val)))
    # Fill the gap with the given string
    self._buf[self._gap_start : self._gap_start + len(val)] = val
    self._gap_start += len(val)
```
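
To see the growth policy in action, here is a minimal, self-contained sketch. The constructor, the properties and the body of `grow` are assumptions reconstructed from the names used above (only the tail of `grow` is visible in this diff), and `text` is a hypothetical helper added for inspection.

```python
class GapBuffer:
    """Minimal stand-in for the post's class; constructor and `grow` are guesses."""

    def __init__(self, initial_capacity: int = 4) -> None:
        # Assumed layout: the whole buffer starts out as one big gap
        self._buf = [""] * initial_capacity
        self._gap_start = 0
        self._gap_end = initial_capacity

    @property
    def capacity(self) -> int:
        return len(self._buf)

    @property
    def gap_length(self) -> int:
        return self._gap_end - self._gap_start

    @property
    def string_length(self) -> int:
        return self.capacity - self.gap_length

    def grow(self, capacity: int) -> None:
        # Assumed behaviour: widen the gap, keeping prefix and suffix intact
        added_capacity = capacity - self.capacity
        if added_capacity <= 0:
            return
        new_buf = [""] * capacity
        new_buf[: self._gap_start] = self._buf[: self._gap_start]
        new_buf[self._gap_end + added_capacity :] = self._buf[self._gap_end :]
        self._buf = new_buf
        self._gap_end += added_capacity

    def insert(self, val: str) -> None:
        # Same logic as the method shown above
        if len(val) > self.gap_length:
            self.grow(max(self.capacity * 2, self.string_length + len(val)))
        self._buf[self._gap_start : self._gap_start + len(val)] = val
        self._gap_start += len(val)

    def text(self) -> str:
        # Hypothetical helper: the stored string is prefix + suffix, gap skipped
        return "".join(self._buf[: self._gap_start] + self._buf[self._gap_end :])


buf = GapBuffer(initial_capacity=4)
buf.insert("hi")        # fits in the initial 4-slot gap
buf.insert(", world")   # 7 characters for 2 free slots: the buffer grows first
print(buf.text(), buf.capacity)
```

The `max(...)` in `insert` doubles the capacity in the common case, and falls back to an exact fit when even doubling would not be enough for the inserted string.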

### Deletion

Removing text from the buffer simply expands the gap in the corresponding
direction, shortening the string's prefix or suffix. No characters are copied,
only an index moves, which makes deletion very cheap.

The methods are named after the `backspace` and `delete` keys on the keyboard.

```python
def backspace(self, dist: int = 1) -> None:
    assert dist <= self.prefix_length
    # Extend gap to the left
    self._gap_start -= dist

def delete(self, dist: int = 1) -> None:
    assert dist <= self.suffix_length
    # Extend gap to the right
    self._gap_end += dist
```

### Moving the cursor

Moving the cursor along the buffer will shift letters from one side of the gap
to the other, moving them across from prefix to suffix and back.

I find Python's list slicing not quite as elegant to read as a `memmove`, though
it does make for a very small and efficient implementation.

```python
def left(self, dist: int = 1) -> None:
    assert dist <= self.prefix_length
    # Shift the needed number of characters from end of prefix to start of suffix
    self._buf[self._gap_end - dist : self._gap_end] = self._buf[
        self._gap_start - dist : self._gap_start
    ]
    # Adjust indices accordingly
    self._gap_start -= dist
    self._gap_end -= dist

def right(self, dist: int = 1) -> None:
    assert dist <= self.suffix_length
    # Shift the needed number of characters from start of suffix to end of prefix
    self._buf[self._gap_start : self._gap_start + dist] = self._buf[
        self._gap_end : self._gap_end + dist
    ]
    # Adjust indices accordingly
    self._gap_start += dist
    self._gap_end += dist
```
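
A short editing session may help to visualise how movement and deletion interact. The sketch below is a fixed-capacity stand-in for the class: the constructor and the `show` helper are hypothetical, and `grow` is left out since nothing is inserted. The four editing methods are the ones shown above.

```python
class GapBuffer:
    """Fixed-capacity stand-in for the post's class, for tracing only."""

    def __init__(self, text: str, capacity: int) -> None:
        assert len(text) <= capacity
        # Assumed layout: text is the prefix, the gap fills the rest, no suffix
        self._buf = list(text) + [""] * (capacity - len(text))
        self._gap_start = len(text)
        self._gap_end = capacity

    @property
    def prefix_length(self) -> int:
        return self._gap_start

    @property
    def suffix_length(self) -> int:
        return len(self._buf) - self._gap_end

    def backspace(self, dist: int = 1) -> None:
        assert dist <= self.prefix_length
        self._gap_start -= dist

    def delete(self, dist: int = 1) -> None:
        assert dist <= self.suffix_length
        self._gap_end += dist

    def left(self, dist: int = 1) -> None:
        assert dist <= self.prefix_length
        self._buf[self._gap_end - dist : self._gap_end] = self._buf[
            self._gap_start - dist : self._gap_start
        ]
        self._gap_start -= dist
        self._gap_end -= dist

    def right(self, dist: int = 1) -> None:
        assert dist <= self.suffix_length
        self._buf[self._gap_start : self._gap_start + dist] = self._buf[
            self._gap_end : self._gap_end + dist
        ]
        self._gap_start += dist
        self._gap_end += dist

    def show(self) -> str:
        # Hypothetical helper: render the cursor as "|" between prefix and suffix
        prefix = "".join(self._buf[: self._gap_start])
        suffix = "".join(self._buf[self._gap_end :])
        return f"{prefix}|{suffix}"


buf = GapBuffer("hello world", capacity=16)
states = [buf.show()]
buf.left(5)      # move the cursor in front of "world"
states.append(buf.show())
buf.backspace()  # remove the space before the cursor
states.append(buf.show())
buf.delete(2)    # remove "wo" after the cursor
states.append(buf.show())
buf.right(3)     # move past the remaining "rld"
states.append(buf.show())
print(states)
```

Note that the right-hand slice is copied into a new list before the assignment happens, so a move longer than the gap (where source and destination ranges overlap) still behaves like `memmove` rather than `memcpy`.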

97
content/posts/2024-07-14-bloom-filter/index.md
Normal file
@@ -0,0 +1,97 @@
---
title: "Bloom Filter"
date: 2024-07-14T17:46:40+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Probably cool"
tags:
  - algorithms
  - data structures
  - python
categories:
  - programming
series:
  - Cool algorithms
favorite: false
disable_feed: false
---

The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.

The filter can be used as an inexpensive first step when querying the actual
data is quite costly (e.g. as a first check for expensive cache lookups or large
data seeks).

[wiki]: https://en.wikipedia.org/wiki/Bloom_filter

<!--more-->

## What does it do?

A _Bloom Filter_ can be understood as a hash-set which can only give one of two
answers:

* An element is _not_ part of the set.
* An element _may be_ part of the set.

More specifically, one can tweak the parameters of the filter (the number of
bits and the number of hash functions) to keep the _false positive_ rate of
membership queries quite low.

I won't be going into those calculations here, but they are quite trivial to
compute, or one can just look up appropriate values for their use case.
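
For reference, here is a small sketch of the usual textbook approximation, which assumes independent and uniformly distributed hash functions. With `m` bits, `k` hash functions and `n` inserted elements, the false positive rate is roughly `(1 - e^(-k*n/m))^k`. The function names are hypothetical and not part of the implementation below.

```python
import math


def false_positive_rate(bit_count: int, hash_count: int, item_count: int) -> float:
    # Standard approximation: (1 - e^(-k * n / m)) ^ k
    return (1 - math.exp(-hash_count * item_count / bit_count)) ** hash_count


def suggested_parameters(item_count: int, target_rate: float) -> tuple[int, int]:
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    bit_count = math.ceil(-item_count * math.log(target_rate) / math.log(2) ** 2)
    # k = (m / n) * ln 2, rounded to the nearest usable integer (at least 1)
    hash_count = max(1, round(bit_count / item_count * math.log(2)))
    return bit_count, hash_count


bits, hashes = suggested_parameters(item_count=1000, target_rate=0.01)
print(bits, hashes, false_positive_rate(bits, hashes, 1000))

# The same estimate for a small 64-bit filter with 2 hash functions and 10 elements
print(false_positive_rate(64, 2, 10))
```

Because the number of hash functions has to be an integer, the achieved rate only lands close to the target rather than exactly on it.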

## Implementation

I'll be using Python, which has the nifty ability of representing bitsets
through its built-in big integers quite easily.

We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
tweaked to use a different number, or even change it at construction time.
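
As a quick illustration of the bitset idea, here is a minimal sketch using nothing but a plain `int`. The bit indices are arbitrary values picked for the example.

```python
BIT_COUNT = 64

bits = 0          # empty bitset
bits |= 1 << 3    # set bit 3
bits |= 1 << 63   # set the highest bit of a 64-bit filter

# Testing a bit is a mask followed by a truthiness check
assert bits & (1 << 3)
assert not bits & (1 << 4)

# Indices below BIT_COUNT never need more than BIT_COUNT bits of storage
assert bits.bit_length() <= BIT_COUNT
print(bin(bits))
```

Since Python integers have arbitrary precision, a larger `BIT_COUNT` would work the same way, without any change of type.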

### Representation

A `BloomFilter` is just a set of bits and a list of hash functions.

```python
from collections.abc import Callable

BIT_COUNT = 64

# The `class Name[T]` generic syntax requires Python 3.12 or later
class BloomFilter[T]:
    _bits: int
    _hash_functions: list[Callable[[T], int]]

    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        # Filter is initially empty
        self._bits = 0
        self._hash_functions = hash_functions
```

### Inserting a key

To add an element to the filter, we take the output from each hash function and
use that to set a bit in the filter. This combination of bits will identify the
element, which we can use for lookup later.

```python
def insert(self, val: T) -> None:
    # Iterate over each hash
    for f in self._hash_functions:
        n = f(val) % BIT_COUNT
        # Set the corresponding bit
        self._bits |= 1 << n
```

### Querying a key

Because the _Bloom Filter_ does not actually store its elements, but some
derived data from hashing them, it can only definitely say if an element _does
not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
against the actual underlying store.

```python
def may_contain(self, val: T) -> bool:
    for f in self._hash_functions:
        n = f(val) % BIT_COUNT
        # If one of the bits is unset, the value is definitely not present
        if not (self._bits & (1 << n)):
            return False
    # All bits were matched, `val` is likely to be part of the set
    return True
```
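
Putting the pieces together, a usage sketch could look like the following. The class is a condensed copy of the methods above, without the generic parameter so that it runs on older Python versions. The `make_hash` helper is hypothetical: salting SHA-256 with a seed is just one assumed way of getting several different hash functions.

```python
import hashlib
from collections.abc import Callable

BIT_COUNT = 64


def make_hash(seed: int) -> Callable[[str], int]:
    # Hypothetical helper: one hash function per seed, by salting SHA-256
    def hash_function(val: str) -> int:
        digest = hashlib.sha256(f"{seed}:{val}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return hash_function


class BloomFilter:
    def __init__(self, hash_functions: list[Callable[[str], int]]) -> None:
        self._bits = 0
        self._hash_functions = hash_functions

    def insert(self, val: str) -> None:
        for f in self._hash_functions:
            self._bits |= 1 << (f(val) % BIT_COUNT)

    def may_contain(self, val: str) -> bool:
        # Every hashed bit must be set for a "maybe" answer
        return all(
            self._bits & (1 << (f(val) % BIT_COUNT)) for f in self._hash_functions
        )


bloom = BloomFilter([make_hash(seed) for seed in range(3)])
for word in ("apple", "banana"):
    bloom.insert(word)

# Inserted elements are always reported as possibly present
assert bloom.may_contain("apple")
assert bloom.may_contain("banana")

# Any other element is usually rejected, but a false positive is possible
print(bloom.may_contain("cherry"))
```

The only guarantee is the absence of false negatives. Whether `"cherry"` is rejected depends on which bits happen to collide.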