Compare commits

15 commits

Author SHA1 Message Date
Bruno BELANYI 763ee444d4 Add Bloom Filter post 2024-07-14 17:57:36 +01:00
Bruno BELANYI 5e3ba4fb04 posts: bloom-filter: add lookup 2024-07-14 17:57:04 +01:00
Bruno BELANYI 0030310952 posts: bloom-filter: add insertion 2024-07-14 17:56:19 +01:00
Bruno BELANYI dda444bdc0 posts: bloom-filter: add construction 2024-07-14 17:55:15 +01:00
Bruno BELANYI aea5587742 posts: bloom-filter: add presentation 2024-07-14 17:54:59 +01:00
Bruno BELANYI c13abdc134 Add Gap Buffer post 2024-07-14 17:54:27 +01:00
Bruno BELANYI 987078068f posts: add bloom-filter 2024-07-14 17:54:27 +01:00
Bruno BELANYI f0b3c77862 posts: gap-buffer: add movement 2024-07-14 17:54:27 +01:00
Bruno BELANYI 6a1c074e32 posts: gap-buffer: add deletion 2024-07-14 17:54:27 +01:00
Bruno BELANYI c413bb82a4 posts: gap-buffer: add insertion 2024-07-14 17:54:27 +01:00
Bruno BELANYI 937cd8e730 posts: gap-buffer: add growth 2024-07-14 17:54:27 +01:00
Bruno BELANYI 3ca80055e2 posts: gap-buffer: add accessors 2024-07-14 17:54:27 +01:00
Bruno BELANYI ea9fe25571 posts: gap-buffer: add construction 2024-07-14 17:54:27 +01:00
Bruno BELANYI bbcc1f97ce posts: gap-buffer: add presentation 2024-07-14 17:54:27 +01:00
Bruno BELANYI b56078f917 posts: add gap-buffer 2024-07-14 17:54:23 +01:00

@ -0,0 +1,97 @@
---
title: "Bloom Filter"
date: 2024-07-14T17:46:40+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "Probably cool"
tags:
- algorithms
- data structures
- python
categories:
- programming
series:
- Cool algorithms
favorite: false
disable_feed: false
---
The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
The filter can be used as an inexpensive first step when querying the actual
data is quite costly (e.g. as a first check before expensive cache lookups or
large data seeks).

[wiki]: https://en.wikipedia.org/wiki/Bloom_filter

<!--more-->
## What does it do?
A _Bloom Filter_ can be understood as a hash-set that can only tell you one of
two things:

* An element is _not_ part of the set.
* An element _may be_ part of the set.

More specifically, one can tweak the parameters of the filter (the number of
bits and the number of hash functions) to keep the _false positive_ rate of
membership queries quite low.
I won't be going into those calculations here, but they are quite trivial to
compute, or one can just look up appropriate values for their use case.
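
For reference, the usual approximation can be sketched in a few lines. With `m`
bits, `k` hash functions and `n` inserted elements, the expected false positive
rate is roughly `(1 - e^(-kn/m))^k`, assuming the hash functions are independent
and uniformly distributed. The helper names below are hypothetical and are not
part of the implementation in this post.

```python
import math


def false_positive_rate(bit_count: int, hash_count: int, item_count: int) -> float:
    """Approximate false positive rate, assuming independent, uniform hashes."""
    return (1 - math.exp(-hash_count * item_count / bit_count)) ** hash_count


def optimal_hash_count(bit_count: int, item_count: int) -> int:
    """Number of hash functions that minimises the approximation above."""
    return max(1, round(bit_count / item_count * math.log(2)))


# With 64 bits and 8 expected elements, 6 hash functions are optimal
print(optimal_hash_count(64, 8))
# The estimated false positive rate is then about 2%
print(f"{false_positive_rate(64, 6, 8):.3f}")
```

The estimate degrades quickly once more elements are inserted than the filter
was sized for, which is why the bit count is usually derived from the expected
number of elements.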
## Implementation
I'll be using Python, whose built-in arbitrary-precision integers make it easy
to represent bitsets.
We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
tweaked to use a different number, or even to choose it at construction time.
### Representation
A `BloomFilter` is just a set of bits and a list of hash functions.
```python
from collections.abc import Callable

BIT_COUNT = 64


# The `class BloomFilter[T]` generic syntax requires Python 3.12 or later
class BloomFilter[T]:
    _bits: int
    _hash_functions: list[Callable[[T], int]]

    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        # Filter is initially empty
        self._bits = 0
        self._hash_functions = hash_functions
```
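
The constructor leaves the choice of hash functions open. One common way to get
several of them, sketched here as an assumption rather than anything this post
prescribes, is to salt a single cryptographic hash with a different prefix per
function. `make_hash_functions` is a hypothetical helper, and it only handles
`str` keys.

```python
import hashlib
from collections.abc import Callable


def make_hash_functions(count: int) -> list[Callable[[str], int]]:
    """Derive `count` hash functions by salting SHA-256 with an index."""

    def make_one(seed: int) -> Callable[[str], int]:
        def hash_function(val: str) -> int:
            data = f"{seed}:{val}".encode()
            digest = hashlib.sha256(data).digest()
            # Interpret the first 8 bytes of the digest as an integer
            return int.from_bytes(digest[:8], "big")

        return hash_function

    return [make_one(seed) for seed in range(count)]


hash_functions = make_hash_functions(3)
# Each function is deterministic, and each seed yields a different function
print([f("hello") % 64 for f in hash_functions])
```

Python's built-in `hash` could serve as one of the functions, but its output for
strings changes between interpreter runs unless `PYTHONHASHSEED` is set, which
matters if the filter is ever persisted.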
### Inserting a key
To add an element to the filter, we take the output from each hash function and
use that to set a bit in the filter. This combination of bits acts as the
element's fingerprint, which we can check for at lookup time.
```python
    def insert(self, val: T) -> None:
        # Iterate over each hash
        for f in self._hash_functions:
            n = f(val) % BIT_COUNT
            # Set the corresponding bit
            self._bits |= 1 << n
```
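
To make the bit manipulation concrete, here is a small standalone sketch of the
same loop operating on a plain integer. The two toy hash functions are made up
for illustration and would be poor choices in practice.

```python
BIT_COUNT = 64

# Toy hash functions over integers, chosen only to keep the arithmetic readable
hash_functions = [lambda x: x, lambda x: 3 * x + 1]

bits = 0
for val in (5, 70):
    for f in hash_functions:
        n = f(val) % BIT_COUNT
        bits |= 1 << n

# 5 sets bits 5 and 16, 70 sets bits 6 and 19
print(sorted(i for i in range(BIT_COUNT) if bits & (1 << i)))
# → [5, 6, 16, 19]
```

Inserting the same value twice leaves `bits` unchanged, since setting a bit that
is already set has no effect.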
### Querying a key
Because the _Bloom Filter_ does not actually store its elements, but only some
data derived from hashing them, it can only say for certain that an element
_does not_ belong to it. Otherwise, the element _may_ be part of the set, and
should be checked against the actual underlying store.
```python
    def may_contain(self, val: T) -> bool:
        for f in self._hash_functions:
            n = f(val) % BIT_COUNT
            # If one of the bits is unset, the value is definitely not present
            if not (self._bits & (1 << n)):
                return False
        # All bits were matched, `val` may be part of the set
        return True
```
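
Putting the pieces together, the following self-contained sketch repeats the
class and exercises it. It uses `typing.Generic` rather than the
`class BloomFilter[T]` syntax so that it also runs on Python versions before
3.12, and the salted SHA-256 hash functions (`salted_hash`) are an assumption
made for this example.

```python
import hashlib
from collections.abc import Callable
from typing import Generic, TypeVar

T = TypeVar("T")
BIT_COUNT = 64


class BloomFilter(Generic[T]):
    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        self._bits = 0
        self._hash_functions = hash_functions

    def insert(self, val: T) -> None:
        for f in self._hash_functions:
            self._bits |= 1 << (f(val) % BIT_COUNT)

    def may_contain(self, val: T) -> bool:
        return all(
            self._bits & (1 << (f(val) % BIT_COUNT)) for f in self._hash_functions
        )


def salted_hash(seed: int) -> Callable[[str], int]:
    # Hypothetical helper: one hash function per seed
    def hash_function(val: str) -> int:
        digest = hashlib.sha256(f"{seed}:{val}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return hash_function


bloom: BloomFilter[str] = BloomFilter([salted_hash(i) for i in range(3)])
words = ["apple", "banana", "cherry"]
for word in words:
    bloom.insert(word)

# Inserted elements are always reported as possibly present
print(all(bloom.may_contain(word) for word in words))
# An empty filter rejects everything
print(BloomFilter([salted_hash(0)]).may_contain("apple"))
```

A `True` from `may_contain` for a word that was never inserted is a false
positive, which is why a positive answer should still be confirmed against the
underlying store.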