From f4a64b2a37a75f81438925f0598204a052afd2f8 Mon Sep 17 00:00:00 2001
From: Bruno BELANYI
Date: Sun, 14 Jul 2024 17:53:25 +0100
Subject: [PATCH 1/5] posts: add bloom-filter

---
 .../posts/2024-07-14-bloom-filter/index.md    | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 content/posts/2024-07-14-bloom-filter/index.md

diff --git a/content/posts/2024-07-14-bloom-filter/index.md b/content/posts/2024-07-14-bloom-filter/index.md
new file mode 100644
index 0000000..98cfc1e
--- /dev/null
+++ b/content/posts/2024-07-14-bloom-filter/index.md
@@ -0,0 +1,26 @@
+---
+title: "Bloom Filter"
+date: 2024-07-14T17:46:40+01:00
+draft: false # I don't care for draft mode, git has branches for that
+description: "Probably cool"
+tags:
+  - algorithms
+  - data structures
+  - python
+categories:
+  - programming
+series:
+- Cool algorithms
+favorite: false
+disable_feed: false
+---
+
+The [_Bloom Filter_][wiki] is a probabilistic data structure for set membership.
+
+The filter can be used as an inexpensive first step when querying the actual
+data is quite costly (e.g. as a first check for expensive cache lookups or large
+data seeks).
+
+[wiki]: https://en.wikipedia.org/wiki/Bloom_filter
+
+

From 3992996a89dc3183c2563939cbd8de2a941cd393 Mon Sep 17 00:00:00 2001
From: Bruno BELANYI
Date: Sun, 14 Jul 2024 17:54:59 +0100
Subject: [PATCH 2/5] posts: bloom-filter: add presentation

---
 content/posts/2024-07-14-bloom-filter/index.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/content/posts/2024-07-14-bloom-filter/index.md b/content/posts/2024-07-14-bloom-filter/index.md
index 98cfc1e..0a82882 100644
--- a/content/posts/2024-07-14-bloom-filter/index.md
+++ b/content/posts/2024-07-14-bloom-filter/index.md
@@ -24,3 +24,16 @@ data seeks).
 [wiki]: https://en.wikipedia.org/wiki/Bloom_filter
 
 
+
+## What does it do?
+
+A _Bloom Filter_ can be understood as a hash-set which can tell you either:
+
+* An element is _not_ part of the set.
+* An element _may be_ part of the set.
+
+More specifically, one can tweak the parameters of the filter to make it so that
+the _false positive_ rate of membership is quite low.
+
+I won't be going into those calculations here, but they are quite trivial to
+compute, or one can just look up appropriate values for their use case.

From 798116716f528a5a439d1bc490ec1a955d548e04 Mon Sep 17 00:00:00 2001
From: Bruno BELANYI
Date: Sun, 14 Jul 2024 17:55:15 +0100
Subject: [PATCH 3/5] posts: bloom-filter: add construction

---
 .../posts/2024-07-14-bloom-filter/index.md    | 25 +++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/content/posts/2024-07-14-bloom-filter/index.md b/content/posts/2024-07-14-bloom-filter/index.md
index 0a82882..547d50f 100644
--- a/content/posts/2024-07-14-bloom-filter/index.md
+++ b/content/posts/2024-07-14-bloom-filter/index.md
@@ -37,3 +37,28 @@ the _false positive_ rate of membership is quite low.
 
 I won't be going into those calculations here, but they are quite trivial to
 compute, or one can just look up appropriate values for their use case.
+
+## Implementation
+
+I'll be using Python, which has the nifty ability to represent bitsets
+through its built-in big integers quite easily.
+
+We'll be assuming a `BIT_COUNT` of 64 here, but the implementation can easily be
+tweaked to use a different number, or even change it at construction time.
+
+### Representation
+
+A `BloomFilter` is just a set of bits and a list of hash functions.
+
+```python
+BIT_COUNT = 64
+
+class BloomFilter[T]:
+    _bits: int
+    _hash_functions: list[Callable[[T], int]]
+
+    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
+        # Filter is initially empty
+        self._bits = 0
+        self._hash_functions = hash_functions
+```

From 2c31c1aff294231f18f0d2df9a96e4c9878ae5ee Mon Sep 17 00:00:00 2001
From: Bruno BELANYI
Date: Sun, 14 Jul 2024 17:55:33 +0100
Subject: [PATCH 4/5] posts: bloom-filter: add insertion

---
 content/posts/2024-07-14-bloom-filter/index.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/content/posts/2024-07-14-bloom-filter/index.md b/content/posts/2024-07-14-bloom-filter/index.md
index 547d50f..1d593a7 100644
--- a/content/posts/2024-07-14-bloom-filter/index.md
+++ b/content/posts/2024-07-14-bloom-filter/index.md
@@ -62,3 +62,18 @@ class BloomFilter[T]:
         self._bits = 0
         self._hash_functions = hash_functions
 ```
+
+### Inserting a key
+
+To add an element to the filter, we take the output from each hash function and
+use that to set a bit in the filter. This combination of bits identifies the
+element, which we can use for lookup later.
+
+```python
+def insert(self, val: T) -> None:
+    # Iterate over each hash
+    for f in self._hash_functions:
+        n = f(val) % BIT_COUNT
+        # Set the corresponding bit
+        self._bits |= 1 << n
+```

From 27152689eaae20208cd390e980255d66b09bd0f3 Mon Sep 17 00:00:00 2001
From: Bruno BELANYI
Date: Sun, 14 Jul 2024 17:56:33 +0100
Subject: [PATCH 5/5] posts: bloom-filter: add lookup

---
 content/posts/2024-07-14-bloom-filter/index.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/content/posts/2024-07-14-bloom-filter/index.md b/content/posts/2024-07-14-bloom-filter/index.md
index 1d593a7..93107d4 100644
--- a/content/posts/2024-07-14-bloom-filter/index.md
+++ b/content/posts/2024-07-14-bloom-filter/index.md
@@ -77,3 +77,21 @@ def insert(self, val: T) -> None:
         # Set the corresponding bit
         self._bits |= 1 << n
 ```
+
+### Querying a key
+
+Because the _Bloom Filter_ does not actually store its elements, but some
+derived data from hashing them, it can only say for certain that an element _does
+not_ belong to it. Otherwise, it _may_ be part of the set, and should be checked
+against the actual underlying store.
+
+```python
+def may_contain(self, val: T) -> bool:
+    for f in self._hash_functions:
+        n = f(val) % BIT_COUNT
+        # If one of the bits is unset, the value is definitely not present
+        if not (self._bits & (1 << n)):
+            return False
+    # All bits were matched, `val` is likely to be part of the set
+    return True
+```
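
For anyone who wants to try the series end to end, here is a minimal sketch that stitches the three snippets into one runnable file. A few things in it are assumptions rather than part of the post: it uses `Generic[T]` instead of the `class BloomFilter[T]` syntax so it also runs on Python versions before 3.12, the `make_hash` helper (salted SHA-256) is a hypothetical way to obtain several hash functions, and `estimated_false_positive_rate` uses the commonly quoted approximation `(1 - e^(-k*n/m))^k` for `k` hash functions, `n` inserted elements and `m` bits.

```python
import hashlib
import math
from collections.abc import Callable
from typing import Generic, TypeVar

T = TypeVar("T")

BIT_COUNT = 64


def make_hash(salt: bytes) -> Callable[[str], int]:
    """Hypothetical helper: build one hash function per salt from SHA-256."""

    def h(val: str) -> int:
        digest = hashlib.sha256(salt + val.encode("utf-8")).digest()
        # Keep the first 8 bytes as an unsigned integer
        return int.from_bytes(digest[:8], "big")

    return h


class BloomFilter(Generic[T]):
    def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
        # Filter is initially empty
        self._bits = 0
        self._hash_functions = hash_functions

    def insert(self, val: T) -> None:
        # Set one bit per hash function
        for f in self._hash_functions:
            self._bits |= 1 << (f(val) % BIT_COUNT)

    def may_contain(self, val: T) -> bool:
        # A single unset bit proves the value was never inserted
        return all(
            self._bits & (1 << (f(val) % BIT_COUNT))
            for f in self._hash_functions
        )


def estimated_false_positive_rate(
    hash_count: int, inserted: int, bit_count: int = BIT_COUNT
) -> float:
    # Usual approximation, assuming independent, uniformly distributed hashes
    return (1 - math.exp(-hash_count * inserted / bit_count)) ** hash_count


bloom: BloomFilter[str] = BloomFilter(
    [make_hash(b"a"), make_hash(b"b"), make_hash(b"c")]
)
for word in ("apple", "banana", "cherry"):
    bloom.insert(word)
```

Every inserted word is guaranteed to make `may_contain` return `True`. A word that was never inserted usually returns `False`, but it can collide on all three bits, in which case the caller still has to check the real store.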