Add Trie post

posts: trie: add fuzzy matching
posts: trie: add removal
2024-06-30 12:38:27 +01:00 · 2024-06-30 12:38:01 +01:00 · 2024-06-30 12:37:48 +01:00 · 2024-06-30 12:37:21 +01:00 · 2024-06-30 12:37:04 +01:00 · 2024-06-30 12:36:49 +01:00
2 changed files with 172 additions and 1 deletions
--- a/content/posts/2024-06-30-trie/index.md
+++ b/content/posts/2024-06-30-trie/index.md
@ -0,0 +1,171 @@
+---
+title: "Trie"
+date: 2024-06-30T11:07:49+01:00
+draft: false # I don't care for draft mode, git has branches for that
+description: "A cool map"
+tags:
+  - algorithms
+  - data structures
+  - python
+categories:
+  - programming
+series:
+  - Cool algorithms
+favorite: false
+disable_feed: false
+---
+
+This time, let's talk about the [_Trie_][wiki], which is a tree-based mapping
+structure most often used for string keys.
+
+[wiki]: https://en.wikipedia.org/wiki/Trie
+
+<!--more-->
+
+## What does it do?
+
+A _Trie_ can be used to map a set of string keys to their corresponding values,
+without the need for a hash function. This also means you won't suffer from hash
+collisions, though the tree-based structure will probably translate to slower
+performance than a good hash table.
+
+A _Trie_ is especially useful to represent a dictionary of words in the case of
+spell correction, as it can easily be used to fuzzy match words under a given
+edit distance (think [Levenshtein])
+
+[Levenshtein]: https://en.wikipedia.org/wiki/Levenshtein_distance
+
+## Implementation
+
+This implementation will be in Python for exposition purposes, even though
+it already has a built-in `dict`.
+
+### Representation
+
+Creating a new `Trie` is easy: the root node starts off empty and without any
+mapped values.
+
+```python
+class Trie[T]:
+    _children: dict[str, Trie[T]]
+    _value: T | None
+
+    def __init__(self):
+        # Each letter is mapped to a Trie
+        self._children = defaultdict(Trie)
+        # If we match a full string, we store the mapped value
+        self._value = None
+```
+
+We're using a `defaultdict` for the children for ease of implementation in this
+post. In reality, I would encourage you exit early when you can't match a given
+character, etc...
+
+The string key will be implicit by the position of a node in the tree: the empty
+string at the root, one-character strings as its direct children, etc...
+
+### Search
+
+An exact match look-up is easily done: we go down the tree until we've exhausted
+the key. At that point we've either found a mapped value or not.
+
+```python
+def get(self, key: str) -> T | None:
+    # Have we matched the full key?
+    if not key:
+        # Store the `T` if mapped, `None` otherwise
+        return self._value
+    # Otherwise, recurse on the child corresponding to the first letter
+    return self._children[key[0]].get(key[1:])
+```
+
+### Addition
+
+Adding a new value to the _Trie_ is similar to a key lookup, only this time we
+store the new value instead of returning it.
+
+```python
+def insert(self, key: str, value: T) -> bool:
+    # Have we matched the full key?
+    if not key:
+        # Check whether we're overwriting a previous mapping
+        was_mapped = self._value is None
+        # Store the corresponding value
+        self._value = value
+        # Return whether we've performed an overwrite
+        return was_mapped
+      # Otherwise, recurse on the child corresponding to the first letter
+      return self._children[key[0]].insert(key[1:], value)
+```
+
+### Removal
+
+Removal should also look familiar.
+
+```python
+def remove(self, key: str) -> bool:
+    # Have we matched the full key?
+    if not key:
+        was_mapped = self._value is None
+        # Remove the value
+        self._value = None
+        # Return whether it was mapped
+        return was_mapped
+    # Otherwise, recurse on the child corresponding to the first letter
+    return self._children[key[0]].remove(key[1:])
+```
+
+### Fuzzy matching
+
+Fuzzily matching a given word is where the real difficulty is: the key is to
+realize we can use the prefix-tree nature of a _Trie_ to avoid doing wasteful
+work.
+
+By leveraging the prefix visit order of the tree, we can build an iterative
+Levenshtein distance matrix, in much the same way one would do so in its
+[Dynamic Programming] implementation (see the [Wagner-Fisher algorithm]).
+
+[Dynamic Programming]: https://en.wikipedia.org/wiki/Dynamic_programming
+[Wagner-Fisher algorithm]: https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm
+
+```python
+class FuzzyResult[T](NamedTuple):
+    distance: int
+    key: str
+    value: T
+
+
+def get_fuzzy(self, key: str, max_distance: int = 0) -> Iterator[FuzzyResult[T]]:
+    def helper(
+        current_word: str,
+        node: Trie[T],
+        previous_row: list[int],
+    ) -> Iterator[tuple[int, T]]:
+        # Iterative Levenshtein
+        current_row = [previous_row[0] + 1]
+        current_char = current_word[-1]
+        for column, key_char in enumerate(key, start=1):
+            insertion = current_row[column - 1] + 1
+            deletion = previous_row[column] + 1
+            replacement = previous_row[column - 1] + (key_char != current_char)
+            current_row.append(min(insertion, deletion, replacement))
+
+        # If we are under the max distance, match this node
+        if (distance := current_row[-1]) <= max_distance and node._value != None:
+            # Only if it has a value of course
+            yield FuzzyResult(distance, current_word, node._value)
+
+        # If we can potentially still match children, recurse
+        if min(current_row) <= max_distance:
+            for c, child in node._children.items():
+                yield from helper(current_word + c, child, current_row)
+
+    # Build the first row -- the edit distance from the empty string
+    row = list(range(len(word) + 1))
+
+    # Base case for the empty string
+    if (distance := row[-1]) <= max_distance and node._value != None:
+        yield FuzzyResult(distance, "", node._value)
+    for c, child in self._children.items():
+        yield from helper(c, node, row)
+```
--- a/themes/hugo-atom-feed
+++ b/themes/hugo-atom-feed
@ -1 +1 @@
-Subproject commit 5da913dc46d2dadcaf6548256238c58c504476de
+Subproject commit d545effed9949bf834eaed09ad423ec3e030794f
Author	SHA1	Message	Date
Bruno BELANYI	193b43db00	Add Trie post All checks were successful ci/woodpecker/push/deploy/1 Pipeline was successful Details	2024-06-30 12:38:27 +01:00
Bruno BELANYI	7708927d4b	posts: trie: add fuzzy matching	2024-06-30 12:38:01 +01:00
Bruno BELANYI	c196e87878	posts: trie: add removal	2024-06-30 12:37:48 +01:00
Bruno BELANYI	a585c48169	posts: trie: add addition	2024-06-30 12:37:21 +01:00
Bruno BELANYI	3079647900	posts: trie: add search	2024-06-30 12:37:04 +01:00
Bruno BELANYI	d073a7223c	posts: trie: add construction	2024-06-30 12:36:49 +01:00
Bruno BELANYI	d1c7374ebe	posts: trie: add presentation	2024-06-30 12:36:49 +01:00
Bruno BELANYI	8e5ac9d1e6	posts: add trie	2024-06-30 12:35:08 +01:00
Bruno BELANYI	1cde7ff539	themes: hugo-atom-feed: bump submodule	2024-06-30 12:34:08 +01:00