blog/content/posts/2024-06-30-trie/index.md

---
title: "Trie"
date: 2024-06-30T11:07:49+01:00
draft: false # I don't care for draft mode, git has branches for that
description: "A cool map"
tags:
  - algorithms
  - data structures
  - python
categories:
  - programming
series:
  - Cool algorithms
favorite: false
disable_feed: false
---

This time, let's talk about the [_Trie_][wiki], which is a tree-based mapping
structure most often used for string keys.

[wiki]: https://en.wikipedia.org/wiki/Trie

<!--more-->

## What does it do?

A _Trie_ can be used to map a set of string keys to their corresponding values,
without the need for a hash function. This also means you won't suffer from hash
collisions, though the tree-based structure will probably translate to slower
performance than a good hash table.

A _Trie_ is especially useful to represent a dictionary of words in the case of
spell correction, as it can easily be used to fuzzy match words under a given
edit distance (think [Levenshtein distance])

[Levenshtein distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

## Implementation

This implementation will be in Python for exposition purposes, even though
it already has a built-in `dict`.

### Representation

Creating a new `Trie` is easy: the root node starts off empty and without any
mapped values.

```python
class Trie[T]:
    _children: dict[str, Trie[T]]
    _value: T | None

    def __init__(self):
        # Each letter is mapped to a Trie
        self._children = defaultdict(Trie)
        # If we match a full string, we store the mapped value
        self._value = None
```

We're using a `defaultdict` for the children for ease of implementation in this
post. In reality, I would encourage you exit early when you can't match a given
character.

The string key will be implicit by the position of a node in the tree: the empty
string at the root, one-character strings as its direct children, etc...

### Search

An exact match look-up is easily done: we go down the tree until we've exhausted
the key. At that point we've either found a mapped value or not.

```python
def get(self, key: str) -> T | None:
    # Have we matched the full key?
    if not key:
        # Store the `T` if mapped, `None` otherwise
        return self._value
    # Otherwise, recurse on the child corresponding to the first letter
    return self._children[key[0]].get(key[1:])
```

### Insertion

Adding a new value to the _Trie_ is similar to a key lookup, only this time we
store the new value instead of returning it.

```python
def insert(self, key: str, value: T) -> bool:
    # Have we matched the full key?
    if not key:
        # Check whether we're overwriting a previous mapping
        was_mapped = self._value is None
        # Store the corresponding value
        self._value = value
        # Return whether we've performed an overwrite
        return was_mapped
      # Otherwise, recurse on the child corresponding to the first letter
      return self._children[key[0]].insert(key[1:], value)
```

### Removal

Removal should also look familiar.

```python
def remove(self, key: str) -> bool:
    # Have we matched the full key?
    if not key:
        was_mapped = self._value is None
        # Remove the value
        self._value = None
        # Return whether it was mapped
        return was_mapped
    # Otherwise, recurse on the child corresponding to the first letter
    return self._children[key[0]].remove(key[1:])
```

### Fuzzy matching

Fuzzily matching a given word is where the real difficulty is: the key is to
realize we can use the prefix-tree nature of a _Trie_ to avoid doing wasteful
work.

By leveraging the prefix visit order of the tree, we can build an iterative
Levenshtein distance matrix, in much the same way one would do so in its
[Dynamic Programming] implementation (see the [Wagner-Fisher algorithm]).

[Dynamic Programming]: https://en.wikipedia.org/wiki/Dynamic_programming
[Wagner-Fisher algorithm]: https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

```python
class FuzzyResult[T](NamedTuple):
    distance: int
    key: str
    value: T


def get_fuzzy(self, key: str, max_distance: int = 0) -> Iterator[FuzzyResult[T]]:
    def helper(
        current_word: str,
        node: Trie[T],
        previous_row: list[int],
    ) -> Iterator[tuple[int, T]]:
        # Iterative Levenshtein
        current_row = [previous_row[0] + 1]
        current_char = current_word[-1]
        for column, key_char in enumerate(key, start=1):
            insertion = current_row[column - 1] + 1
            deletion = previous_row[column] + 1
            replacement = previous_row[column - 1] + (key_char != current_char)
            current_row.append(min(insertion, deletion, replacement))

        # If we are under the max distance, match this node
        if (distance := current_row[-1]) <= max_distance and node._value != None:
            # Only if it has a value of course
            yield FuzzyResult(distance, current_word, node._value)

        # If we can potentially still match children, recurse
        if min(current_row) <= max_distance:
            for c, child in node._children.items():
                yield from helper(current_word + c, child, current_row)

    # Build the first row -- the edit distance from the empty string
    row = list(range(len(key) + 1))

    # Base case for the empty string
    if (distance := row[-1]) <= max_distance and self._value != None:
        yield FuzzyResult(distance, "", self._value)
    for c, child in self._children.items():
        yield from helper(c, child, row)
```
posts: add trie 2024-06-30 13:35:08 +02:00			`---`
			`title: "Trie"`
			`date: 2024-06-30T11:07:49+01:00`
			`draft: false # I don't care for draft mode, git has branches for that`
			`description: "A cool map"`
			`tags:`
			`- algorithms`
			`- data structures`
			`- python`
			`categories:`
			`- programming`
			`series:`
			`- Cool algorithms`
			`favorite: false`
			`disable_feed: false`
			`---`

			`This time, let's talk about the [_Trie_][wiki], which is a tree-based mapping`
			`structure most often used for string keys.`

			`[wiki]: https://en.wikipedia.org/wiki/Trie`

			`<!--more-->`
posts: trie: add presentation 2024-06-30 13:36:17 +02:00
			`## What does it do?`

			`A _Trie_ can be used to map a set of string keys to their corresponding values,`
			`without the need for a hash function. This also means you won't suffer from hash`
			`collisions, though the tree-based structure will probably translate to slower`
			`performance than a good hash table.`

			`A _Trie_ is especially useful to represent a dictionary of words in the case of`
			`spell correction, as it can easily be used to fuzzy match words under a given`
			`edit distance (think [Levenshtein distance])`

			`[Levenshtein distance]: https://en.wikipedia.org/wiki/Levenshtein_distance`
posts: trie: add construction 2024-06-30 13:36:42 +02:00
			`## Implementation`

			`This implementation will be in Python for exposition purposes, even though`
			it already has a built-in `dict`.

			`### Representation`

			Creating a new `Trie` is easy: the root node starts off empty and without any
			`mapped values.`

			```python
			`class Trie[T]:`
			`_children: dict[str, Trie[T]]`
			`_value: T \| None`

			`def __init__(self):`
			`# Each letter is mapped to a Trie`
			`self._children = defaultdict(Trie)`
			`# If we match a full string, we store the mapped value`
			`self._value = None`
			```

			We're using a `defaultdict` for the children for ease of implementation in this
			`post. In reality, I would encourage you exit early when you can't match a given`
			`character.`

			`The string key will be implicit by the position of a node in the tree: the empty`
			`string at the root, one-character strings as its direct children, etc...`
posts: trie: add search 2024-06-30 13:37:04 +02:00
			`### Search`

			`An exact match look-up is easily done: we go down the tree until we've exhausted`
			`the key. At that point we've either found a mapped value or not.`

			```python
			`def get(self, key: str) -> T \| None:`
			`# Have we matched the full key?`
			`if not key:`
			# Store the `T` if mapped, `None` otherwise
			`return self._value`
			`# Otherwise, recurse on the child corresponding to the first letter`
			`return self._children[key[0]].get(key[1:])`
			```
posts: trie: add insertion 2024-06-30 13:37:21 +02:00
			`### Insertion`

			`Adding a new value to the _Trie_ is similar to a key lookup, only this time we`
			`store the new value instead of returning it.`

			```python
			`def insert(self, key: str, value: T) -> bool:`
			`# Have we matched the full key?`
			`if not key:`
			`# Check whether we're overwriting a previous mapping`
			`was_mapped = self._value is None`
			`# Store the corresponding value`
			`self._value = value`
			`# Return whether we've performed an overwrite`
			`return was_mapped`
			`# Otherwise, recurse on the child corresponding to the first letter`
			`return self._children[key[0]].insert(key[1:], value)`
			```
posts: trie: add removal 2024-06-30 13:37:48 +02:00
			`### Removal`

			`Removal should also look familiar.`

			```python
			`def remove(self, key: str) -> bool:`
			`# Have we matched the full key?`
			`if not key:`
			`was_mapped = self._value is None`
			`# Remove the value`
			`self._value = None`
			`# Return whether it was mapped`
			`return was_mapped`
			`# Otherwise, recurse on the child corresponding to the first letter`
			`return self._children[key[0]].remove(key[1:])`
			```
posts: trie: add fuzzy matching 2024-06-30 13:38:01 +02:00
			`### Fuzzy matching`

			`Fuzzily matching a given word is where the real difficulty is: the key is to`
			`realize we can use the prefix-tree nature of a _Trie_ to avoid doing wasteful`
			`work.`

			`By leveraging the prefix visit order of the tree, we can build an iterative`
			`Levenshtein distance matrix, in much the same way one would do so in its`
			`[Dynamic Programming] implementation (see the [Wagner-Fisher algorithm]).`

			`[Dynamic Programming]: https://en.wikipedia.org/wiki/Dynamic_programming`
			`[Wagner-Fisher algorithm]: https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm`

			```python
			`class FuzzyResult[T](NamedTuple):`
			`distance: int`
			`key: str`
			`value: T`


			`def get_fuzzy(self, key: str, max_distance: int = 0) -> Iterator[FuzzyResult[T]]:`
			`def helper(`
			`current_word: str,`
			`node: Trie[T],`
			`previous_row: list[int],`
			`) -> Iterator[tuple[int, T]]:`
			`# Iterative Levenshtein`
			`current_row = [previous_row[0] + 1]`
			`current_char = current_word[-1]`
			`for column, key_char in enumerate(key, start=1):`
			`insertion = current_row[column - 1] + 1`
			`deletion = previous_row[column] + 1`
			`replacement = previous_row[column - 1] + (key_char != current_char)`
			`current_row.append(min(insertion, deletion, replacement))`

			`# If we are under the max distance, match this node`
			`if (distance := current_row[-1]) <= max_distance and node._value != None:`
			`# Only if it has a value of course`
			`yield FuzzyResult(distance, current_word, node._value)`

			`# If we can potentially still match children, recurse`
			`if min(current_row) <= max_distance:`
			`for c, child in node._children.items():`
			`yield from helper(current_word + c, child, current_row)`

			`# Build the first row -- the edit distance from the empty string`
			`row = list(range(len(key) + 1))`

			`# Base case for the empty string`
			`if (distance := row[-1]) <= max_distance and self._value != None:`
			`yield FuzzyResult(distance, "", self._value)`
			`for c, child in self._children.items():`
			`yield from helper(c, child, row)`
			```