Cytopia icon indicating copy to clipboard operation
Cytopia copied to clipboard

Implement a Patricia Trie data structure

Open ghost opened this issue 4 years ago • 3 comments

A Patricia Trie is a data structure that is optimal for string autocompletion. Cytopia could greatly benefit from such a data structure. Notably:

  • In a search engine with recommendations
  • In a command console with autocompletion

Interface

For the Trie, we would like the following API:

  • ::Add(String Key, ValueType value): This will add a value node to the Key by traversing down the Trie. If the key already exists, an exception should be thrown.
  • ::operator[](String Key): This will search for the exact Key and return a reference to the Value. If the Key doesn't exist, it should throw an exception
  • ::Find(String key): This will return a Cursor object that represents a position in the Trie
  • A default ctor
  • An appropriate dtor For the Cursor object, we would like the following API:
  • ::begin(): Returns an iterator object to the first "leaf" Cursor that exists in the subtree.
  • ::end(): Returns an iterator to a Key that doesn't exist (the "null" cursor)
  • ::advance(String): Should set the Cursor object to the result of adding more characters to the current Cursor To understand better the mentioned API, here is a simple scenario on how we could use it:
Trie<SearchTag> m_Tags;
m_Tags.Add("police", SearchTag{...})
m_Tags.Add("pool", SearchTag{...})
m_Tags.Add("house", SearchTag{...})
/* At this point, the Trie should look like
                            o Root
                "po"  /     \  "house"
                      o          o
                    /   \
        "lice" /        \ "ol"
               o           o
(Sorry if this looks broken...)
*/
m_Tags.Find("pa"); // This should return null cursor (not actually null)
Cursor query = m_Tags.Find("p"); // This should return the Root
std::vector recommendations { query.begin(), query.end() }; // This should only contain "police" and "pool" KVPs
query.advance("o") // This should return the "po" node
auto It = query.begin(); // This is "police"
++It; // Now it's "pool"
*It; // This returns a KVP : "std::pair<String, ValueType>
++It; // Now it's "null" cursor
++It; // still "null" cursor

Implementation

I have a couple of recommendations to implement this.

  • Because it's an n-ary tree, use std::unique_ptr to represent child nodes
  • Since it's a Patricia Tree, each node should only contain the characters that follow through the transition. This means our string could be only a single character
  • For the reason mentioned above, there is a clear advantage to use a memory pool over the heap. Since that's not a high priority, I recommend that you use a Policy Allocator type in the class template arguments. It should default to std::allocator and use that allocator over new
  • There is a choice that needs to be made for how to store transitions. You could either use a std::unordered_map or just std::array since characters are integer types. Either way, your key must be a single char. There are advantages and drawbacks for both.
  • Because transitions only hold a single character, whenever you go down through a transition, you must make sure that the next node either stores a single character, or if multiple, you will need to make sure your query matches all these characters.
  • A Cursor should hold a Node, the query string, and a size_t to the first unverified character. For example with the previous tree, the Cursor at "root" would hold "p" with unverified index = 0 because the transition "po" required more than a single character
  • An Iterator object must hold a stack of the previous parents. For example the previous tree, query.begin() was "police" but holds a stack containing the "po" node. If there were more subparents in the query, then the stack would be used the same way we need a stack in Depth-first search
  • An Iterator object must also hold the current Node
  • When adding KVPs to the Trie, you must go down the Trie using transitions and validating them until you can't anymore. Once this happens, you must think about your situation: If there is no transition containing the next unverified character, then you can create the transition and flush the remaining unverified characters. If there exists another transition but it doesn't completely match your Key, then you must find the largest matching prefix of those two transitions and split the node in two. You will probably need the Cursor interface and helper functions to achieve this.
  • I don't think we need to be able to remove KVPs. If you think you know how to do implement this and it wouldn't be to hard, then you are free to do so

ghost avatar Sep 11 '19 18:09 ghost

I found an implementation here. https://www.geeksforgeeks.org/trie-insert-and-search/ what do you think about it ?? @Ercadio

Mograbi avatar Mar 09 '20 15:03 Mograbi

This is low priority. Should be after v0.3

ghost avatar May 05 '20 22:05 ghost

@Mograbi This implementation looks good. But it needs improvement and to follow the requirements listed above

ghost avatar May 05 '20 22:05 ghost