Cytopia
Cytopia copied to clipboard
Implement a Patricia Trie data structure
A Patricia Trie is a data structure that is optimal for string autocompletion. Cytopia could greatly benefit from such a data structure. Notably:
- In a search engine with recommendations
- In a command console with autocompletion
Interface
For the Trie, we would like the following API:
-
::Add(String Key, ValueType value)
: This will add a value node to the Key by traversing down the Trie. If the key already exists, an exception should be thrown. -
::operator[](String Key)
: This will search for the exact Key and return a reference to the Value. If the Key doesn't exist, it should throw an exception -
::Find(String key)
: This will return a Cursor object that represents a position in the Trie - A default ctor
- An appropriate dtor For the Cursor object, we would like the following API:
-
::begin()
: Returns an iterator object to the first "leaf" Cursor that exists in the subtree. -
::end()
: Returns an iterator to a Key that doesn't exist (the "null" cursor) -
::advance(String)
: Should set the Cursor object to the result of adding more characters to the current Cursor To understand better the mentioned API, here is a simple scenario on how we could use it:
Trie<SearchTag> m_Tags;
m_Tags.Add("police", SearchTag{...})
m_Tags.Add("pool", SearchTag{...})
m_Tags.Add("house", SearchTag{...})
/* At this point, the Trie should look like
o Root
"po" / \ "house"
o o
/ \
"lice" / \ "ol"
o o
(Sorry if this looks broken...)
*/
m_Tags.Find("pa"); // This should return null cursor (not actually null)
Cursor query = m_Tags.Find("p"); // This should return the Root
std::vector recommendations { query.begin(), query.end() }; // This should only contain "police" and "pool" KVPs
query.advance("o") // This should return the "po" node
auto It = query.begin(); // This is "police"
++It; // Now it's "pool"
*It; // This returns a KVP : "std::pair<String, ValueType>
++It; // Now it's "null" cursor
++It; // still "null" cursor
Implementation
I have a couple of recommendations to implement this.
- Because it's an n-ary tree, use
std::unique_ptr
to represent child nodes - Since it's a Patricia Tree, each node should only contain the characters that follow through the transition. This means our string could be only a single character
- For the reason mentioned above, there is a clear advantage to use a memory pool over the heap. Since that's not a high priority, I recommend that you use a Policy Allocator type in the class template arguments. It should default to
std::allocator
and use that allocator overnew
- There is a choice that needs to be made for how to store transitions. You could either use a
std::unordered_map
or juststd::array
since characters are integer types. Either way, your key must be a singlechar
. There are advantages and drawbacks for both. - Because transitions only hold a single character, whenever you go down through a transition, you must make sure that the next node either stores a single character, or if multiple, you will need to make sure your query matches all these characters.
- A Cursor should hold a Node, the query string, and a
size_t
to the first unverified character. For example with the previous tree, the Cursor at "root" would hold "p" with unverified index = 0 because the transition "po" required more than a single character - An Iterator object must hold a stack of the previous parents. For example the previous tree,
query.begin()
was "police" but holds a stack containing the "po" node. If there were more subparents in the query, then the stack would be used the same way we need a stack in Depth-first search - An Iterator object must also hold the current Node
- When adding KVPs to the Trie, you must go down the Trie using transitions and validating them until you can't anymore. Once this happens, you must think about your situation: If there is no transition containing the next unverified character, then you can create the transition and flush the remaining unverified characters. If there exists another transition but it doesn't completely match your Key, then you must find the largest matching prefix of those two transitions and split the node in two. You will probably need the Cursor interface and helper functions to achieve this.
- I don't think we need to be able to remove KVPs. If you think you know how to do implement this and it wouldn't be to hard, then you are free to do so
I found an implementation here. https://www.geeksforgeeks.org/trie-insert-and-search/ what do you think about it ?? @Ercadio
This is low priority. Should be after v0.3
@Mograbi This implementation looks good. But it needs improvement and to follow the requirements listed above