hat-trie
hat-trie copied to clipboard
Case insensitive prefix search, data is case sensitive.
Hi,
I admit, i haven't tried it. But looking at the code it seems to be doing case-sensitive searching. Imagine i put a bunch of files paths in the trie with structures like:
/home/a/Desktop
/home/a/Downloads
/home/a/Pictures
I'm guessing a prefix search of:
tsl::htrie_set<char> set = {"/home/a/Desktop", "/home/a/Downloads", "/home/a/Pictures"};
set.equal_prefix_range("/home/a/d");
note the lowercase d in the equal_prefix_range call, it will probably not match "/home/a/Desktop" and "/home/a/Downloads" while that would be desirable. Just lowercase everything would fix this, but then the data does not represent the actual paths anymore (both pats could exists in a linux/unix world as it is case sensitive). Using a map and have a lowercase -> uppercase mapping could potentially be a solution as well, but that also doubles the data usage (at the very least) thus kinda defeating the point of using a nice HAT trie in the first place (in terms of memory efficiency).
I do realize that there is a bit of trouble in making the equal_prefix_range case-insensitive (ideally optional, case-sensitive and case-insensitive). You don't know the type i'm putting in as string. It might be a std::string, might be a std::string_view, a byte array, a QString.. You just don't know therefore can't expect a call on a character (like toLowercase() for example) to exist.
So i have a bit of a request. Could you add a function that allows me to set how to compare a prefix?
I'm thinking of a:
tsl::htrie_set::set_character_compare(...);
Then in the compare functions you use whatever is set by the user.
Again, i could be completely wrong and it's already possible, but it doesn't look like that from reading the code ;)
I'm curious about your opinion!
Best regards, Mark
Hi,
Yes currently only case-sensitive lookups are supported.
I would like to add something like the Traits
template parameter in std::string
, but it's difficult.
The main problem is that when I go down the trie, I go down byte by byte and not character by character. It's the same for a simple ASCII alphabet but different for multi-bytes UTF-8 characters. So even if I provide a way for the user to use a custom comparator (though in my implementation I don't directly compare two bytes, I convert a char
to an unsigned int
which will be used as position in the array of children), the user would only be able to compare two bytes and not two characters.
Also if I allow the trie to be case sensitive while providing case insensitive lookups, it means that lookups will have to go in different branches of the trie (possible but more complicate to implement).
I would like to add this feature but I don't really know what would be the best way to proceed to keep an intuitive interface that supports UTF-8 strings.
Hi,
It sounds like the current way (casting char to int ad using that as array index) is a way that has to go if you want to consider case insensitive as well (or rather, any for of compare). What would that do for performance?
I tried with and boost::flat_map<char, node*>
a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)
But it's possible to work that out, instead of converting the char
to int
you try both to_upper(char) -> int
and to_lower(char) -> int
paths. The main problem is really multi-bytes UTF-8 strings.
Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)
Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)
This should not be a problem, as long as it is implemented in a separate function.
I tried with and
boost::flat_map<char, node*>
a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)But it's possible to work that out, instead of converting the
char
toint
you try bothto_upper(char) -> int
andto_lower(char) -> int
paths. The main problem is really multi-bytes UTF-8 strings.
UTF-8 is a mess to work with. For example, a single accented letter like "á" can be represented in at least two ways: by a single codepoint (U+00E1) or by two codepoints (U+0061 and U+0301).
It will be HARD!