hat-trie icon indicating copy to clipboard operation
hat-trie copied to clipboard

Case insensitive prefix search, data is case sensitive.

Open markg85 opened this issue 7 years ago • 6 comments

Hi,

I admit, i haven't tried it. But looking at the code it seems to be doing case-sensitive searching. Imagine i put a bunch of files paths in the trie with structures like:

/home/a/Desktop
/home/a/Downloads
/home/a/Pictures

I'm guessing a prefix search of:

tsl::htrie_set<char> set = {"/home/a/Desktop", "/home/a/Downloads", "/home/a/Pictures"};
set.equal_prefix_range("/home/a/d");

note the lowercase d in the equal_prefix_range call, it will probably not match "/home/a/Desktop" and "/home/a/Downloads" while that would be desirable. Just lowercase everything would fix this, but then the data does not represent the actual paths anymore (both pats could exists in a linux/unix world as it is case sensitive). Using a map and have a lowercase -> uppercase mapping could potentially be a solution as well, but that also doubles the data usage (at the very least) thus kinda defeating the point of using a nice HAT trie in the first place (in terms of memory efficiency).

I do realize that there is a bit of trouble in making the equal_prefix_range case-insensitive (ideally optional, case-sensitive and case-insensitive). You don't know the type i'm putting in as string. It might be a std::string, might be a std::string_view, a byte array, a QString.. You just don't know therefore can't expect a call on a character (like toLowercase() for example) to exist.

So i have a bit of a request. Could you add a function that allows me to set how to compare a prefix?

I'm thinking of a:

tsl::htrie_set::set_character_compare(...);

Then in the compare functions you use whatever is set by the user.

Again, i could be completely wrong and it's already possible, but it doesn't look like that from reading the code ;)

I'm curious about your opinion!

Best regards, Mark

markg85 avatar Dec 27 '17 01:12 markg85

Hi,

Yes currently only case-sensitive lookups are supported.

I would like to add something like the Traits template parameter in std::string, but it's difficult.

The main problem is that when I go down the trie, I go down byte by byte and not character by character. It's the same for a simple ASCII alphabet but different for multi-bytes UTF-8 characters. So even if I provide a way for the user to use a custom comparator (though in my implementation I don't directly compare two bytes, I convert a char to an unsigned int which will be used as position in the array of children), the user would only be able to compare two bytes and not two characters.

Also if I allow the trie to be case sensitive while providing case insensitive lookups, it means that lookups will have to go in different branches of the trie (possible but more complicate to implement).

I would like to add this feature but I don't really know what would be the best way to proceed to keep an intuitive interface that supports UTF-8 strings.

Tessil avatar Dec 27 '17 17:12 Tessil

Hi,

It sounds like the current way (casting char to int ad using that as array index) is a way that has to go if you want to consider case insensitive as well (or rather, any for of compare). What would that do for performance?

markg85 avatar Dec 27 '17 18:12 markg85

I tried with and boost::flat_map<char, node*> a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)

But it's possible to work that out, instead of converting the char to int you try both to_upper(char) -> int and to_lower(char) -> int paths. The main problem is really multi-bytes UTF-8 strings.

Tessil avatar Dec 27 '17 18:12 Tessil

Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)

markg85 avatar Dec 27 '17 18:12 markg85

Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose)

This should not be a problem, as long as it is implemented in a separate function.

ararunaufc avatar Nov 07 '18 05:11 ararunaufc

I tried with and boost::flat_map<char, node*> a couple months ago and the performances weren't really good (but I don't have the number any more unfortunately)

But it's possible to work that out, instead of converting the char to int you try both to_upper(char) -> int and to_lower(char) -> int paths. The main problem is really multi-bytes UTF-8 strings.

UTF-8 is a mess to work with. For example, a single accented letter like "á" can be represented in at least two ways: by a single codepoint (U+00E1) or by two codepoints (U+0061 and U+0301).

It will be HARD!

ararunaufc avatar Nov 07 '18 05:11 ararunaufc