sonic
How does sub-string search work?
I am currently using the ruby-client, following the README examples.
I have added this data to the index: `ingest.push('users', 'all', 1, 'Alexander Tipugin')`
Then I run these searches:
- `alexander`: :heavy_check_mark:
- `alex`: :heavy_check_mark:
- `lexander` (first character removed): :heavy_check_mark:
- `exander` (first 2 characters removed): :x:
- `exander Tipugin`: :x:
What is the rule for substring search?
See https://github.com/valeriansaliou/sonic/issues/166#issuecomment-517920358 🙂
Wow, so Sonic only supports search by start-of-word. I think this should be added to the README documentation.
Thanks.
I wonder if searching by end-of-word can be solved by creating an index containing each word written with its letters in reverse order: rednaxela, as in the above example. This index would be used for cases when the wildcard is at the beginning of the word: *exander, as in the above example.
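The reversed-word idea can be sketched in plain Ruby. This is an in-memory toy, not the Sonic API; the helper names are made up for illustration:

```ruby
# Toy sketch: store every word a second time with its letters reversed,
# so a leading-wildcard query like "*exander" becomes an ordinary
# prefix search on the reversed index.
def build_reversed_index(words)
  words.each_with_object({}) do |word, index|
    index[word.reverse] = word # "alexander" is stored under "rednaxela"
  end
end

def suffix_search(reversed_index, suffix)
  needle = suffix.reverse # "exander" -> "rednaxe"
  reversed_index.select { |key, _| key.start_with?(needle) }.values
end

index = build_reversed_index(["alexander", "alex"])
suffix_search(index, "exander") # => ["alexander"]
```

In a real deployment the reversed forms would simply live in a second Sonic bucket, and the client would reverse the query term before searching it.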
That would indeed work yes!
I think that searching with a trailing wildcard could be done in Sonic as follows (maybe this is already done, or old news to others). Say we have 4 words in 4 documents: "acu" in "document0", "acum" in "document1", "acumula" in "document2", and "acumulator" in "document3". The normal index would be:
- "acu" -> "document0";
- "acum" -> "document1";
- "acumula" -> "document2";
- "acumulator" -> "document3". The index for training wildcard search would be:
- "acu" -> "document0, document1, document2, document3";
- "acum" -> "document1, document2, document3";
- "acumula" -> ", document2, document3";
- "acumulator" -> "document3".
In this way, a search for "acum*" will return "document1, document2, document3", which is correct.
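The scheme above can be sketched as a toy in-memory index in plain Ruby (hypothetical helper, not part of Sonic): each word maps to the documents of every word it is a prefix of.

```ruby
# Toy trailing-wildcard index: for each word, collect the documents of
# all words that start with it (including itself).
def build_trailing_index(word_to_doc)
  word_to_doc.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(word, _doc), index|
    word_to_doc.each do |other, other_doc|
      index[word] << other_doc if other.start_with?(word)
    end
  end
end

words = { "acu"        => "document0",
          "acum"       => "document1",
          "acumula"    => "document2",
          "acumulator" => "document3" }

index = build_trailing_index(words)
index["acum"] # => ["document1", "document2", "document3"]
```

A search for "acum*" then reduces to a single exact lookup of "acum" in this index.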
Of course, this ordering has to be done before loading the data into the DB, but this is a small price to pay for the speed and small memory footprint Sonic shows.
The same can be applied for leading wildcard search, with words having letters in reverse order.
For searches like "acu*la", one only has to take the intersection of the identifiers returned by the prefix search with those returned by the suffix search.
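The intersection step can be sketched the same way, again over a toy in-memory word list rather than the Sonic API:

```ruby
# "acu*la" = words starting with "acu" AND ending with "la":
# intersect the prefix hits with the suffix hits.
words = { "acu"        => "document0",
          "acum"       => "document1",
          "acumula"    => "document2",
          "acumulator" => "document3" }

prefix_hits = words.select { |w, _| w.start_with?("acu") }.values
suffix_hits = words.select { |w, _| w.end_with?("la") }.values

prefix_hits & suffix_hits # => ["document2"]
```

With the two extra buckets described above, `prefix_hits` would come from the trailing-wildcard bucket and `suffix_hits` from the reversed-word bucket; Ruby's `Array#&` stands in for the final merge.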
This approach would only need a new bucket for the prefix index, and another one for suffix index.
Valerian, what do you think? My use case is that I have several dictionaries I want to index, so the index data never has to be updated; I only need to read the identifiers, but I need somewhat complex search (the raw data I have is 3 - 5 GB).