sonic icon indicating copy to clipboard operation
sonic copied to clipboard

How sub-strings search work?

Open rjurado01 opened this issue 4 years ago • 5 comments

I am currently using the ruby-client following the README examples.

I have added this data to index: ingest.push('users', 'all', 1, 'Alexander Tipugin')

Then I make this searches:

  • alexander: :heavy_check_mark:
  • alex :heavy_check_mark:
  • lexander (removed first character): :heavy_check_mark:
  • exander (removed 2 first characters): :x:
  • exander Tipugin: :x:

What is the rule for substring search?

rjurado01 avatar Aug 22 '19 08:08 rjurado01

See https://github.com/valeriansaliou/sonic/issues/166#issuecomment-517920358 🙂

CvX avatar Aug 22 '19 09:08 CvX

Wow, so Sonic only support search by start-of-word. I think this sould be added to README documentation.

Thanks.

rjurado01 avatar Aug 22 '19 10:08 rjurado01

I wonder if searching by end-of-word can be solved by creating an index containing each word written with its letters in reverse order: rednaxela, as in the above example. This index would be used for cases when the wildcard is at the beginning of the word: *exander, as in the above example.

claudius108 avatar Jan 21 '20 07:01 claudius108

That would indeed work yes!

valeriansaliou avatar Jan 21 '20 08:01 valeriansaliou

I think that searching with trailing wildcard can be done in sonic as follows (maybe this is already done or old thing for others): Say we have 4 words in 4 documents: "acu" in "document0", "acum" in "document1", "acumula" in "document2", and "acumulator" in "document3". The normal index would be:

  • "acu" -> "document0";
  • "acum" -> "document1";
  • "acumula" -> "document2";
  • "acumulator" -> "document3". The index for training wildcard search would be:
  • "acu" -> "document0, document1, document2, document3";
  • "acum" -> "document1, document2, document3";
  • "acumula" -> ", document2, document3";
  • "acumulator" -> "document3".

In this way, search for "acum*" will return "document1, document2, document3", which is correct.

Of course that this ordering has to be made prior to loading the data in the DB, but this a small price to pay for the speed and small memory footprint sonic shows.

The same can be applied for leading wildcard search, with words having letters in reverse order.

For searches like "acu*la", the only thing that one has is to get the intersection of identifiers returned by the search by prefix with the ones returned by the search by suffix.

This approach would only need a new bucket for the prefix index, and another one for suffix index.

Valerian, what do you think? My use case is that I have more dictionaries I want to index, so no index data has to be updated, I only need to read the identifiers, but I need somehow complex search (the raw data I have is 3 - 5 GB).

claudius108 avatar Jan 23 '20 13:01 claudius108