[Feature] Add NGRAM bloom filter index to speed up like queries.
Search before asking
- [X] I had searched in the issues and found no similar issues.
Description
To speed up LIKE queries we pushed the like function down to the storage layer in PR #10355, which yields a 2x~3x performance gain, whether vectorized or not. But we want to go the extra mile and make it even faster with less resource overhead. Based on that, we are going to implement a new index for LIKE queries.
We have researched several solutions, such as pg_trgm from PostgreSQL, ngrambf from ClickHouse, and FST from Elasticsearch. Since Doris already has a bloom filter index, and considering complexity, feature scope, and compatibility, we will take the same approach as ClickHouse's ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): the input column string is split into n-grams (the first parameter is the n-gram size), which are then stored in a bloom filter. During a query, the LIKE pattern is also split into n-grams and checked against the bloom filter; granules whose filter cannot contain all of the pattern's n-grams are skipped.
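To make the write/read symmetry concrete, here is a minimal Python sketch of the ngrambf idea. This is illustrative only, not Doris or ClickHouse code: the real ngrambf_v1 uses a compact bit array and seeded hash functions, while this toy version uses md5 for brevity.

```python
import hashlib

def ngrams(s: str, n: int = 3) -> set:
    """Split a string into its n-grams."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

class NgramBloomFilter:
    """Toy ngram bloom filter: fixed bit array, seeded hashes per gram."""

    def __init__(self, size_in_bits: int = 2048, num_hashes: int = 2, n: int = 3):
        self.bits = bytearray(size_in_bits)  # one byte per "bit", for simplicity
        self.num_hashes = num_hashes
        self.n = n

    def _positions(self, gram: str):
        # One position per hash function; seed the hash to get independent positions.
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{gram}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.bits)

    def add(self, value: str) -> None:
        # Write path: index every n-gram of the stored column value.
        for gram in ngrams(value, self.n):
            for pos in self._positions(gram):
                self.bits[pos] = 1

    def may_contain(self, pattern: str) -> bool:
        # Read path: LIKE '%pattern%' can only match a granule if every
        # n-gram of the pattern is (possibly) present in its filter.
        return all(self.bits[pos]
                   for gram in ngrams(pattern, self.n)
                   for pos in self._positions(gram))
```

Note that `may_contain` can return false positives (standard bloom filter behavior), so surviving granules still need exact LIKE evaluation; it never returns false negatives, so skipped granules are always safe to skip.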
For Doris, here are the details:
- Reuse the existing bloom filter index read/write process, so the storage layer will be unaffected.
- Add a new kind of bloom filter index.
- Add a new algorithm type, NGRAM_BLOOM_FILTER, which extracts the n-grams and computes the bloom filter.
- For the new algorithm, the HashStrategy will follow ClickHouse's.
- Queries will use the index to filter pages for LIKE queries when an ngram bloom filter exists, building on #10355.
- Support adding the index for historical data: `alter table example_db.table3 add index idx_ngrambf(username) using NGRAM_BF(3, 256) comment 'username ngram_bf index'`.
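The page-filtering point above can be sketched as follows. Names like `Page` and `like_scan` are hypothetical, and the per-page gram set stands in for the page's ngram bloom filter; the point is only that exact LIKE evaluation runs solely on pages that survive the gram check.

```python
from dataclasses import dataclass

@dataclass
class Page:
    rows: list
    grams_present: set  # stand-in for the page's ngram bloom filter

def build_page(rows, n=3):
    """Index a page: collect the n-grams of every row value."""
    grams = set()
    for r in rows:
        grams |= {r[i:i + n] for i in range(len(r) - n + 1)}
    return Page(rows=rows, grams_present=grams)

def like_scan(pages, pattern, n=3):
    """Evaluate LIKE '%pattern%' with page skipping."""
    pattern_grams = {pattern[i:i + n] for i in range(len(pattern) - n + 1)}
    hits = []
    for page in pages:
        # Skip the page unless every n-gram of the pattern may be present.
        # (A pattern shorter than n yields no grams, so we fall back to scanning.)
        if not pattern_grams <= page.grams_present:
            continue
        # Exact LIKE evaluation only on the surviving candidate pages.
        hits.extend(r for r in page.rows if pattern in r)
    return hits
```

The skipped pages are never decompressed or scanned, which is where the savings over pure function pushdown come from.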

That's all, thanks.
Use case
No response
Related issues
No response
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@compasses have you researched tokenbf_v1 in clickhouse? It's also useful for simple fulltext search. I'd like to collaborate with you and take tokenbf_v1.
Yes, tokenbf works for exact matches based on tokenization. It may be useful for English, but for Chinese I think you need other tools to do the tokenization.
BTW, I think it's worth adding tokenbf. Our code will be ready soon.
@compasses great! Looking forward to your PR.