
Request: Option to skip bigram indexing across sentence boundaries

Open barryhunter opened this issue 3 years ago • 0 comments

Is your feature request related to a problem? Please describe.

On a standard index, you can use the phrase_boundary setting to stop normal phrase queries from matching across sentence boundaries.

But when using bigram_index, the indexed bigrams (in the dictionary) include bigrams that span sentence boundaries, so phrase searches now match across those boundaries.

Describe the solution you'd like

A way to have bigram indexing honour the phrase_boundary setting, and hence NOT index bigrams whose words are separated by characters in phrase_boundary.

bigram_index =  non_boundary  

I guess the ideal would be to also allow this with first_freq or both_freq, but personally I only want it with all. So non_boundary would mean 'all, except pairs separated by a phrase_boundary character'.

I suppose it could also be based on index_sp, but frankly that is less ideal, as it is not as configurable (e.g. you can't make a comma a phrase boundary!) ... and it's based on the HTML stripper, which may not be needed or wanted.
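To make the requested semantics concrete, here is a rough Python sketch of what 'all, except pairs separated by a phrase_boundary character' could mean. It is only an illustration, not how Manticore tokenizes or indexes: the `bigrams` function and its `boundary_chars` parameter are made up for this example, and it treats any boundary character found between two words as a break, which is simpler than the real phrase_boundary rules.

```python
import re

def bigrams(text, boundary_chars=""):
    """Return adjacent word pairs, skipping pairs split by a boundary char.

    Hypothetical helper for illustration only; not Manticore code.
    """
    pairs = []
    prev = None
    # Walk the text as alternating runs of word chars and non-word chars.
    for match in re.finditer(r"(\w+)|(\W+)", text.lower()):
        word, gap = match.groups()
        if gap is not None:
            # A boundary character in the gap resets the pairing,
            # so no bigram is emitted across it.
            if any(ch in boundary_chars for ch in gap):
                prev = None
            continue
        if prev is not None:
            pairs.append((prev, word))
        prev = word
    return pairs

text = "The cat sat. The dog ran"
print(bigrams(text))                      # behaves like bigram_index = all
print(bigrams(text, boundary_chars="."))  # proposed non_boundary behaviour
```

With `boundary_chars="."` the pair `("sat", "the")` is no longer produced, which is the pair that currently lets a phrase query match across the sentence break.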

Describe alternatives you've considered

Could perhaps use regexp_filter (or similar) to actually inject a (fake) word in the middle, to upset bigram indexing?

 regexp_filter = \b;\s => \t_SEP\t  

or something (untested!)
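A quick way to sanity-check the idea outside Manticore is to simulate it in Python. This is only a sketch under assumptions: it uses Python's `re` module rather than whatever regex engine regexp_filter uses, the `inject_separator` and `word_pairs` helpers are invented for this example, and it assumes `_` survives tokenization so that `_SEP` is indexed as a word.

```python
import re

def inject_separator(text):
    """Mimic the proposed filter: replace ';' plus whitespace with a fake word."""
    # Same pattern and replacement as the regexp_filter line above.
    return re.sub(r"\b;\s", "\t_SEP\t", text)

def word_pairs(text):
    """Naive bigram builder: adjacent lowercase word pairs (illustration only)."""
    words = re.findall(r"\w+", text.lower())
    return list(zip(words, words[1:]))

filtered = inject_separator("red fox; lazy dog")
print(word_pairs(filtered))
```

In this simulation the cross-boundary pair `("fox", "lazy")` disappears and is replaced by `("fox", "_sep")` and `("_sep", "lazy")`. The cost is extra dictionary entries containing the fake word, and the fake word also occupies a position in the document.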

Additional context

https://forum.manticoresearch.com/t/preventing-bigrams-across-sentence-boundary/799/2

barryhunter • May 25 '21 12:05