manticoresearch
Request: Option to skip bigrams indexing across sentence boundary
Is your feature request related to a problem? Please describe.
On a standard index, the `phrase_boundary` config option can be used to stop normal phrase queries from matching across sentence boundaries.
But when using `bigram_index`, the indexed bigrams (in the dictionary) include bigrams that span sentence boundaries. So phrase searches now cross those boundaries.
Describe the solution you'd like
A way to have bigram indexing honour the `phrase_boundary` setting, and hence NOT index bigrams whose two words are separated by chars listed in `phrase_boundary`.
bigram_index = non_boundary
I guess the ideal would be for it to combine with `first_freq` or `both_freq`, but personally I only want it with `all`. So `non_boundary` would mean 'all except pairs separated by a `phrase_boundary` char'.
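To make the intended semantics concrete, here is a rough Python sketch. It is not Manticore code and says nothing about how the indexer is actually implemented: the `PHRASE_BOUNDARY` set, the tokenizer, and the `bigrams` helper are all hypothetical, and `non_boundary` is the mode proposed in this request, not an existing option.

```python
import re

# Hypothetical stand-in for a phrase_boundary setting such as ". ? !"
PHRASE_BOUNDARY = {".", "?", "!"}


def bigrams(text, mode="all"):
    """Return adjacent word pairs.

    mode="all"          : every adjacent pair, ignoring punctuation.
    mode="non_boundary" : proposed behaviour, pairs separated by a
                          PHRASE_BOUNDARY character are skipped.
    """
    pairs = []
    prev = None
    # Split into words and single punctuation characters.
    for tok in re.findall(r"\w+|[^\w\s]", text.lower()):
        if not re.fullmatch(r"\w+", tok):
            # Punctuation: only a boundary char in non_boundary mode
            # breaks the chain of adjacent words.
            if mode == "non_boundary" and tok in PHRASE_BOUNDARY:
                prev = None
            continue
        if prev is not None:
            pairs.append((prev, tok))
        prev = tok
    return pairs
```

With the input `"Red fox. Blue sky"`, mode `all` yields the pairs `red fox`, `fox blue`, `blue sky`, while `non_boundary` drops `fox blue` and keeps the other two. The sketch only models the `all` case; how the idea would combine with `first_freq` or `both_freq` is left open.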
I suppose it could also be based on `index_sp`, but frankly that is less ideal, as it is not as configurable (e.g. you can't make a comma a phrase boundary!), and it is based on the HTML stripper, which may not be needed or wanted.
Describe alternatives you've considered
Could perhaps use `regexp_filter` (or similar) to inject a (fake) word in the middle, to upset bigram indexing?
regexp_filter = \b;\s => \t_SEP\t
or something like that (untested!)
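A quick way to see what that workaround would do to the pairs, sketched in Python. This only mimics the idea: Python's `re` stands in for the filter, the `\w+` tokenizer is an assumption and not Manticore's tokenizer, and `inject_separator` and `all_bigrams` are hypothetical helpers. Like the filter line above, it is untested against a real index.

```python
import re

SEPARATOR = "_SEP"  # the fake word from the filter example above


def inject_separator(text):
    """Python analogue of: regexp_filter = \\b;\\s => \\t_SEP\\t"""
    return re.sub(r"\b;\s", f"\t{SEPARATOR}\t", text)


def all_bigrams(text):
    """Every adjacent word pair, roughly what bigram_index = all would store."""
    words = re.findall(r"\w+", text.lower())
    return list(zip(words, words[1:]))


filtered = inject_separator("red fox; blue sky")
pairs = all_bigrams(filtered)
```

Here `pairs` no longer contains `fox blue`, but it does contain `fox _sep` and `_sep blue`, so the fake word ends up in the dictionary as the price of breaking the pair.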
Additional context
https://forum.manticoresearch.com/t/preventing-bigrams-across-sentence-boundary/799/2