Support for relaxed query terms against indexed fields
Is your feature request related to a problem? Please describe.
I am unable to use relaxed query matching such as wildcards, regular expressions, and fuzzy matching against indexed fields and inside their related query operators such as phrase/near/onear.
Describe the solution you'd like
- Add support for relaxed query terms against indexed fields.
- Ideally, support analysis where possible (for example, wildcard terms can be lowercased and character-normalized).
- Add the ability to set a maximum number of term expansions, where hitting the limit either raises an error or acts as a signal to stop expanding.
- Be smart about duplicated relaxed terms. For example, `((a* NEAR b) OR (a* ONEAR c))` will only perform the expensive term dictionary scan for `a*` once.
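To make the last three points concrete, here is a minimal sketch in Python of what the requested behavior could look like. It is not Vespa code: `TermExpander`, `ExpansionLimitExceeded`, `max_expansions`, and `fail_on_limit` are hypothetical names, the term dictionary is a plain in-memory list assumed to hold lowercased terms, and wildcard matching is delegated to the standard library's `fnmatch`.

```python
from fnmatch import fnmatchcase


class ExpansionLimitExceeded(Exception):
    """Raised when a pattern expands to more terms than allowed (hypothetical)."""


class TermExpander:
    """Expands wildcard patterns against a term dictionary (illustrative only)."""

    def __init__(self, dictionary, max_expansions, fail_on_limit=True):
        # Assumption: dictionary terms are already analyzed (lowercased).
        self.dictionary = sorted(dictionary)
        self.max_expansions = max_expansions
        self.fail_on_limit = fail_on_limit
        self.scans = 0        # number of full dictionary scans performed
        self._cache = {}      # analyzed pattern -> expanded terms

    def expand(self, pattern):
        # Minimal "analysis": lowercase the pattern so A* and a* are equivalent.
        key = pattern.lower()
        # Duplicate relaxed terms in one query reuse the earlier scan result.
        if key in self._cache:
            return self._cache[key]
        self.scans += 1
        matches = []
        for term in self.dictionary:
            if fnmatchcase(term, key):
                if len(matches) == self.max_expansions:
                    if self.fail_on_limit:
                        raise ExpansionLimitExceeded(
                            f"{pattern!r} expands to more than "
                            f"{self.max_expansions} terms"
                        )
                    break  # limit reached: silently stop expanding
                matches.append(term)
        self._cache[key] = matches
        return matches


# ((a* NEAR b) OR (a* ONEAR c)): both operators ask for the same expansion.
expander = TermExpander(["apple", "apricot", "avocado", "banana"], max_expansions=10)
near_terms = expander.expand("a*")
onear_terms = expander.expand("a*")
print(near_terms, expander.scans)
```

With one expander per query, the second `expand("a*")` call is served from the cache, so the dictionary is scanned only once regardless of how many operators reference the same relaxed term.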
Describe alternatives you've considered
- Using attributes, which are in-memory and don't work with phrase/near/onear.
- Doing the expansion outside of the engine or in a query component.
Additional context
Wildcards and proximity operations against large free text fields are a very popular scenario in enterprise search use cases.
Refer to https://swtch.com/~rsc/regexp/regexp4.html
We have a similar situation where fuzzy matching against an indexed array<string> field in streaming mode would be very convenient.
If we try fuzzy matching directly, the query fails with:
"FUZZY(waste management,1,0,false) toc_label:waste management field is not a string attribute"
We could then try n-gram matching, however:
n-gram matching is not supported for streaming search
We could try substring/prefix matching, which is a slight improvement but still doesn't handle typos.
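As a small illustration of why prefix/substring matching falls short here, the following Python sketch compares it with an edit-distance check. This is not how Vespa implements fuzzy matching: `edit_distance` and `fuzzy_match` are hypothetical helpers, the sample labels are made up, and the maximum edit distance of 1 simply mirrors the `FUZZY(waste management,1,0,false)` query above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def fuzzy_match(query, value, max_edit_distance=1):
    """True if value is within max_edit_distance edits of query (case-insensitive)."""
    return edit_distance(query.lower(), value.lower()) <= max_edit_distance


# Hypothetical values of an array<string> field, and a query with a typo.
labels = ["waste management", "water management"]
typo = "waste managment"

prefix_hits = [label for label in labels if label.startswith(typo)]
substring_hits = [label for label in labels if typo in label]
fuzzy_hits = [label for label in labels if fuzzy_match(typo, label, 1)]

print(prefix_hits, substring_hits, fuzzy_hits)
```

The misspelled query is neither a prefix nor a substring of any label, so only the edit-distance check finds the intended value.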
The only other option currently is a synthetic string attribute field stored outside the document:
field myStringArrayAttribute type array<string> {
    indexing: input myStringArray | attribute
}
But then the string field would be stored in memory, significantly increasing memory usage and defeating the point of using streaming mode. Is that understanding correct?
It would be great if Vespa could support an option to help us in this situation:
- n-gram support in streaming mode. I've opened an issue here: https://github.com/vespa-engine/vespa/issues/33051
- fuzzy matching support for indexed fields.