
Support for relaxed query terms against indexed fields

Open mattweber opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

I am unable to use relaxed query matching such as wildcards, regular expressions, and fuzzy matching against indexed fields and inside their related query operators such as phrase/near/onear.

Describe the solution you'd like

  • Add support for relaxed query terms against indexed fields.
  • Ideally support analysis where possible (wildcard terms can be lowercased, have character normalization, etc).
  • Add the ability to set the max number of term expansions where hitting that limit will be an error or a signal to stop expanding.
  • Be smart about duplicated relaxed terms, for example ((a* NEAR b) OR (a* ONEAR c)) will only perform the expensive term dictionary scan for a* once.

Describe alternatives you've considered

  • Using attributes which are in-memory and don't work with phrase/near/onear.
  • Doing the expansion outside of the engine or in a query component.

Additional context

Wildcards and proximity operations against large free-text fields are a very popular scenario in enterprise search use cases.

mattweber, Feb 09 '23 19:02

Refer to https://swtch.com/~rsc/regexp/regexp4.html

bratseth, Mar 13 '23 10:03

We have a similar situation where fuzzy matching against an indexed array<string> field in streaming mode would be very convenient.

If we try fuzzy matching directly, we get this error:

"FUZZY(waste management,1,0,false) toc_label:waste management field is not a string attribute"

So then we could try n-gram matching, however:

n-gram matching is not supported for streaming search

We could try substring/prefix matching, which is a slight improvement but still doesn't handle typos.
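To make the typo problem concrete, here is a small Python sketch (not Vespa code) comparing a prefix check with an edit-distance check. The helper names `edit_distance` and `fuzzy_match` are made up for this example, and the default of one allowed edit is an assumption chosen to mirror the `1` in the `FUZZY(waste management,1,0,false)` query above.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_match(query, value, max_edits=1):
    # Case-insensitive comparison; max_edits=1 is an assumed default.
    return edit_distance(query.lower(), value.lower()) <= max_edits


stored = "waste management"
typo = "waste managment"  # one missing "e"

prefix_hit = stored.startswith(typo)   # prefix matching misses the typo
fuzzy_hit = fuzzy_match(typo, stored)  # one edit away, so this matches
```

Here `prefix_hit` is `False` while `fuzzy_hit` is `True`, which is the gap that substring/prefix matching leaves open.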

So our only other option currently is a synthetic string attribute field defined outside the document:

field myStringArrayAttribute type array<string> {
    indexing: input myStringArray | attribute
}

But then the string field would be stored in memory, significantly increasing memory usage and defeating the point of using streaming mode. Is that understanding correct?

It would be great if Vespa could support either of the following to help in this situation:

  1. n-gram support in streaming mode. I've opened an issue here: https://github.com/vespa-engine/vespa/issues/33051
  2. fuzzy matching support for indexed fields.

Alexander-Mark, Feb 06 '25 02:02