
[Schema streaming mode] Enhance rank calculation for substring search

Open akolhun opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

  • Given the schema as:
document test {
    field description type string {
        indexing: summary | index
        match: substring
    }
}
  • And a document is created with description=environmental
  • Then the following two search requests return the matching document with exactly the same score (relevance=0.38):
select * from test where description contains 'environment'
select * from test where description contains 'env'

Describe the solution you'd like

Considering the sample above, the request with search_term=environment should have a higher score than the request with search_term=env.

akolhun avatar Feb 01 '24 17:02 akolhun

Isn't it a bug? Vespa's documentation says: "...Streaming search uses the same implementation of most features in Vespa, including ranking, matching and grouping, and supports the same features...". We are working on hybrid search in streaming mode, and we rely heavily on correct ranking. Thanks

jamesbond7 avatar Feb 02 '24 14:02 jamesbond7

The documentation is not perfect. There are a few differences, and we are currently trying to reduce the gap, but there will always be some. Streaming search has a larger feature set, especially for matching, because the raw text is always available there. Substring matching is a feature only available in streaming search. That is why improving the rank here is an enhancement, and not a bug.

baldersheim avatar Feb 02 '24 14:02 baldersheim

We would appreciate it if you could prioritize this issue.

jamesbond7 avatar Feb 03 '24 16:02 jamesbond7

@jamesbond7

Vespa index mode doesn't support substring, so you could not match env against environment - so this is obviously an enhancement and not a bug.

jobergum avatar Feb 06 '24 07:02 jobergum

Yes, this is a new feature, but one that makes sense. How about creating a separate rank feature ("matchAccuracy"?) that gives the term-weighted average of the closeness of the match of the term to the field? Could also potentially use it with multiple stems.

bratseth avatar Feb 07 '24 13:02 bratseth
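
To make the proposal above concrete, here is a minimal sketch in Python of what a term-weighted average of match closeness could look like. This is not Vespa code: "matchAccuracy" does not exist as a rank feature, and the function names, the whitespace tokenization, and the definition of closeness (length of the query term divided by the length of the field token that contains it) are all assumptions made for illustration.

```python
def term_closeness(term: str, field_text: str) -> float:
    """Hypothetical closeness: share of the best-matching field token
    that the query term covers; 0.0 when no token contains the term."""
    term = term.lower()
    if not term:
        return 0.0
    best = 0.0
    # Assumption: the field is tokenized on whitespace.
    for token in field_text.lower().split():
        if term in token:  # substring match, as with "match: substring"
            best = max(best, len(term) / len(token))
    return best


def match_accuracy(weighted_terms: dict[str, float], field_text: str) -> float:
    """Hypothetical term-weighted average of per-term closeness."""
    total_weight = sum(weighted_terms.values())
    if total_weight <= 0:
        return 0.0
    weighted_sum = sum(
        weight * term_closeness(term, field_text)
        for term, weight in weighted_terms.items()
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    doc = "environmental"
    print(match_accuracy({"environment": 1.0}, doc))
    print(match_accuracy({"env": 1.0}, doc))
```

Under these assumptions, the query term environment covers 11 of the 13 characters of environmental, while env covers only 3, so the first request would score higher than the second, which is the behavior requested in the issue.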