
[FEATURE] Support for setting a default min/max score for the upcoming Normalization and Score Combination feature

Open SeyedAlirezaFatemi opened this issue 2 years ago • 3 comments

Is your feature request related to a problem?

Related to RFC. The current problem with the RFC is that when we combine scores from different queries (e.g. BM25 and kNN), we need the min and max score of each query part. However, when using approximate kNN, we cannot accurately calculate the min score unless we do an exact kNN search on the index, which is not feasible. This leads to inconsistent score normalization, particularly when using pagination.

What solution would you like?

As discussed in detail in the RFC, one solution is to rely on the statistics we get from the documents we see during the current query. However, in specific scenarios where the min score can be known, we can do better. For example, when using BM25 or Cosine similarity in kNN, the user can optionally define the min score in the query to be 0 and -1, respectively.

By allowing the user to optionally define a min/max score in the query for normalization, we can ensure consistent score normalization across different queries for specific scenarios, particularly when using pagination. This would improve the accuracy and reliability of the search results for users.

Here is an example of the pagination inconsistency we get with the general solution. Assume a query that consists of a text match query and a kNN query, normalized with the formula $x_{normalized} = (x - min) / (max - min)$, and a page size of 10. Assume the top 10 kNN scores are between 1 and 0.9 and the scores for the rest of the documents drop to 0. On the first page the observed min is 0.9, but on the next page it becomes 0, so the normalized score of the same document changes drastically between pages. The result can be pagination inconsistency, with missing or duplicated results.
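To make the example concrete, here is a minimal Python sketch of the effect. It is not the plugin's implementation: the function name `min_max_normalize` and the score values are made up for illustration, and only the formula above is taken from this issue.

```python
def min_max_normalize(scores):
    """Min-max normalization using only the scores seen in the current window."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate window: every score is identical.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Illustrative kNN scores: the top 10 lie between 1 and 0.9, the rest drop to 0.
page_1_window = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.90]
page_2_window = page_1_window + [0.0] * 10  # window grows when page 2 is requested

# The same document (raw score 0.95, index 5) gets two different normalized scores.
score_on_page_1 = min_max_normalize(page_1_window)[5]  # observed min is 0.90, result is about 0.5
score_on_page_2 = min_max_normalize(page_2_window)[5]  # observed min is 0.0, result is about 0.95

print(score_on_page_1, score_on_page_2)
```

Because the normalized kNN score of the same document moves from about 0.5 to about 0.95, its combined rank relative to the text-match results can change between page requests.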

SeyedAlirezaFatemi avatar Apr 06 '23 13:04 SeyedAlirezaFatemi

@SeyedAlirezaFatemi Thanks for creating the issue.

navneet1v avatar Apr 06 '23 16:04 navneet1v

@navneet1v @martin-gaievski

I noticed that in the "An Analysis of Fusion Functions for Hybrid Retrieval" paper, they also mention a min-max normalization method ($𝜙_{TMM}$, Equation 4) that uses the theoretical minimum of a function. "As an example, when $𝑓_{LEX}$ is BM25, then its infimum is 0. When $𝑓_{SEM}$ is cosine similarity, then that quantity is −1."

They also mention: "Interestingly, the behavior of $𝜙_{TMM}$ appears to be more robust to the data distribution—its peak remains within a small neighborhood as we move from one dataset to another. We believe the reason $𝜙_{TMM}$-normalized scores are more stable is because it has one fewer data-dependent statistic in the transformation (i.e., minimum score in the retrieved set is replaced with minimum feasible value regardless of the candidate set)."

So it would be really nice to have this feature: define a default min value for the normalization and take the max from the data.
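As a rough sketch of what I mean (not a proposed API: `THEORETICAL_MIN` and `tmm_normalize` are hypothetical names, and the scores are made up), the min is fixed to the theoretical lower bound of the scoring function and only the max comes from the retrieved set:

```python
# Theoretical lower bounds quoted from the paper: 0 for BM25, -1 for cosine similarity.
THEORETICAL_MIN = {"bm25": 0.0, "cosine": -1.0}


def tmm_normalize(scores, kind):
    """Min-max normalization with a fixed theoretical min and a data-dependent max."""
    lo = THEORETICAL_MIN[kind]
    hi = max(scores)
    if hi == lo:
        # Degenerate case: the best observed score equals the lower bound.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Illustrative cosine scores: a top-10 window, then a deeper window that also sees zeros.
top_10 = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.90]
deeper = top_10 + [0.0] * 10

# The document with raw score 0.95 (index 5) is normalized identically in both windows,
# because neither the fixed min (-1) nor the observed max (1.0) changed.
shallow_score = tmm_normalize(top_10, "cosine")[5]
deep_score = tmm_normalize(deeper, "cosine")[5]

print(shallow_score, deep_score)
```

This stays stable only as long as the top-scoring document is inside the window, which is the case for any window that starts at the first result.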

SeyedAlirezaFatemi avatar Jul 19 '23 14:07 SeyedAlirezaFatemi

@SeyedAlirezaFatemi thanks for providing this info. I will look into this. We are still in the development phase of the original scope.

navneet1v avatar Jul 19 '23 16:07 navneet1v

@SeyedAlirezaFatemi, is the inconsistent pagination result the main reason for supporting this? Even with the customer-provided min/max score, the inconsistency in pagination will still occur. There's an ongoing project aimed at improving pagination consistency for hybrid search. It would be great if you could take a look at #933 and share your thoughts on whether this feature would still provide value.

heemin32 avatar Nov 20 '24 22:11 heemin32

The way we implement pagination will not eliminate the problem described in this issue. We allow the user to set the size of the pagination window with the new parameter pagination_depth, but that window will often be smaller than the number of documents that actually match. For instance, knn and neural queries give some positive score to every document in the index. So technically this request makes sense, although I'm not sure how often it is needed in real-life use cases.

@SeyedAlirezaFatemi did you have a chance to review @heemin32's question and the pagination RFC for hybrid query mentioned above, https://github.com/opensearch-project/neural-search/issues/933?

martin-gaievski avatar Jan 09 '25 02:01 martin-gaievski