k-NN
Searching for irrelevant data also returns results.
Currently (using nmslib), a search for any vector always returns the same number of results, which is more than the requested k (k results are returned from each shard/segment).
If the query vector is incorrect or not at all related to the indexed data, the behaviour remains the same, whereas the expected behaviour would be to return no results.
Can we have a field that accepts a minimum similarity score, so that only results scoring above it are returned, like the one provided in Elasticsearch?
Or a range that specifies how close a result's score must be to the best-matched result, so that fewer results are returned.
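Until something like this exists server-side, both ideas can be approximated on the client after the search returns. The following is only a minimal sketch, not a k-NN plugin feature: `filter_hits`, `min_score`, and `max_gap` are hypothetical names, and the hits are assumed to be the usual list of dicts with a `_score` key taken from the search response.

```python
from typing import Any, Dict, List, Optional


def filter_hits(
    hits: List[Dict[str, Any]],
    min_score: Optional[float] = None,
    max_gap: Optional[float] = None,
) -> List[Dict[str, Any]]:
    """Keep hits that pass an absolute and/or a relative score cutoff.

    min_score: drop hits whose score is below this value.
    max_gap:   drop hits whose score is more than this far below the best hit.
    """
    if not hits:
        return []
    best = max(h["_score"] for h in hits)
    kept = []
    for h in hits:
        score = h["_score"]
        if min_score is not None and score < min_score:
            continue  # fails the absolute threshold
        if max_gap is not None and best - score > max_gap:
            continue  # too far behind the best match
        kept.append(h)
    return kept


# Hypothetical hits, shaped like response["hits"]["hits"]
hits = [
    {"_id": "a", "_score": 0.90},
    {"_id": "b", "_score": 0.85},
    {"_id": "c", "_score": 0.50},
]
close_to_best = filter_hits(hits, max_gap=0.1)
```

Note that the relative cutoff alone never returns an empty list, because the best hit always passes it. Returning no results for an unrelated query still needs the absolute threshold.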
You can provide a score threshold with min_score. Example query:
{
  "min_score": ".7",
  "query": {
    "bool": {
      "should": [
        {
          "script_score": {
            "query": {
              "neural": {
                "_fulltext_vectorized": {
                  "query_text": "frontier",
                  "model_id": "DtXRZ4kBj7tu-c6vE5q_",
                  "k": 100
                }
              }
            },
            "script": {
              "source": "_score * 1.5"
            }
          }
        }
      ]
    }
  }
}
@juntezhang That worked! The score for ANN search currently ranges from 0 to 1. Can you suggest how we should determine a minimum score for ANN search? We are currently facing an issue where searches return irrelevant results as well.
What I found is that scores below 0.4 are irrelevant. But you can run some experiments on your data and model to see what works best for you.
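One way to run such an experiment is to collect scores for hits you have judged relevant and for hits you have judged irrelevant (for example, from nonsense queries), then pick the cutoff that separates them best. This is only a rough sketch under those assumptions; `pick_min_score` and the sample scores are made up for illustration and are not part of OpenSearch.

```python
from typing import List, Tuple


def pick_min_score(relevant: List[float], irrelevant: List[float]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the cutoff that best separates the two lists.

    A hit is kept when its score is >= threshold. Accuracy is the fraction of
    hits handled correctly (relevant kept, irrelevant dropped).
    """
    total = len(relevant) + len(irrelevant)
    if total == 0:
        raise ValueError("need at least one labeled score")
    best_threshold, best_accuracy = 0.0, -1.0
    # Only observed scores need to be tried as candidate cutoffs.
    for t in sorted(set(relevant) | set(irrelevant)):
        kept_relevant = sum(1 for s in relevant if s >= t)
        dropped_irrelevant = sum(1 for s in irrelevant if s < t)
        accuracy = (kept_relevant + dropped_irrelevant) / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Made-up scores from manually judged hits
threshold, accuracy = pick_min_score(
    relevant=[0.8, 0.7, 0.6],
    irrelevant=[0.3, 0.4, 0.45],
)
```

If the resulting accuracy is well below 1.0, the relevant and irrelevant scores overlap and no single min_score will cleanly separate them for that data and model.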
@ankitas3 Please see if the solution provided by @juntezhang works for your use case. If it doesn't, please cut a feature request to support the min score functionality.
If there are multiple queries in the request, min_score will be applied once the scores are combined for the documents. If your use case is more complex and you want the k-NN queries themselves to have a certain min score, feel free to cut a feature request.
Also, I will move this issue to the k-NN repo, where the change needs to be made.
@navneet1v Currently my requirement is to minimise the result set when using k-NN by removing the irrelevant results, and also to return no results when none of them are relevant. For this, as suggested by @juntezhang, I used a min_score of 0.4. I also tried testing with different min scores, but that did not work well.
I have been getting results even when searching for text that makes no sense (e.g. khgcvyuui) or text that is not in the dataset, with max scores of 0.46. Setting a min score of, say, 0.47 removes the results for such keywords, but it also returns no results when the text is correct but the overall intent of the query does not match the results well.
My only concern now is how this min_score should be determined, or whether there is any other way to achieve this.
@navneet1v @juntezhang We are facing the same issue with OpenSearch neural search. In our production environment, users are continuously adding more records, which may keep changing the desired minimum score, so there is no way to come up with a standard minimum score. Any solution to this problem would be greatly appreciated, as I am sure many customers are leaning towards the neural plugin!
@savanbthakkar Even with the neural plugin you will see irrelevant results, because the neural plugin uses the k-NN plugin under the hood.
There is no built-in solution for this as of now. I would recommend converting this issue into a feature request so that it can be picked up.
I believe this feature should address the problem: https://github.com/opensearch-project/k-NN/issues/814. But deciding the exact distance/score to pass as the filter is tricky.
Closing this for now, as the issue can be resolved with min_score.
Radial search feature: https://github.com/opensearch-project/k-NN/issues/814