haystack
Filter ElasticSearch results by min_score
Problem:
I want to retrieve all relevant (similar) documents from the ElasticsearchDocumentStore based on the _score, using the EmbeddingRetriever (I am not using the Reader). Prior to the search, I don't know how many relevant documents exist. To make sure that I retrieve all relevant entries from the ElasticsearchDocumentStore, I need to set top_k=10000 or higher and filter the results afterwards, keeping only documents with a _score higher than x. Retrieving this many documents takes several seconds.
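For illustration, the client-side workaround described above can be sketched roughly as follows. This assumes the hits are plain dictionaries carrying a "_score" key, as in a raw Elasticsearch response; the helper name filter_hits_by_score is hypothetical and not part of Haystack.

```python
from typing import Any, Dict, List


def filter_hits_by_score(hits: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep only hits whose _score is strictly higher than min_score."""
    return [hit for hit in hits if hit["_score"] > min_score]


# Made-up hits standing in for a large top_k=10000 result set.
hits = [
    {"_id": "a", "_score": 1.9},
    {"_id": "b", "_score": 1.2},
    {"_id": "c", "_score": 1.6},
]
relevant = filter_hits_by_score(hits, min_score=1.5)
```

The filtering itself is cheap; the cost comes from transferring all the hits before any of them can be discarded.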
Solution
Filtering query results by a minimum score value is already implemented in the Python Elasticsearch client. You could add another parameter (min_score), similar to top_k, and add it to the body that you use in client.search(). See my example:
body = { "size": top_k, "min_score": min_score, "query": self._get_vector_similarity_query(query_emb, top_k) }
I changed the body in the function query_by_embedding(...) in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents that have a _score of at least min_score.
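A minimal sketch of how the body could be built so that min_score stays optional. The helper build_search_body is hypothetical, and the placeholder query stands in for whatever _get_vector_similarity_query(...) returns; only the "size", "min_score" and "query" keys come from the example above.

```python
from typing import Any, Dict, Optional


def build_search_body(
    vector_query: Dict[str, Any],
    top_k: int,
    min_score: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a search body, adding min_score only when the caller provides it."""
    body: Dict[str, Any] = {"size": top_k, "query": vector_query}
    if min_score is not None:
        # Elasticsearch leaves out hits whose _score is below this value.
        body["min_score"] = min_score
    return body


# Placeholder query (assumption); the real one would be the vector similarity query.
example_query = {"match_all": {}}
body = build_search_body(example_query, top_k=10, min_score=1.5)
```

Leaving the key out entirely when no threshold is given avoids sending a null value in the request body.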
Additional context
In case the user wants to filter the results by the cosine similarity metric, the min_score parameter needs to be scaled appropriately before using it in the body.
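One way the scaling could look, as a sketch only. It assumes the similarity script adds a constant offset to the raw cosine value so that scores stay non-negative; the offset actually used has to be read from the query in your installed version. The function name and the score_offset parameter are hypothetical.

```python
def cosine_to_es_min_score(min_cosine: float, score_offset: float) -> float:
    """Translate a cosine similarity threshold into the _score scale.

    score_offset is the constant the similarity script adds to the raw
    cosine value (an assumption; check the script in your version).
    """
    if not -1.0 <= min_cosine <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    return min_cosine + score_offset


# Example: a cosine threshold of 0.5 with an assumed offset of 1.0.
es_min_score = cosine_to_es_min_score(0.5, score_offset=1.0)
```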
Hi @Schokomensch - do I understand correctly that you have already made these changes? Would you like to create a PR and we can have a look?
I can imagine this being a useful optional addition. So if you provide the parameter min_score, it uses it; if not, it defaults to top_k. Does that make sense?
Hi @TuanaCelik, so far I have implemented these changes by creating my own custom class that overrides some methods of the ElasticsearchDocumentStore and EmbeddingRetriever. I will create a proper PR at the beginning of next week.
Within Elasticsearch, the min_score filter is applied only after the top_k (size) filter has already reduced the results.
Therefore, I would suggest that whenever the user provides min_score without setting the top_k parameter, I will set top_k=10000, which is the maximum value that Elasticsearch allows for search results by default (if you want to set top_k>10000, you would need to paginate your search results). The default value for min_score would be 0, since Elasticsearch does not allow None or False values within the (request) body.
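The proposed defaults could be resolved roughly like this. It is a sketch of the suggestion above, not the eventual PR: the helper resolve_search_limits and the default_top_k value of 10 are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Largest result window Elasticsearch allows without pagination by default.
ES_MAX_RESULT_WINDOW = 10_000


def resolve_search_limits(
    top_k: Optional[int] = None,
    min_score: Optional[float] = None,
    default_top_k: int = 10,  # hypothetical default, for illustration only
) -> Tuple[int, float]:
    """Return the (size, min_score) pair to put into the request body."""
    if top_k is None:
        # With a score threshold but no explicit top_k, fetch as many hits as allowed.
        top_k = ES_MAX_RESULT_WINDOW if min_score is not None else default_top_k
    if top_k > ES_MAX_RESULT_WINDOW:
        raise ValueError("top_k above 10000 requires paginating the search results")
    # None is not accepted in the request body, so fall back to 0.
    effective_min_score = 0.0 if min_score is None else min_score
    return top_k, effective_min_score


size, threshold = resolve_search_limits(min_score=0.8)
```

An explicit top_k always wins, so existing callers that never pass min_score would keep their current behaviour.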
@Schokomensch This sounds good to me. Once you have the PR we can have a proper look at your implementation too. When you're ready, link the PR to this Issue so that we have a nice timeline of the discussions. Looking forward to it 👍🏾
@Schokomensch feel free to request a review from me and @TuanaCelik on your PR.
+1