Filter ElasticSearch results by min_score

Open t-charura opened this issue 2 years ago • 4 comments

Problem: I want to retrieve all relevant (similar) documents from the ElasticsearchDocumentStore based on the _score, using the EmbeddingRetriever (I am not using the Reader). Before the search, I don't know how many relevant documents exist. To make sure that I retrieve all relevant entries from the ElasticsearchDocumentStore, I need to set top_k=10000 or higher and filter the results afterwards, keeping only documents with a _score higher than x. Retrieving this many documents takes several seconds.
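For illustration, the client-side workaround described above amounts to roughly the following. This is only a sketch: the function name `filter_hits_by_score` is hypothetical, and the sample hits are made-up stand-ins that mimic the `_score` field of Elasticsearch search hits.

```python
from typing import Any, Dict, List


def filter_hits_by_score(hits: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep only hits whose `_score` is strictly greater than `min_score`."""
    return [hit for hit in hits if hit["_score"] > min_score]


# Made-up stand-in for the hits returned by a search run with a very large top_k.
hits = [
    {"_id": "a", "_score": 1.92},
    {"_id": "b", "_score": 1.40},
    {"_id": "c", "_score": 1.05},
]

# Only the hits above the threshold survive; all others were fetched for nothing.
relevant = filter_hits_by_score(hits, min_score=1.3)
```

The cost is that every hit is transferred and deserialized before most of them are thrown away.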

Solution: Filtering query results by a minimum score value is already supported by the Python Elasticsearch client. You could add another parameter (min_score), similar to top_k, and add it to the body that you pass to client.search(). See my example:

body = { "size": top_k, "min_score": min_score, "query": self._get_vector_similarity_query(query_emb, top_k) }

I changed the body in the function def query_by_embedding(...) in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents that have a _score higher than min_score.

Additional context: If the user wants to filter the results by the cosine similarity metric, the min_score parameter needs to be scaled appropriately before it is used in the body.
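A minimal sketch of such a scaling, under a stated assumption: the vector query scores each document as its cosine similarity plus a constant offset (Elasticsearch rejects negative scores, so an offset is commonly added). The function name `cosine_to_min_score` and the default offset of 1.0 are hypothetical; the actual offset depends on the script that the document store uses and has to be checked there.

```python
def cosine_to_min_score(cosine_threshold: float, offset: float = 1.0) -> float:
    """Translate a cosine-similarity threshold into the raw _score scale.

    Assumes the raw score is `cosine_similarity + offset`. The default
    offset is a placeholder, not a value taken from the document store.
    """
    if not -1.0 <= cosine_threshold <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    return cosine_threshold + offset


# A cosine threshold of 0.5 becomes a raw min_score of 1.5 under this assumption.
raw_min_score = cosine_to_min_score(0.5)
```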

t-charura avatar Mar 23 '22 12:03 t-charura

Hi @Schokomensch - do I understand correctly that you have already made these changes? Would you like to create a PR and we can have a look?

I can imagine this being a useful optional addition. So if you provide the parameter min_score it uses it, if not it defaults to the top_k. Does that make sense?

TuanaCelik avatar Mar 24 '22 11:03 TuanaCelik

Hi @TuanaCelik, so far I have implemented these changes by creating my own custom class that overrides some methods of the ElasticsearchDocumentStore and EmbeddingRetriever. I will create a proper PR at the beginning of next week.

Within Elasticsearch the min_score filter is applied only after the top_k (size) filter already reduces the results. Therefore, I would suggest that whenever the user provides the min_score without setting the top_k parameter, I will set top_k=10000, which is the maximum value that Elasticsearch allows for search results (if you want to set top_k>10000 you would need to paginate your search results). The default value for min_score would be 0, since Elasticsearch does not allow None or False values within the (request) body.
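The proposed defaults could be sketched roughly as follows. The names `build_search_body`, `MAX_RESULT_WINDOW`, and `DEFAULT_TOP_K` are hypothetical, as is the fallback of 10 when neither parameter is given; `query` stands for whatever query dictionary the document store already builds.

```python
from typing import Any, Dict, Optional

MAX_RESULT_WINDOW = 10_000  # the limit on search results quoted above
DEFAULT_TOP_K = 10          # hypothetical fallback when neither parameter is set


def build_search_body(
    query: Dict[str, Any],
    top_k: Optional[int] = None,
    min_score: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a search body with the defaults proposed in this comment."""
    if top_k is None:
        # With min_score but no top_k, ask for as many hits as allowed.
        size = MAX_RESULT_WINDOW if min_score is not None else DEFAULT_TOP_K
    else:
        size = top_k
    if size > MAX_RESULT_WINDOW:
        raise ValueError("top_k above 10000 requires paginating the results")
    return {
        "size": size,
        # The body must not carry None, so fall back to 0.
        "min_score": 0.0 if min_score is None else min_score,
        "query": query,
    }


body = build_search_body({"match_all": {}}, min_score=1.5)
```

Raising on top_k above the limit is one possible design choice; paginating would be the alternative mentioned above.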

t-charura avatar Mar 26 '22 15:03 t-charura

@Schokomensch This sounds good to me. Once you have the PR we can have a proper look at your implementation too. When you're ready, link the PR to this Issue so that we have a nice timeline of the discussions. Looking forward to it 👍🏾

TuanaCelik avatar Mar 29 '22 12:03 TuanaCelik

@Schokomensch feel free to request a review from me and @TuanaCelik on your PR.

tstadel avatar Apr 13 '22 11:04 tstadel

+1

liorshk avatar Mar 06 '23 14:03 liorshk