Filter by list in Elasticsearch retrieval
Hi all. My use-case is that I'm trying to filter documents on ES that have a list field (or in fact, a keyword field with multiple values) and I want to hit only the documents that have the same combination of values in that field as the list provided.
For example, let's say there are 4 documents:
doc1 = {"myfield": [x, y, z]}
doc2 = {"myfield": [x, y]}
doc3 = {"myfield": [x, z]}
doc4 = {"myfield": [x]}
filter = {"myfield": {"$eq": [x, y]}}
In this case the query should match only doc2. doc1 and doc4 should not match because they have more and fewer values, respectively, and doc3 has different values.
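To make the intended semantics concrete, here is a minimal sketch in plain Python. It does not touch Elasticsearch or Haystack; the helper name matches_exactly is made up for illustration, and x, y, z are written as strings.

```python
# Plain-Python illustration of the desired "exact list match" semantics.
# The documents mirror the example above; nothing here talks to
# Elasticsearch or Haystack, and matches_exactly is a hypothetical helper.
docs = {
    "doc1": {"myfield": ["x", "y", "z"]},
    "doc2": {"myfield": ["x", "y"]},
    "doc3": {"myfield": ["x", "z"]},
    "doc4": {"myfield": ["x"]},
}


def matches_exactly(doc, field, values):
    """Return True when the field holds exactly the given values, in any order."""
    return set(doc.get(field, [])) == set(values)


matched = [
    name for name, doc in docs.items()
    if matches_exactly(doc, "myfield", ["x", "y"])
]
print(matched)  # ['doc2']
```

Order is ignored here on purpose, since the question is about the combination of values rather than their sequence.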
It seems that the "$eq" operator with a list as value is a special case that is not supported in the ElasticsearchDocumentStore. The "$in" operator is not suitable either. I don't fully understand how Elasticsearch fields work but the documentation says that any field can have any number of values, like a list, but it seems that Haystack cannot query fields with more than one value.
I wonder if it is possible to achieve this effect by using the current haystack implementation. If so, I'd appreciate any help.
One solution that I've found in the ES documentation is the terms_set query with a minimum_should_match_script, as shown here: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-terms-set-query.html#terms-set-query-script
I've tried replacing the assertion inside document_stores.filter_utils.EqOperation.convert_to_elasticsearch() (line 421) with this code:
if isinstance(self.comparison_value, list):
    return {
        "terms_set": {
            self.field_name: {
                "terms": self.comparison_value,
                "minimum_should_match_script": {
                    "source": f"Math.max(params.num_terms, doc['{self.field_name}'].size())"
                }
            }
        }
    }
but I must be doing something wrong because it doesn't work...
Thanks in advance! :)
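For reference, the idea behind the snippet above can be sketched as a standalone function, together with a pure-Python simulation of what the script is meant to enforce. This is a sketch under assumptions, not Haystack code: the names exact_list_match_query and would_match are hypothetical, and the simulation assumes the field holds no duplicate values.

```python
def exact_list_match_query(field_name, values):
    """Build a terms_set query body requiring an exact match on a multi-valued field.

    Hypothetical helper; it only mirrors the dictionary from the snippet above.
    """
    return {
        "terms_set": {
            field_name: {
                "terms": list(values),
                "minimum_should_match_script": {
                    # Required matches = the larger of (number of query terms,
                    # number of values stored in the document's field).
                    "source": f"Math.max(params.num_terms, doc['{field_name}'].size())"
                },
            }
        }
    }


def would_match(doc_values, terms):
    """Simulate the script logic locally, assuming no duplicate values."""
    required = max(len(set(terms)), len(set(doc_values)))
    overlap = len(set(doc_values) & set(terms))
    return overlap >= required


for name, values in [("doc1", ["x", "y", "z"]), ("doc2", ["x", "y"]),
                     ("doc3", ["x", "z"]), ("doc4", ["x"])]:
    print(name, would_match(values, ["x", "y"]))
```

Taking the maximum covers both failure cases: a document with extra values raises the required count above the number of query terms, and a document with fewer values cannot reach the number of query terms.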
@Krak91 Apologies for my late reply, I haven't had the chance to work on the issue yet but it's on my list!
Hi @masci. Appreciate your response. Also, I think the query style is amazing, love it.
The snippet I added above actually works after all. I've made a pull request (https://github.com/deepset-ai/haystack/pull/2675), but I haven't had much time to test it thoroughly, and I think other places in the code might need adjusting too to keep things consistent. For example, I'm not sure whether the implicit
filter = {"myfield": [x, y]} will work the same way.
@Krak91 about the example filter = {"myfield": [x, y]}: that would default to "$in" (see this comment), so I guess we're good with your proposal.
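The distinction discussed here could look roughly like the sketch below, assuming the behavior described in this thread: an implicit list means "$in", while an explicit "$eq" with a list means an exact match. The function to_es_query is hypothetical and is not Haystack's actual filter conversion code.

```python
def to_es_query(field_name, condition):
    """Map a simplified filter condition to an Elasticsearch query body.

    Hypothetical helper covering only the cases discussed in this thread.
    """
    # Implicit list: treated as "$in", i.e. match any of the values.
    if isinstance(condition, list):
        return {"terms": {field_name: condition}}

    if isinstance(condition, dict) and "$eq" in condition:
        value = condition["$eq"]
        # Explicit "$eq" with a list: exact combination via terms_set.
        if isinstance(value, list):
            return {
                "terms_set": {
                    field_name: {
                        "terms": value,
                        "minimum_should_match_script": {
                            "source": f"Math.max(params.num_terms, doc['{field_name}'].size())"
                        },
                    }
                }
            }
        # Explicit "$eq" with a single value: plain term query.
        return {"term": {field_name: value}}

    raise ValueError(f"Unsupported condition for {field_name!r}: {condition!r}")


print(to_es_query("myfield", ["x", "y"]))
print(to_es_query("myfield", {"$eq": ["x", "y"]}))
```

With this split, the implicit form keeps its existing "any of these values" meaning, and only the explicit "$eq" form opts into the stricter exact-combination match.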