
Filter by list in Elasticsearch retrieval

Open Krak91 opened this issue 3 years ago • 3 comments

Hi all. My use-case is that I'm trying to filter documents on ES that have a list field (or in fact, a keyword field with multiple values) and I want to hit only the documents that have the same combination of values in that field as the list provided.

For example, let's say there are 4 documents:

doc1 = {"myfield": [x, y, z]}
doc2 = {"myfield": [x, y]}
doc3 = {"myfield": [x, z]}
doc4 = {"myfield": [x]}

filter = {"myfield": {"$eq": [x, y]}}

In this case the query should match only doc2. Doc1 and doc4 should not match because they have more and fewer values, respectively, and doc3 has different values.
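To pin down the intended semantics, here is a minimal pure-Python sketch of the matching rule described above. It does not touch Elasticsearch or Haystack; the helper name `exact_matches` is made up for illustration, and the placeholders x, y, z are written as strings.

```python
from typing import Dict, List


def exact_matches(docs: Dict[str, List[str]], wanted: List[str]) -> List[str]:
    """Return the names of documents whose field holds exactly the wanted values.

    Order and duplicates are ignored, which is an assumption about how a
    multi-valued keyword field should be compared.
    """
    wanted_set = set(wanted)
    return [name for name, values in docs.items() if set(values) == wanted_set]


docs = {
    "doc1": ["x", "y", "z"],  # extra value, should not match
    "doc2": ["x", "y"],       # same combination, should match
    "doc3": ["x", "z"],       # different value, should not match
    "doc4": ["x"],            # missing value, should not match
}

matches = exact_matches(docs, ["x", "y"])
print(matches)  # ['doc2']
```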

It seems that the "$eq" operator with a list as its value is a special case that the ElasticsearchDocumentStore does not support. The "$in" operator is not suitable either. I don't fully understand how Elasticsearch fields work, but the documentation says that any field can hold any number of values, like a list. Haystack, however, does not seem able to query fields with more than one value.

I wonder if it is possible to achieve this effect by using the current haystack implementation. If so, I'd appreciate any help.

One solution that I've found in the ES documentation is the terms_set query with a minimum_should_match_script, as shown here: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl-terms-set-query.html#terms-set-query-script

I've tried replacing the assertion inside document_stores.filter_utils.EqOperation.convert_to_elasticsearch() (line 421) with this code:

if isinstance(self.comparison_value, list):
    return {
        "terms_set": {
            self.field_name: {
                "terms": self.comparison_value,
                "minimum_should_match_script": {
                    "source": f"Math.max(params.num_terms, doc['{self.field_name}'].size())"
                },
            }
        }
    }

but I must be doing something wrong because it doesn't work...
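For reference, the same query body can be sketched outside the Haystack class, together with a pure-Python stand-in for how the script decides a match. Both function names are hypothetical. The simulation assumes the field is a keyword field whose doc values are de-duplicated, and it only mirrors the terms_set rule (number of matching terms >= script result); it does not run against a real cluster.

```python
from typing import Any, Dict, List


def exact_list_match_query(field_name: str, values: List[Any]) -> Dict[str, Any]:
    """Build a terms_set query body for 'field holds exactly these values'."""
    return {
        "terms_set": {
            field_name: {
                "terms": values,
                "minimum_should_match_script": {
                    # Require as many matches as the larger of: the number of
                    # query terms, or the number of values stored in the doc.
                    "source": f"Math.max(params.num_terms, doc['{field_name}'].size())"
                },
            }
        }
    }


def simulate_match(doc_values: List[Any], terms: List[Any]) -> bool:
    """Pure-Python approximation of how terms_set evaluates one document."""
    matched = len(set(doc_values) & set(terms))        # terms found in the doc
    required = max(len(terms), len(set(doc_values)))   # what the script returns
    return matched >= required


query = exact_list_match_query("myfield", ["x", "y"])
docs = {"doc1": ["x", "y", "z"], "doc2": ["x", "y"], "doc3": ["x", "z"], "doc4": ["x"]}
hits = [name for name, values in docs.items() if simulate_match(values, ["x", "y"])]
print(hits)  # ['doc2']
```

Under these assumptions the script gives the expected result for all four example documents: doc1 needs 3 matches but has 2, doc3 and doc4 need 2 but have 1, and only doc2 reaches its threshold.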

Thanks in advance! :)

Krak91 avatar Jun 16 '22 19:06 Krak91

@Krak91 Apologies for my late reply, I haven't had the chance to work on the issue yet but it's on my list!

masci avatar Jun 22 '22 17:06 masci

Hi @masci. Appreciate your response. Also, I think the query style is amazing, love it.

The snippet I added above actually works after all. I've opened a pull request: https://github.com/deepset-ai/haystack/pull/2675. I haven't had much time to test it thoroughly, though, and I think other places in the code might need adjusting too, to keep things consistent. For example, I'm not sure whether the implicit filter = {"myfield": [x, y]} will work the same way.

Krak91 avatar Jun 22 '22 17:06 Krak91

@Krak91 Regarding the example filter = {"myfield": [x, y]}: that would default to $in (see this comment), so I guess we're good with your proposal.
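To illustrate the difference, here is a rough sketch of the "any of these values" behaviour that the implicit list form would get. It assumes $in is translated to an Elasticsearch terms query, which matches a document if at least one listed value is present. The function names are hypothetical and the matching is simulated in plain Python.

```python
from typing import Any, Dict, List


def in_filter_to_terms_query(field_name: str, values: List[Any]) -> Dict[str, Any]:
    """Assumed translation of an implicit list filter ($in) to a terms query."""
    return {"terms": {field_name: values}}


def matches_any(doc_values: List[Any], values: List[Any]) -> bool:
    """A terms query matches when the doc holds at least one of the values."""
    return bool(set(doc_values) & set(values))


docs = {"doc1": ["x", "y", "z"], "doc2": ["x", "y"], "doc3": ["x", "z"], "doc4": ["x"]}
in_query = in_filter_to_terms_query("myfield", ["x", "y"])
in_hits = [name for name, values in docs.items() if matches_any(values, ["x", "y"])]
print(in_hits)  # ['doc1', 'doc2', 'doc3', 'doc4']
```

So the implicit form keeps its "any of" meaning and matches all four example documents, while the explicit $eq with a list is the one that asks for the exact combination.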

masci avatar Aug 08 '22 13:08 masci