haystack-core-integrations

Metadata search fields for OpenSearch document store

Open sanjayc2 opened this issue 8 months ago • 1 comments

Is your feature request related to a problem? Please describe.

I am building a RAG pipeline, using the OpenSearchDocumentStore as the vector store. I would like to use a custom query that filters by metadata fields in addition to using the $query_embedding. In other words, I would like a retriever that runs the usual embedding search on the content together with a full-text query on metadata fields. The Haystack metadata filter (used with the embedding retriever) does not work for my use case because its filtering functionality is too limited.

Describe the solution you'd like

I am not sure about this, but one solution might be to add the search_fields functionality that was available in Haystack 1.0 but is not present in Haystack 2.0. (By the way, when I passed a search_fields argument to the OpenSearchDocumentStore with Haystack 2.0, it did not throw an exception; if search_fields is not supported, I think an exception should be thrown.) Any other solution to my problem is also welcome. I should add that using the BM25Retriever for the full-text query and joining its results with those of an EmbeddingRetriever would not work for my use case. I would like to run the semantic search only on those document chunks whose file name matches a text string ("Coral Gold Resources" or "CoralGoldResources" in my example below); otherwise the search space is too large, since there are hundreds of files to search.

Also, please let me know of any workaround you recommend until the requested functionality (if you agree with it!) is productionized. For example, would the QdrantDocumentStore along with the QdrantHybridRetriever (I just came across this) work for my situation? P.S. I tried the Qdrant fastembed hybrid search (with sparse vector embedding, specifying meta_fields_to_embed), but the results were not good, because the sparse embedding search also searches the content, whereas I want it to search only the metadata.

Describe alternatives you've considered

I have tried to use search_fields (in Haystack 2.0) and tried various custom queries but none of them worked. The below embedding-based query works as expected:

custom_query_no_metadata = {
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": "$query_embedding",
                            "k": 100,
                        }
                    }
                }
            ]
        }
    }
}

The next query, however, does not retrieve any results. Here the file name is stored in a metadata field of each chunk; the actual file name is "CORALGOLDRESOURCES,LTD_05_28_2020-EX-4.1-CONSULTING AGREEMENT.md".

custom_query_meta_filter = {
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "file_path": "Coral Gold Resources"
                    }
                },
                {
                    "match":  {
                        "file_path": "CoralGoldResources"
                    }
                }
            ]
        }
    }
}

Additional context

I am using the following document store:

document_store = OpenSearchDocumentStore(
    hosts="http://localhost:9200",
    use_ssl=True,
    verify_certs=False,
    http_auth=awsauth,
    search_fields=["title", "file_path", "content"],
    similarity="cosine",
    embedding_dim=1024,
    recreate_index=True,
)

I am also using the following embedders and retriever:

document_embedder = SentenceTransformersDocumentEmbedder(
    model=embed_model,
    device=ComponentDevice.from_str("cuda:0"),
    trust_remote_code=True,  # for embeddings like nomic-ai/nomic-embed-text-v1
    meta_fields_to_embed=meta_fields_to_embed,
)

text_embedder = SentenceTransformersTextEmbedder(model=embed_model, device=ComponentDevice.from_str("cuda:0"))

embedding_retriever = OpenSearchEmbeddingRetriever(document_store=document_store)

The query pipeline is run as below:

result = query_pipeline.run({"text_embedder": {"text": query_}, "embedding_retriever": {"custom_query": custom_query_with_metadata_filter}})

I would like to have the below custom query return chunks where the content is semantically similar to the query and the file name (which is a metadata field of the chunks) contains the text "Coral Gold Resources" or "CoralGoldResources".

custom_query_with_metadata_filter = {
    "query": {
        "bool": {
            "must": [  # must: clauses are combined with a logical AND; should: clauses are combined with a logical OR
                {
                    "knn": {
                        "embedding": {
                            "vector": "$query_embedding",
                            "k": 100,
                        }
                    }
                },
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "file_path": "Coral Gold Resources"
                                }
                            },
                            {
                                "match": {
                                    "file_path": "CoralGoldResources"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

Thanks in advance for taking the time to look into this.

sanjayc2 • Apr 11 '25 13:04

Hi,

I'm just following up on this. I see it has been tagged as a feature request. However, in the "Describe a Solution you would like" section of my request, I had asked if my suggested "workaround" could be made to work. I would appreciate it if someone could get back to me about a short-term solution that you could code, while we wait for a response and resolution to the feature request.

In general, I must say you guys have been very responsive. Thanks v much in advance.

sanjayc2 • Apr 24 '25 17:04

Hi @sanjayc2,

I suspect that what you are trying to achieve can be done by relying solely on OpenSearch functionality.

First, I think you need to mark the field as metadata by prefixing the field name with "metadata.", for example:

custom_query_meta_filter = {
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "metadata.file_path": "Coral Gold Resources"
                    }
                },
                {
                    "match":  {
                        "metadata.file_path": "CoralGoldResources"
                    }
                }
            ]
        }
    }
}

You can also try fuzzy matching by adding

"fuzziness": "AUTO"

to every match clause. Beware, though, that fuzziness is case-sensitive, so lowercase (or uppercase) all the file_path values during pre-processing.
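As a rough sketch of the fuzziness suggestion: in the OpenSearch query DSL, per-field options such as fuzziness are set through the expanded form of a match clause, where the field maps to an object holding "query" plus the options. The helper name fuzzy_match is hypothetical, the lowercasing of the query text only mirrors the pre-processing advice above (the stored file_path values would need the same normalization at indexing time), and the metadata.file_path field name simply follows the example above, so check it against your index mapping.

```python
def fuzzy_match(field, text, fuzziness="AUTO"):
    """Build an expanded-form match clause with fuzziness (hypothetical helper).

    The query text is lowercased here as an assumption, mirroring the advice
    to normalize the case of file paths during pre-processing.
    """
    return {"match": {field: {"query": text.lower(), "fuzziness": fuzziness}}}


# Field name "metadata.file_path" is taken from the example above, not verified.
custom_query_fuzzy = {
    "query": {
        "bool": {
            "should": [
                fuzzy_match("metadata.file_path", "Coral Gold Resources"),
                fuzzy_match("metadata.file_path", "CoralGoldResources"),
            ]
        }
    }
}
```

This only builds the query body; whether it returns the expected chunks depends on how file_path is mapped and analyzed in the index.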

Another alternative is to run a regular expression against that specific metadata field, for example:

{
  "query": {
    "regexp": {
      "metadata.file_path": {
        "value": "Coral\\s*Gold\\s*Resources",
        "flags": "ALL",
        "case_insensitive": true
      }
    }
  }
}
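To restrict the embedding search to matching files, the regexp clause could presumably be placed next to the knn clause inside a bool/must list, following the structure of the custom query in the original post. The sketch below only builds that dictionary in Python; the helper name knn_with_regexp_filter is hypothetical, the metadata.file_path field name is the one assumed above, and "$query_embedding" is the placeholder used in the issue. Note that JSON true becomes Python True, and a raw string avoids doubling the backslashes.

```python
def knn_with_regexp_filter(pattern, field="metadata.file_path", k=100):
    """Combine a knn clause with a regexp clause under bool/must (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Embedding search; the placeholder name is taken from the issue.
                    {"knn": {"embedding": {"vector": "$query_embedding", "k": k}}},
                    # Metadata restriction; the field name is an unverified assumption.
                    {
                        "regexp": {
                            field: {
                                "value": pattern,
                                "flags": "ALL",
                                "case_insensitive": True,
                            }
                        }
                    },
                ]
            }
        }
    }


custom_query_with_regexp = knn_with_regexp_filter(r"Coral\s*Gold\s*Resources")
```

The resulting dictionary could then be passed as custom_query when running the pipeline, in the same way as the queries in the original post.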

I hope this helps! Let me know how it goes!

davidsbatista • May 15 '25 14:05