Extending MongoDBAtlasDocumentStore to support custom schema

Open verkhovin opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe. The current implementation of MongoDBAtlasDocumentStore supports only one specific MongoDB document schema. Content is expected to be stored in the content field, and metadata must be within a meta subdocument. This schema requirement is enforced by the $project stage in the aggregation pipeline executed by the _embedding_retrieval function:

            {
                "$vectorSearch": {
                    "index": self.vector_search_index,
                    "path": "embedding",
                    "queryVector": query_embedding,
                    "numCandidates": 100,
                    "limit": top_k,
                    "filter": filters,
                }
            },
            {
                "$project": {
                    "_id": 0,
                    "content": 1,
                    "dataframe": 1,
                    "blob": 1,
                    "meta": 1,
                    "embedding": 1,
                    "score": {"$meta": "vectorSearchScore"},
                }
            }

This tightly couples the Haystack Document representation with the database schema, which can be inconvenient. I have a vector store in MongoDB with an existing schema that was defined when I was using LangChain. Specifically, the document's content is stored in a text field, and some metadata is stored in separate top-level fields of the MongoDB document (for example, source, which stores a reference to the original document's location). I would prefer to avoid migrating to a new schema dictated by MongoDBAtlasDocumentStore.

Describe the solution you'd like I propose adding the ability to optionally override parts of the $project stage of the aggregation pipeline, while retaining the existing behavior as the default. For example, initializing the MongoDBAtlasDocumentStore could look like this:

MongoDBAtlasDocumentStore(
    database_name="db",
    collection_name="embedded_docs",
    vector_search_index="index",
    content_field_key="text",  # maps the "text" field in MongoDB to the Document's content
    meta_project_mapping={"source": "$source"},  # builds meta flexibly from fields of the MongoDB document
)

self.content_field_key and self.meta_project_mapping would then be used in the $project aggregation pipeline stage. What do you think?
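As a rough illustration of how the two proposed options could feed the $project stage, here is a minimal sketch. The helper name build_project_stage is hypothetical and not part of the library, and content_field_key and meta_project_mapping are the parameters proposed in this issue, not existing ones. With no arguments it reproduces the current stage shown above.

```python
from typing import Any, Dict, Optional


def build_project_stage(
    content_field_key: str = "content",
    meta_project_mapping: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a $project stage for a custom schema."""
    # Default: pass "content" through unchanged. Otherwise project the custom
    # field under the name "content" using a "$field" reference.
    content = 1 if content_field_key == "content" else f"${content_field_key}"
    # Default (no mapping, or an empty one): keep the existing "meta"
    # subdocument. Otherwise build "meta" from the given field references.
    meta = dict(meta_project_mapping) if meta_project_mapping else 1
    return {
        "$project": {
            "_id": 0,
            "content": content,
            "dataframe": 1,
            "blob": 1,
            "meta": meta,
            "embedding": 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    }


# Map the "text" field to content and build meta from the "source" field.
stage = build_project_stage("text", {"source": "$source"})
print(stage["$project"]["content"])
print(stage["$project"]["meta"])
```

Keeping 1 as the default for both fields means existing collections that already follow the content/meta layout would see no change in behavior.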

Describe alternatives you've considered I extended MongoDBAtlasDocumentStore in my project and made the described change. While this approach works, I was wondering if it would be beneficial to include it in the library.

Additional context I can submit a PR :)

verkhovin avatar Apr 24 '24 21:04 verkhovin

Hi, can I take this issue? Can you give me links to similar PRs?

MetroCat69 avatar Apr 16 '25 08:04 MetroCat69

Hey @verkhovin this has been partially addressed in:

  • https://github.com/deepset-ai/haystack-core-integrations/pull/1721
  • https://github.com/deepset-ai/haystack-core-integrations/pull/1708

which allow you to specify custom content and embedding fields.

sjrl avatar May 15 '25 13:05 sjrl