haystack-core-integrations
Extending MongoDBAtlasDocumentStore to support custom schema
Is your feature request related to a problem? Please describe.
The current implementation of MongoDBAtlasDocumentStore supports only one specific MongoDB document schema. Content is expected to be stored in the content field, and metadata must be within a meta subdocument. This schema requirement is enforced by the $project stage in the aggregation pipeline executed by the _embedding_retrieval function:
{
    "$vectorSearch": {
        "index": self.vector_search_index,
        "path": "embedding",
        "queryVector": query_embedding,
        "numCandidates": 100,
        "limit": top_k,
        "filter": filters,
    }
},
{
    "$project": {
        "_id": 0,
        "content": 1,
        "dataframe": 1,
        "blob": 1,
        "meta": 1,
        "embedding": 1,
        "score": {"$meta": "vectorSearchScore"},
    }
}
This tightly couples the Haystack Document representation to the database schema, which can be inconvenient. I have a vector store in MongoDB with an existing schema that was defined when I was using LangChain. Specifically, the document's content is stored in a text field, and some metadata is stored in separate top-level fields of the MongoDB document (for example, source holds a reference to the original document's location). I would prefer to avoid migrating to a new schema dictated by MongoDBAtlasDocumentStore.
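To make the mismatch concrete, here is a small pure-Python simulation of what the fixed projection does to a document in my layout. It is not MongoDB code: simulate_projection is a hypothetical helper that only mimics the inclusion part of $project (the score field is left out), and legacy_doc is made-up sample data.

```python
# Inclusion flags copied from the $project stage shown above
# (the "_id" exclusion and the "score" expression are omitted here).
FIXED_PROJECTION = {"content": 1, "dataframe": 1, "blob": 1, "meta": 1, "embedding": 1}


def simulate_projection(doc: dict, projection: dict) -> dict:
    """Keep only the fields flagged with 1 that actually exist in the document."""
    return {key: doc[key] for key, flag in projection.items() if flag == 1 and key in doc}


# Hypothetical document stored in the legacy layout described above.
legacy_doc = {
    "text": "MongoDB Atlas supports vector search.",
    "source": "docs/atlas.md",
    "embedding": [0.1, 0.2, 0.3],
}

projected = simulate_projection(legacy_doc, FIXED_PROJECTION)
print(projected)  # only "embedding" survives; "text" and "source" are dropped
```

With this layout, the retrieved document ends up with an embedding but no content and no metadata.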
Describe the solution you'd like
I propose adding the ability to optionally override parts of the $project stage of the aggregation pipeline, while keeping the existing behavior as the default. For example, initializing MongoDBAtlasDocumentStore could look like this:
MongoDBAtlasDocumentStore(
    database_name="db",
    collection_name="embedded_docs",
    vector_search_index="index",
    # Proposed parameter: maps the "text" field in MongoDB to the Document's content
    content_field_key="text",
    # Proposed parameter: builds the Document's meta from fields of the MongoDB document
    meta_project_mapping={"source": "$source"},
)
self.content_field_key and self.meta_project_mapping would then be used in the $project stage of the aggregation pipeline. What do you think?
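As a rough sketch of how the two proposed parameters could feed the $project stage: build_project_stage is a hypothetical helper, not part of the library, and content_field_key and meta_project_mapping are the names proposed in this issue, not existing arguments. With the defaults it reproduces the current stage.

```python
from typing import Any, Dict, Optional


def build_project_stage(
    content_field_key: str = "content",
    meta_project_mapping: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a $project stage that maps stored fields onto Document fields."""
    projection: Dict[str, Any] = {
        "_id": 0,
        "dataframe": 1,
        "blob": 1,
        "embedding": 1,
        "score": {"$meta": "vectorSearchScore"},
    }
    # Default: include "content" as stored. Otherwise expose the custom
    # field under the name "content" via a field-path expression.
    projection["content"] = 1 if content_field_key == "content" else f"${content_field_key}"
    # Default: include the "meta" subdocument as stored. Otherwise build
    # "meta" from the given mapping of meta keys to field-path expressions.
    projection["meta"] = 1 if meta_project_mapping is None else dict(meta_project_mapping)
    return {"$project": projection}


default_stage = build_project_stage()
custom_stage = build_project_stage(
    content_field_key="text",
    meta_project_mapping={"source": "$source"},
)
print(custom_stage)
```

For the schema described above, the custom stage projects "content": "$text" and "meta": {"source": "$source"}, so the rest of the retrieval code could keep building Documents from content and meta as it does today.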
Describe alternatives you've considered
I extended MongoDBAtlasDocumentStore in my project and made the described change. While this approach works, I was wondering whether it would be beneficial to include it in the library.
Additional context
I can submit a PR :)
Hi, can I take this issue? Can you give me links to similar PRs?
Hey @verkhovin this has been partially addressed in:
- https://github.com/deepset-ai/haystack-core-integrations/pull/1721
- https://github.com/deepset-ai/haystack-core-integrations/pull/1708
which allow you to specify custom content and embedding fields.