haystack-core-integrations
QdrantDocumentStore add optional index parameter for querying
Is your feature request related to a problem? Please describe. I am using Haystack with FastAPI as the base for a chatbot. I want to support the use of different collections through the API.
Describe the solution you'd like Currently, as far as I can see, I would have to at least initialize a whole new DocumentStore, or even a whole new pipeline. I would like to add an optional index parameter to the querying functions:
_query_by_sparse _query_by_embedding _query_hybrid
basically just like in get_documents_by_id. Additionally, the run functions of the Qdrant retrievers would need that optional index parameter as well.
It would also be nice to have that for the DocumentWriter, but I'm not sure if it makes sense, since it is not specialized for Qdrant and I don't know how other document stores behave in that case. But since indexing is generally not done as frequently as querying, it's not that big of a deal to reinitialize the pipeline.
Describe alternatives you've considered As far as I can tell, the only alternative right now would be to reinitialize my pipeline every time with the adjusted document store, even though only one variable changes.
Let me know if I've overlooked something or if you think this idea makes sense as well. I'd be happy to make a PR for it :)
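To make the request concrete, here is a minimal sketch of the pattern being asked for: a per-call index argument that falls back to the index chosen at init time. `ToyStore` and its methods are hypothetical stand-ins written for illustration only; this is not the QdrantDocumentStore API and not how the library implements its query functions.

```python
from typing import Dict, List, Optional


class ToyStore:
    """In-memory stand-in for a document store (hypothetical, not QdrantDocumentStore)."""

    def __init__(self, index: str) -> None:
        self.index = index  # default collection, fixed at init time
        self._collections: Dict[str, List[str]] = {}

    def write(self, text: str, index: Optional[str] = None) -> None:
        # Store the text in the requested collection, or in the default one.
        self._collections.setdefault(index or self.index, []).append(text)

    def query(self, term: str, index: Optional[str] = None) -> List[str]:
        # Fall back to the init-time index when no per-call override is given.
        docs = self._collections.get(index or self.index, [])
        return [doc for doc in docs if term.lower() in doc.lower()]
```

With something like this, one store instance could serve several collections, e.g. `store.query("hello", index="project-b")`, without being reinitialized.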
Hello @ruben-vb how many different indices are you planning to work with in your use case? So far, initializing and using multiple DocumentStores in one pipeline (or having separate pipelines) works in the examples I have seen. OpenSearchDocumentStore or ElasticsearchDocumentStore and their respective retrievers also don't have an index parameter for querying. Only at init time of OpenSearchDocumentStore or ElasticsearchDocumentStore can the index name be specified. So it's consistent with QdrantDocumentStore.
Hey @julian-risch,
For the time being it's just two, but we're planning to support more in the future for different projects so that they can easily integrate the API. We're using MinIO as the file storage, and my idea was that other projects would just need to create a MinIO bucket, upload files, and call the API with the bucket as a query parameter, without any additional adjustments to the code or environment.
But I think for my case, using env variables and several document store/pipeline instances based on them is a viable solution, since the number of future indices is manageable. So I'll go with that, thanks :)
OK, as a little update: I've found a decent solution. Every heavier part of the initialization for my indexing pipelines now happens inside my FastAPI background tasks. This way it is no problem that I have to reinitialize my whole pipeline for this to work as I initially intended.
It still creates about 150 ms of overhead for my querying operations, but that's acceptable, as it's not a noticeable difference since querying and generating the response takes some time anyway.
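For anyone landing here with the same setup, the "one pipeline per index, driven by env variables" approach could look roughly like the sketch below. The env variable name `QDRANT_INDICES`, the `PipelineRegistry` class, and the `build_pipeline` callable are all assumptions made up for this example; in a real app `build_pipeline` would be your own function that creates a document store for the given index and wires it into a pipeline.

```python
import os
from typing import Any, Callable, Dict, Iterable, List


def allowed_indices(env_var: str = "QDRANT_INDICES") -> List[str]:
    """Read a comma-separated list of index names from an env variable (name is an assumption)."""
    raw = os.environ.get(env_var, "")
    return [name.strip() for name in raw.split(",") if name.strip()]


class PipelineRegistry:
    """Builds one pipeline per index on first use and reuses it afterwards."""

    def __init__(self, build_pipeline: Callable[[str], Any], indices: Iterable[str]) -> None:
        self._build = build_pipeline      # user-supplied factory: index name -> pipeline
        self._allowed = set(indices)      # only configured indices may be requested
        self._cache: Dict[str, Any] = {}  # index name -> already-built pipeline

    def get(self, index: str) -> Any:
        if index not in self._allowed:
            raise KeyError(f"unknown index: {index!r}")
        if index not in self._cache:
            # Pay the initialization cost once per index, not once per request.
            self._cache[index] = self._build(index)
        return self._cache[index]
```

An API handler would then call `registry.get(bucket_name)` and run the returned pipeline, so each index pays its setup cost only once per process.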
I understand that, as it is right now, it's consistent with the other document stores. But it also kind of feels inconsistent in itself to me since the get_documents_by_id method allows specifying an index while the other methods don't.
I think we can close this issue now.
Indeed, in Haystack, a document store corresponds to an index.
The observation about the get_documents_by_id method is correct and we should remove the index parameter for consistency.