refactor: remove dummy vectors in weaviate
Related Issues
- fixes #4007
Proposed Changes:
Removed the dummy vector logic because weaviate can index documents without a vector
How did you test it?
Updated the document fixture to include a document without an embedding
Notes for the reviewer
Checklist
- [x] I have read the contributors guidelines and the code of conduct
- [x] I have updated the related issue with new insights and changes
- [x] I added tests that demonstrate the correct behavior of the change
- [x] I've used one of the conventional commit types for my PR title:
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:. - [x] I documented my code
- [x] I ran pre-commit hooks and fixed any issue
Hey, @hsm207 thanks for the addition! Also, I noticed that the function get_embedding_count should be updated https://github.com/deepset-ai/haystack/blob/fffa2288639348690a69e91265f8bc8ff0fa4bc0/haystack/document_stores/weaviate.py#L598-L603
to reflect that not all documents will necessarily have embeddings anymore.
Similarily, this test in test_weaviate.py https://github.com/deepset-ai/haystack/blob/c855e18d7846728c4982922e768d8e4794cab5c3/test/document_stores/test_weaviate.py#L255-L262
should be updated to reflect that only 6 documents have embeddings now in the test.
hey @sjrl,
Thanks for your review!
At the moment, it is not possible to apply a filter to only count the documents that have an embedding. I've made a feature request for this: weaviate/weaviate#2590
In the meantime, can we make this PR as draft or do you suggest we close it instead?
@hsm207 Thanks for opening the issue in the weaviate library. However, I think we can employ the same method as used by the InMemoryDocumentStore https://github.com/deepset-ai/haystack/blob/7b3d7ee83a277c9a7ab9a8c7f1c0f02bd7d905ad/haystack/document_stores/memory.py#L572-L578 This might not be the most efficient, but we can use this solution for now and update this once the issue in weaviate is resolved. What do you think?
@sjrl
However, I think we can employ the same method as used by the InMemoryDocumentStore
The approach used in the InMemoryDocumentStore is to retrieve all the docs and then loop through them to check for the embedding property.
At the moment, this approach will fail in weaviate if the number of docs retrieved is more than QUERY_MAXIMUM_RESULTS which defaults to 10,000. There's another feature request to fix this (see: weaviate/weaviate#2302)
If we proceed with this approach, then we will will introduce another limitation due to this failure mode. The other being updating embeddings (see: #3390)
I don't think this PR is worth the trade off.
@hsm207 Thanks for the additional information. In that case, it could be worth just raising a NotImplementedError for get_embedding_count while we wait for the above-mentioned fixes if you believe this change is urgent. Otherwise I agree that waiting for the other issues to be resolved before merging this PR would be a good idea.
@sjrl
it could be worth just raising a
NotImplementedErrorforget_embedding_countwhile we wait for the above-mentioned fixes if you believe this change is urgent.
I don't believe this change to be urgent because because the overhead introduced by the dummy vector is minimal since it is built using numpy (unless the user regularly uploads millions of docs at a time!).
So, I'll leave this PR as a draft and pick it back up when the required features in weaviate are implemented.
@hsm207 any news on this?
@silvanocerza
@hsm207 any news on this?
Issue https://github.com/weaviate/weaviate/issues/2590 is still in the backlog and not part of our roadmap.
However, @sjrl suggestion is possible now starting version 1.18.0 using the cursor API
Not sure about the status, I'm closing this for now we can re-evaluate later