haystack icon indicating copy to clipboard operation
haystack copied to clipboard

refactor: remove dummy vectors in weaviate

Open hsm207 opened this issue 2 years ago • 9 comments

Related Issues

  • fixes #4007

Proposed Changes:

Removed the dummy vector logic because weaviate can index documents without a vector

How did you test it?

Updated the document fixture to include a document without an embedding

Notes for the reviewer

Checklist

  • [x] I have read the contributors guidelines and the code of conduct
  • [x] I have updated the related issue with new insights and changes
  • [x] I added tests that demonstrate the correct behavior of the change
  • [x] I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
  • [x] I documented my code
  • [x] I ran pre-commit hooks and fixed any issue

hsm207 avatar Jan 30 '23 17:01 hsm207

Hey, @hsm207 thanks for the addition! Also, I noticed that the function get_embedding_count should be updated https://github.com/deepset-ai/haystack/blob/fffa2288639348690a69e91265f8bc8ff0fa4bc0/haystack/document_stores/weaviate.py#L598-L603 to reflect that not all documents will necessarily have embeddings anymore.

Similarily, this test in test_weaviate.py https://github.com/deepset-ai/haystack/blob/c855e18d7846728c4982922e768d8e4794cab5c3/test/document_stores/test_weaviate.py#L255-L262 should be updated to reflect that only 6 documents have embeddings now in the test.

sjrl avatar Jan 31 '23 12:01 sjrl

hey @sjrl,

Thanks for your review!

At the moment, it is not possible to apply a filter to only count the documents that have an embedding. I've made a feature request for this: weaviate/weaviate#2590

In the meantime, can we make this PR as draft or do you suggest we close it instead?

hsm207 avatar Feb 01 '23 12:02 hsm207

@hsm207 Thanks for opening the issue in the weaviate library. However, I think we can employ the same method as used by the InMemoryDocumentStore https://github.com/deepset-ai/haystack/blob/7b3d7ee83a277c9a7ab9a8c7f1c0f02bd7d905ad/haystack/document_stores/memory.py#L572-L578 This might not be the most efficient, but we can use this solution for now and update this once the issue in weaviate is resolved. What do you think?

sjrl avatar Feb 01 '23 12:02 sjrl

@sjrl

However, I think we can employ the same method as used by the InMemoryDocumentStore

The approach used in the InMemoryDocumentStore is to retrieve all the docs and then loop through them to check for the embedding property.

At the moment, this approach will fail in weaviate if the number of docs retrieved is more than QUERY_MAXIMUM_RESULTS which defaults to 10,000. There's another feature request to fix this (see: weaviate/weaviate#2302)

If we proceed with this approach, then we will will introduce another limitation due to this failure mode. The other being updating embeddings (see: #3390)

I don't think this PR is worth the trade off.

hsm207 avatar Feb 01 '23 13:02 hsm207

@hsm207 Thanks for the additional information. In that case, it could be worth just raising a NotImplementedError for get_embedding_count while we wait for the above-mentioned fixes if you believe this change is urgent. Otherwise I agree that waiting for the other issues to be resolved before merging this PR would be a good idea.

sjrl avatar Feb 01 '23 13:02 sjrl

@sjrl

it could be worth just raising a NotImplementedError for get_embedding_count while we wait for the above-mentioned fixes if you believe this change is urgent.

I don't believe this change to be urgent because because the overhead introduced by the dummy vector is minimal since it is built using numpy (unless the user regularly uploads millions of docs at a time!).

So, I'll leave this PR as a draft and pick it back up when the required features in weaviate are implemented.

hsm207 avatar Feb 01 '23 14:02 hsm207

@hsm207 any news on this?

silvanocerza avatar Mar 29 '23 13:03 silvanocerza

@silvanocerza

@hsm207 any news on this?

Issue https://github.com/weaviate/weaviate/issues/2590 is still in the backlog and not part of our roadmap.

However, @sjrl suggestion is possible now starting version 1.18.0 using the cursor API

hsm207 avatar Mar 29 '23 14:03 hsm207

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar May 18 '23 10:05 CLAassistant

Not sure about the status, I'm closing this for now we can re-evaluate later

masci avatar Nov 23 '23 08:11 masci