langchain icon indicating copy to clipboard operation
langchain copied to clipboard

community: Milvus supports add & delete texts by ids

Open jaelgu opened this issue 5 months ago • 4 comments

Description

To support langchain indexing as requested by users, vectorstore Milvus needs to support:

  • document addition by id (add_documents method with ids argument)
  • delete by id (delete method with ids argument)

Example usage:

from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings

collection_name = "test_index"
embedding = OpenAIEmbeddings()
vectorstore = Milvus(embedding_function=embedding, collection_name=collection_name)

namespace = f"milvus/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()

doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

index(
    [doc1, doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",  # None, "incremental", or "full"
    source_id_key="source",
)

Fix issues

Fix https://github.com/milvus-io/milvus/issues/30112

jaelgu avatar Jan 19 '24 07:01 jaelgu

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 29, 2024 2:06am

vercel[bot] avatar Jan 19 '24 07:01 vercel[bot]

PR Analysis

  • Main theme: Enhancing Milvus vector store compatibility and functionality
  • PR summary: This PR introduces the ability to add and delete documents by ID in the Milvus vector store, improving its compatibility with the langchain indexing feature. It includes changes to the primary field data type and adds the necessary methods for document management by ID.
  • Type of PR: Enhancement
  • Score (0-100, higher is better): 85
  • Relevant tests added: Yes
  • Estimated effort to review (1-5, lower is better): 2 The PR is well scoped and the changes are clear, but careful attention is needed to ensure the new functionality integrates smoothly with the existing codebase.

PR Feedback

  • General suggestions: The PR effectively introduces new functionalities for adding and deleting documents by ID, which is a significant enhancement. However, there are some concerns regarding the data type changes and potential edge cases that might not be covered by the current implementation. It's also important to ensure that the new features are compatible with existing data and to consider the implications of changing the primary key data type on existing collections.

Code feedback

file: libs/community/langchain_community/vectorstores/milvus.py

  • Suggestions:
  1. Consider validating the ids list to ensure all provided IDs are unique to prevent potential primary key conflicts. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 465
+ ids: Optional[List[str]] = None,
  1. Ensure that the max_length for the DataType.VARCHAR is consistent with the expected ID sizes to avoid truncation issues. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 329
+ self._primary_field, DataType.VARCHAR, is_primary=True, auto_id=False, max_length=65_535
  1. Add error handling for the delete method to manage cases where deletion fails due to missing or incorrect IDs. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 864
+ def delete(self,
  1. Review the necessity of removing the primary field from the output fields in the similarity_search_with_score_by_vector method, as this might be useful information for the user. [medium] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 700
+ output_fields = [x for x in self.fields if x != self._primary_field]
  1. Consider implementing a migration strategy for collections that were created with the previous INT64 primary key type to the new VARCHAR type. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 329
+ self._primary_field, DataType.VARCHAR, is_primary=True, auto_id=False, max_length=65_535

file: libs/community/tests/integration_tests/vectorstores/test_milvus.py

  • Suggestions:
  1. Expand the test coverage to include scenarios where documents are added with duplicate IDs and ensure the system behaves as expected. [important] Relevant line:In libs/community/tests/integration_tests/vectorstores/test_milvus.py at line 11
+def _milvus_from_texts(

adhirpotdarbito avatar Jan 20 '24 13:01 adhirpotdarbito

@baskaryan Added the following according to PR analysis:

  • allow control of auto_id
    • auto_id = False (by default): user needs to provide ids (string, len <= 65535 bytes) when adding texts
    • auto_id = True (work same as before): adapt previous collection, auto id will be integers
    • cls.add_texts will set auto_id False if ids is provided else True
  • value check of ids
  • search results: output primary key with metadata when auto_id disabled
  • test case for duplicate ids

jaelgu avatar Jan 22 '24 11:01 jaelgu

@baskaryan updated to resolve conflicts due to recent commits in master

jaelgu avatar Jan 25 '24 02:01 jaelgu

@baskaryan updated to resolve conflicts with master again

jaelgu avatar Jan 26 '24 02:01 jaelgu

@jaelgu @baskaryan hi all, this PR is incompatible with old version, and aslo not working even set auto_id=False.

issue related: #17147 #17006 etc..

Please take a look.

axiangcoding avatar Feb 07 '24 10:02 axiangcoding