langchain
langchain copied to clipboard
community: Milvus supports add & delete texts by ids
Description
To support langchain indexing as requested by users, vectorstore Milvus needs to support:
- document addition by id (
add_documents
method withids
argument) - delete by id (
delete
method withids
argument)
Example usage:
from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings
collection_name = "test_index"
embedding = OpenAIEmbeddings()
vectorstore = Milvus(embedding_function=embedding, collection_name=collection_name)
namespace = f"milvus/{collection_name}"
record_manager = SQLRecordManager(
namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()
doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})
index(
[doc1, doc1, doc2],
record_manager,
vectorstore,
cleanup="incremental", # None, "incremental", or "full"
source_id_key="source",
)
Fix issues
Fix https://github.com/milvus-io/milvus/issues/30112
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
langchain | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jan 29, 2024 2:06am |
PR Analysis
- Main theme: Enhancing Milvus vector store compatibility and functionality
- PR summary: This PR introduces the ability to add and delete documents by ID in the Milvus vector store, improving its compatibility with the langchain indexing feature. It includes changes to the primary field data type and adds the necessary methods for document management by ID.
- Type of PR: Enhancement
- Score (0-100, higher is better): 85
- Relevant tests added: Yes
- Estimated effort to review (1-5, lower is better): 2 The PR is well scoped and the changes are clear, but careful attention is needed to ensure the new functionality integrates smoothly with the existing codebase.
PR Feedback
- General suggestions: The PR effectively introduces new functionalities for adding and deleting documents by ID, which is a significant enhancement. However, there are some concerns regarding the data type changes and potential edge cases that might not be covered by the current implementation. It's also important to ensure that the new features are compatible with existing data and to consider the implications of changing the primary key data type on existing collections.
Code feedback
file: libs/community/langchain_community/vectorstores/milvus.py
- Suggestions:
- Consider validating the
ids
list to ensure all provided IDs are unique to prevent potential primary key conflicts. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 465
+ ids: Optional[List[str]] = None,
- Ensure that the
max_length
for theDataType.VARCHAR
is consistent with the expected ID sizes to avoid truncation issues. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 329
+ self._primary_field, DataType.VARCHAR, is_primary=True, auto_id=False, max_length=65_535
- Add error handling for the
delete
method to manage cases where deletion fails due to missing or incorrect IDs. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 864
+ def delete(self,
- Review the necessity of removing the primary field from the output fields in the
similarity_search_with_score_by_vector
method, as this might be useful information for the user. [medium] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 700
+ output_fields = [x for x in self.fields if x != self._primary_field]
- Consider implementing a migration strategy for collections that were created with the previous
INT64
primary key type to the newVARCHAR
type. [important] Relevant line:In libs/community/langchain_community/vectorstores/milvus.py at line 329
+ self._primary_field, DataType.VARCHAR, is_primary=True, auto_id=False, max_length=65_535
file: libs/community/tests/integration_tests/vectorstores/test_milvus.py
- Suggestions:
- Expand the test coverage to include scenarios where documents are added with duplicate IDs and ensure the system behaves as expected. [important] Relevant line:In libs/community/tests/integration_tests/vectorstores/test_milvus.py at line 11
+def _milvus_from_texts(
@baskaryan Added the following according to PR analysis:
- allow control of auto_id
- auto_id = False (by default): user needs to provide ids (string, len <= 65535 bytes) when adding texts
- auto_id = True (work same as before): adapt previous collection, auto id will be integers
-
cls.add_texts
will set auto_id False if ids is provided else True
- value check of ids
- search results: output primary key with metadata when auto_id disabled
- test case for duplicate ids
@baskaryan updated to resolve conflicts due to recent commits in master
@baskaryan updated to resolve conflicts with master again
@jaelgu @baskaryan hi all, this PR is incompatible with old version, and aslo not working even set auto_id=False
.
issue related: #17147 #17006 etc..
Please take a look.