langchain
[community] Added SentenceWindowRetriever
- [x] PR title: "package: description"
- Added appropriate title
Description
- Adds a new type of retriever called Sentence Window Retriever
- Also adds a modification to TextSplitter to help implement the retriever
- [x] Add tests and docs: No tests added yet. Let me know if any specific tests are required. I plan to add documentation on how to run the retriever.
- [x] Lint and test: Tests ran successfully
The latest updates on your projects (Vercel deployment status):
Name | Status | Updated (UTC) |
---|---|---|
langchain | ✅ Ready | May 15, 2024 10:57am |
@eyurtsev I have modified the implementation of SWR to be datastore agnostic.
I went with the approach of defining a `get_document_by_ids` function on the vectorstores, which provides a common method for querying a vectorstore by ID.
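To illustrate why an ID lookup is enough to make the retriever datastore agnostic, here is a minimal sketch. It assumes each sentence chunk is stored under a sequential integer ID (as a string), which is an assumption for illustration and not something confirmed by the PR. `Doc`, `InMemoryStore`, and `expand_window` are hypothetical names; only `get_document_by_ids` comes from the comment above.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Doc:
    """Hypothetical stand-in for one stored sentence chunk."""
    id: str
    text: str


class SupportsIdLookup(Protocol):
    """Anything exposing the ID lookup described in the comment above."""
    def get_document_by_ids(self, ids: List[str]) -> List[Doc]: ...


class InMemoryStore:
    """Toy store keyed by ID; a real vectorstore would query its backend."""

    def __init__(self, docs: List[Doc]) -> None:
        self._docs: Dict[str, Doc] = {d.id: d for d in docs}

    def get_document_by_ids(self, ids: List[str]) -> List[Doc]:
        # Skip IDs that do not exist, e.g. when the window runs past an edge.
        return [self._docs[i] for i in ids if i in self._docs]


def expand_window(store: SupportsIdLookup, hit_index: int, window: int = 1) -> str:
    """Join the matched sentence with `window` neighbours on each side."""
    # Assumption: chunk IDs are consecutive integers rendered as strings.
    ids = [str(i) for i in range(hit_index - window, hit_index + window + 1)]
    return " ".join(d.text for d in store.get_document_by_ids(ids))


sentences = ["Alpha.", "Beta.", "Gamma.", "Delta."]
store = InMemoryStore([Doc(id=str(i), text=s) for i, s in enumerate(sentences)])
print(expand_window(store, hit_index=2))  # Beta. Gamma. Delta.
```

Because `expand_window` depends only on the lookup method, any store that implements it could be swapped in without changing the retriever logic.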
One of the issues with the implementation of SWR is that the search functionality is not standardized across vectorstores:
- Chroma: has `similarity_search_by_vector` and `similarity_search_by_vector_with_score`
- Pinecone: does not have `similarity_search_by_vector`, only `similarity_search_by_vector_with_score`
- Milvus: has `similarity_search_by_vector`, but instead of `similarity_search_by_vector_with_score` it has `similarity_search_with_score_by_vector`, which is probably a typo
I can work around the different search method names for now, but it would be helpful for Pinecone to also have a `similarity_search_by_vector` implementation and for Milvus to use the same standardized function names as the other vectorstores.
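One way such a workaround could look is a small dispatch helper that probes for whichever method a store exposes. This is a sketch, not the code in the PR: the helper name `search_by_vector` and the `ScoredOnlyStore` class are hypothetical, and it assumes every method accepts `(embedding, k=...)` and that the scored variants return `(document, score)` pairs.

```python
from typing import Any, List

# Method names as listed in the comment above.
_UNSCORED = ("similarity_search_by_vector",)
_SCORED = (
    "similarity_search_by_vector_with_score",
    "similarity_search_with_score_by_vector",
)


def search_by_vector(store: Any, embedding: List[float], k: int = 4) -> List[Any]:
    """Return documents for an embedding using whichever method the store has."""
    for name in _UNSCORED:
        method = getattr(store, name, None)
        if callable(method):
            return list(method(embedding, k=k))
    for name in _SCORED:
        method = getattr(store, name, None)
        if callable(method):
            # Assumption: scored variants yield (document, score) pairs.
            return [doc for doc, _score in method(embedding, k=k)]
    raise AttributeError(f"{type(store).__name__} has no vector search method")


class ScoredOnlyStore:
    """Hypothetical store exposing only a scored search."""

    def similarity_search_by_vector_with_score(self, embedding, k=4):
        return [("doc-a", 0.9), ("doc-b", 0.7)][:k]


print(search_by_vector(ScoredOnlyStore(), [0.1, 0.2], k=1))  # ['doc-a']
```

Standardizing the method names upstream would make a shim like this unnecessary.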
@eyurtsev @efriis Could I get a review on this?
Let me know if I need to add any additional details to explain the changes made.
I did have a question on how to include `langchain_pinecone` as a dependency within community. The unit tests throw an error when I import `PineconeVectorStore` from `langchain_pinecone`.
@eyurtsev @efriis Can I get a review on this?
Deployment failed with the following error:
The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.
@eyurtsev @efriis @hwchase17 @baskaryan Can I get a review on this?
@rsk2327 You'll need to stand by for ~1 month. We'll be focusing on the vectorstore abstractions after the 0.2 release.
The main things so far:
- Addition of `get_documents_by_ids` to the base abstraction
- Potentially addition of an `id` attribute on a document (so the ID is not randomly in the metadata).
Text splitters:
- Determine what, if any, kind of metadata we should be propagating in the text splitter for provenance purposes
I'll leave some comments in the PR itself as well
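For readers following the thread, the two vectorstore items might look roughly like the sketch below. It is illustrative only and does not reflect LangChain's actual classes or final design: `Document`, `VectorStoreBase`, and `DictStore` are hypothetical, and only the `get_documents_by_ids` name and the idea of a first-class `id` attribute come from the comment above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class Document:
    """Illustrative document type with a first-class ID."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None  # lives on the document, not in metadata["id"]


class VectorStoreBase(ABC):
    """Hypothetical base abstraction exposing a common ID lookup."""

    @abstractmethod
    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Document]:
        """Return the stored documents matching the given IDs."""


class DictStore(VectorStoreBase):
    """Toy implementation backed by a dict."""

    def __init__(self, docs: Sequence[Document]) -> None:
        # Only documents with an explicit ID can be looked up later.
        self._by_id = {d.id: d for d in docs if d.id is not None}

    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Document]:
        return [self._by_id[i] for i in ids if i in self._by_id]


docs = [
    Document("first", id="1"),
    Document("second", metadata={"id": "2"}),  # ID hidden in metadata is ignored
]
found = DictStore(docs).get_documents_by_ids(["1", "2"])
print([d.page_content for d in found])  # ['first']
```

The second document shows the problem being described: an ID buried in metadata is invisible to a lookup keyed on the `id` attribute.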
closing as stale
+1 on this PR. I think it should get reopened.