langchain [community] Added SentenceWindowRetriever

[x] PR title: "package: description"
- Added appropriate title

Description

Adds a new type of retriever called Sentence Window Retriever
Also adds a modification to TextSplitter to help implement the retriever
[x] Add tests and docs: No tests added yet. Let me know if any specific tests are required. I plan to add documentation on how to run the retriever.
[x] Lint and test: Tests ran successfully
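
For readers unfamiliar with the technique, here is a minimal sketch of the general sentence-window idea. It is not the code in this PR. Each sentence is matched on its own, and a hit is returned together with its neighbouring sentences. The helpers `split_sentences`, `window_for`, and `retrieve` are hypothetical names, and the keyword match is only a stand-in for an embedding similarity search.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def window_for(sentences: List[str], hit_index: int, window_size: int = 1) -> str:
    # Return the matched sentence plus `window_size` neighbours on each side,
    # clipped at the start and end of the document.
    start = max(0, hit_index - window_size)
    end = min(len(sentences), hit_index + window_size + 1)
    return " ".join(sentences[start:end])


def retrieve(text: str, query: str, window_size: int = 1) -> List[str]:
    # Assumption: a case-insensitive keyword match replaces vector search here.
    sentences = split_sentences(text)
    hits = [i for i, s in enumerate(sentences) if query.lower() in s.lower()]
    return [window_for(sentences, i, window_size) for i in hits]
```

The window size trades precision for context: matching happens on single sentences, while the text handed downstream includes the surrounding ones.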

May 03 '24 16:05 rsk2327

@eyurtsev So I have modified the implementation of SWR to be datastore agnostic.

I went with the approach of defining a get_document_by_ids function on the vectorstores, which provides a common method for querying a vectorstore by IDs.

One of the issues with the implementation of SWR is that the search functionality is not standardized across vectorstores.

Chroma: has similarity_search_by_vector and similarity_search_by_vector_with_score
Pinecone: does not have similarity_search_by_vector, only similarity_search_by_vector_with_score
Milvus: has similarity_search_by_vector, but instead of similarity_search_by_vector_with_score it has similarity_search_with_score_by_vector, which is probably a typo

I can work around the different search method names for now, but it would be helpful for Pinecone to also have a similarity_search_by_vector implementation and for Milvus to use the same standardized function names as the other vectorstores.
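
A workaround of this kind could look roughly like the sketch below. It is not the code in the PR. The helper `search_by_vector` is a hypothetical name, and it assumes every method accepts the embedding as the first argument plus a `k` keyword, and that the scored variants return (document, score) pairs.

```python
from typing import Any, List, Sequence

# Method names taken from the comparison above. The first returns documents;
# the other two are assumed to return (document, score) pairs.
_PLAIN = "similarity_search_by_vector"
_SCORED = (
    "similarity_search_by_vector_with_score",
    "similarity_search_with_score_by_vector",
)


def search_by_vector(store: Any, embedding: Sequence[float], k: int = 4) -> List[Any]:
    """Return documents only, using whichever method the store exposes."""
    plain = getattr(store, _PLAIN, None)
    if callable(plain):
        return list(plain(embedding, k=k))
    for name in _SCORED:
        scored = getattr(store, name, None)
        if callable(scored):
            # Drop the score so every store yields the same shape.
            return [doc for doc, _score in scored(embedding, k=k)]
    raise AttributeError(f"{type(store).__name__} has no by-vector search method")
```

Dispatching on attribute names keeps the retriever free of store-specific branches, at the cost of silently depending on each store's method signature.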

May 07 '24 18:05 rsk2327

@eyurtsev @efriis Could I get a review on this?

Let me know if I need to add any additional details to explain the changes made.

I did have a question on how to include langchain_pinecone as a dependency within community. The unit tests throw an error when I import PineconeVectorStore from langchain_pinecone.
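
One common way to keep an optional package out of a module's import path is to resolve it only when it is needed. The sketch below illustrates that pattern and is not the maintainers' answer to the question. `import_optional` and `get_pinecone_store_class` are hypothetical helper names.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import a module at call time and give an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name}. "
            f"Install it with `pip install {pip_name}`."
        ) from exc


def get_pinecone_store_class() -> type:
    # Resolved only when called, so importing this file never needs the package.
    module = import_optional("langchain_pinecone", "langchain-pinecone")
    return module.PineconeVectorStore
```

With this shape, unit tests that never call `get_pinecone_store_class` can run without the optional package installed.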

May 09 '24 16:05 rsk2327

@eyurtsev @efriis Can I get a review on this?

May 13 '24 18:05 rsk2327

@eyurtsev @efriis @hwchase17 @baskaryan Can I get a review on this?

May 15 '24 19:05 rsk2327

@rsk2327 You'll need to stand by for ~1 month. We'll be focusing on the vectorstore abstractions after the 0.2 release.

The main things so far:

Addition of get_documents_by_ids to the base abstraction
Potentially, the addition of an id attribute on a document (so the ID is not stored arbitrarily in the metadata).
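
To make the two points above concrete, here is a rough sketch of what an ID lookup on a base abstraction and a first-class document id might look like. It is illustrative only and not the design that was adopted. `Doc`, `StoreWithIdLookup`, and `InMemoryStore` are hypothetical names, and skipping unknown ids is an assumed behaviour.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Doc:
    # Hypothetical stand-in for a document whose id is a real attribute
    # instead of an entry in `metadata`.
    page_content: str
    id: Optional[str] = None
    metadata: Dict[str, object] = field(default_factory=dict)


class StoreWithIdLookup(ABC):
    @abstractmethod
    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Doc]:
        """Return stored documents for the given ids, skipping unknown ids."""


class InMemoryStore(StoreWithIdLookup):
    def __init__(self) -> None:
        self._docs: Dict[str, Doc] = {}

    def add(self, docs: Sequence[Doc]) -> None:
        for doc in docs:
            if doc.id is None:
                raise ValueError("every document needs an id")
            self._docs[doc.id] = doc

    def get_documents_by_ids(self, ids: Sequence[str]) -> List[Doc]:
        # Results follow the order of the requested ids.
        return [self._docs[i] for i in ids if i in self._docs]
```

A retriever written against the abstract method can then fetch neighbouring chunks by id without knowing which store is behind it.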

Text splitters:

Determine what kind of metadata, if any, we should propagate in the text splitter for provenance purposes.
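
As an illustration of what provenance metadata could mean here, the sketch below attaches the parent document, chunk position, and character offset to every chunk. The function `split_with_provenance` and the field names `source_id`, `chunk_index`, and `start_index` are assumptions made for this example, not a decided schema.

```python
from typing import Dict, List, Tuple


def split_with_provenance(
    text: str, source_id: str, chunk_size: int = 20
) -> List[Tuple[str, Dict[str, object]]]:
    """Fixed-size character split that records where each chunk came from."""
    chunks: List[Tuple[str, Dict[str, object]]] = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        piece = text[start : start + chunk_size]
        metadata: Dict[str, object] = {
            "source_id": source_id,  # which parent document
            "chunk_index": index,    # position among sibling chunks
            "start_index": start,    # character offset in the parent
        }
        chunks.append((piece, metadata))
    return chunks
```

With a chunk index and a source id on every chunk, a retriever can locate the neighbours of a hit by arithmetic on the index alone.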

I'll leave some comments in the PR itself as well

May 16 '24 21:05 eyurtsev

closing as stale

Aug 23 '24 18:08 efriis

+1 on this PR. I think it should be reopened.

Sep 07 '24 21:09 icaroryan
