[community] Added SentenceWindowRetriever

Open rsk2327 opened this issue 9 months ago • 7 comments

  • [x] PR title: "package: description"
    • Added appropriate title

Description

  • Adds a new type of retriever called Sentence Window Retriever

  • Also adds a modification to TextSplitter to help implement the retriever

  • [x] Add tests and docs: No tests added yet. Let me know if any specific tests are required. I plan to add documentation on how to run the retriever.

  • [x] Lint and test: Tests ran successfully
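The description does not spell out how the retriever works. As the name is commonly understood, a sentence window retriever matches a query against individual sentences and returns the best match together with its neighbouring sentences. A minimal sketch of that idea follows, assuming a naive regex sentence splitter and a word-overlap score standing in for vector similarity. The function names are hypothetical and are not taken from this PR.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(query: str, sentence: str) -> int:
    # Stand-in for embedding similarity: count shared lowercase words.
    query_words = set(re.findall(r"\w+", query.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    return len(query_words & sentence_words)


def sentence_window_retrieve(text: str, query: str, window: int = 1) -> str:
    """Return the best-matching sentence plus `window` sentences on each side."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    # Index of the sentence that best matches the query (first one wins ties).
    best = max(range(len(sentences)), key=lambda i: overlap_score(query, sentences[i]))
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])


TEXT = "Cats sleep a lot. Dogs bark loudly. Birds sing at dawn. Fish swim in rivers."
print(sentence_window_retrieve(TEXT, "dogs bark", window=1))
```

A real implementation would presumably embed each sentence and look up its neighbours in a datastore; the window arithmetic stays the same.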

rsk2327 avatar May 03 '24 16:05 rsk2327

Vercel deployment status:

Name: langchain
Status: ✅ Ready
Updated (UTC): May 15, 2024 10:57am

vercel[bot] avatar May 03 '24 16:05 vercel[bot]

@eyurtsev I have modified the implementation of the Sentence Window Retriever (SWR) to be datastore-agnostic.

I went with the approach of defining a get_document_by_ids function on the vectorstores, which provides a common method for querying a vectorstore by ID.

One of the issues with the implementation of SWR is that the search functionality is not standardized across vectorstores.

  • Chroma: has similarity_search_by_vector and similarity_search_by_vector_with_score
  • Pinecone: has only similarity_search_by_vector_with_score, not similarity_search_by_vector
  • Milvus: has similarity_search_by_vector, but instead of similarity_search_by_vector_with_score it has similarity_search_with_score_by_vector, which is probably a typo

I can work around the different search method names for now, but it would help if Pinecone also had a similarity_search_by_vector implementation and if Milvus used the same standardized function names as the other vectorstores.
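One way such a workaround could look is a small adapter that probes for whichever by-vector method a store exposes. This is only a sketch: the three method names come from the comment above, while the `(embedding, k=...)` call shape, the `(document, score)` return pairs, and the two stand-in store classes are assumptions made for illustration and are not the real integrations.

```python
from typing import Any, Sequence

# Method names reported in this thread.
UNSCORED_SEARCH_NAME = "similarity_search_by_vector"
SCORED_SEARCH_NAMES = (
    "similarity_search_by_vector_with_score",
    "similarity_search_with_score_by_vector",
)


def search_by_vector(store: Any, embedding: Sequence[float], k: int = 4) -> list:
    """Return documents (without scores) via whichever method the store offers."""
    unscored = getattr(store, UNSCORED_SEARCH_NAME, None)
    if callable(unscored):
        return list(unscored(embedding, k=k))
    for name in SCORED_SEARCH_NAMES:
        scored = getattr(store, name, None)
        if callable(scored):
            # Assumption: scored variants return (document, score) pairs.
            return [doc for doc, _score in scored(embedding, k=k)]
    raise AttributeError(
        f"{type(store).__name__} has no known by-vector search method"
    )


class FakeScoredOnlyStore:
    """Hypothetical stand-in for a store that only has the scored variant."""

    def similarity_search_by_vector_with_score(self, embedding, k=4):
        return [("doc-a", 0.9), ("doc-b", 0.5)][:k]


class FakeUnscoredStore:
    """Hypothetical stand-in for a store that has the unscored variant."""

    def similarity_search_by_vector(self, embedding, k=4):
        return ["doc-x", "doc-y"][:k]


print(search_by_vector(FakeScoredOnlyStore(), [0.1, 0.2], k=1))
print(search_by_vector(FakeUnscoredStore(), [0.1, 0.2], k=2))
```

Probing by name keeps the retriever independent of any one store, at the cost of hiding the inconsistency rather than fixing it upstream.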

rsk2327 avatar May 07 '24 18:05 rsk2327

@eyurtsev @efriis Could I get a review on this?

Let me know if I need to add any additional details to explain the changes made.

I did have a question on how to include langchain_pinecone as a dependency within community. The unit tests throw an error when I import PineconeVectorStore from langchain_pinecone.

rsk2327 avatar May 09 '24 16:05 rsk2327

@eyurtsev @efriis Can I get a review on this?

rsk2327 avatar May 13 '24 18:05 rsk2327

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

vercel[bot] avatar May 15 '24 10:05 vercel[bot]

@eyurtsev @efriis @hwchase17 @baskaryan Can I get a review on this?

rsk2327 avatar May 15 '24 19:05 rsk2327

@rsk2327 You'll need to stand by for ~1 month. We'll be focusing on the vectorstore abstractions after the 0.2 release.

The main things so far:

  • Addition of get_documents_by_ids to the base abstraction
  • Potentially, the addition of an id attribute on a document (so the ID is not randomly in the metadata).
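The two points above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned design: `Doc`, `BaseStore`, and `InMemoryStore` are hypothetical names, and skipping unknown IDs is one possible choice among several.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass
class Doc:
    """Hypothetical document type with a first-class id attribute."""

    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None  # lives on the document, not inside metadata


class BaseStore(ABC):
    """Hypothetical base abstraction exposing lookup by ID."""

    @abstractmethod
    def get_documents_by_ids(self, ids: Sequence[str]) -> list[Doc]:
        """Return the documents matching `ids`, in request order."""


class InMemoryStore(BaseStore):
    def __init__(self) -> None:
        self._docs: dict[str, Doc] = {}

    def add(self, doc: Doc) -> None:
        if doc.id is None:
            raise ValueError("document needs an id")
        self._docs[doc.id] = doc

    def get_documents_by_ids(self, ids: Sequence[str]) -> list[Doc]:
        # Assumption: unknown ids are skipped rather than raising.
        return [self._docs[i] for i in ids if i in self._docs]


store = InMemoryStore()
store.add(Doc("first sentence", id="s1"))
store.add(Doc("second sentence", id="s2"))
found = store.get_documents_by_ids(["s2", "missing", "s1"])
print([d.page_content for d in found])
```

With IDs that encode position, a retriever could fetch the neighbours of a matched sentence through this one method, regardless of the backing store.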

Text splitters:

  • Determine what kind of metadata, if any, we should be propagating in the text splitter for provenance purposes

I'll leave some comments in the PR itself as well

eyurtsev avatar May 16 '24 21:05 eyurtsev

closing as stale

efriis avatar Aug 23 '24 18:08 efriis

+1 on this PR. I think it should be reopened.

icaroryan avatar Sep 07 '24 21:09 icaroryan