DocArray as a Retriever

Open jupyterjazz opened this issue 1 year ago • 4 comments

DocArray is an open-source tool for managing multi-modal data. It lets you store and search your data using a range of document index backends. This PR introduces DocArrayRetriever, which works with any available backend and serves as a retriever for LangChain apps.

I also added two notebooks:
- DocArray Backends: an introduction to all five currently supported backends, covering how to initialize them, index data, and use them as a retriever
- DocArray Usage: a showcase of the additional search parameters you can pass to create versatile retrievers

Example:

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import DocArrayRetriever


# define document schema
class MyDoc(BaseDoc):
    description: str
    description_embedding: NdArray[1536]


embeddings = OpenAIEmbeddings()
# create documents
descriptions = ["description 1", "description 2"]
desc_embeddings = embeddings.embed_documents(texts=descriptions)
docs = DocList[MyDoc](
    [
        MyDoc(description=desc, description_embedding=embedding)
        for desc, embedding in zip(descriptions, desc_embeddings)
    ]
)

# initialize document index with data
db = InMemoryExactNNIndex[MyDoc](docs)

# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
)

# find the relevant documents (the call returns a list)
relevant_docs = retriever.get_relevant_documents("action movies")
print(relevant_docs)
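
For readers without an OpenAI key or a DocArray install, the retrieval step above can be approximated in plain Python. This is only a conceptual sketch of exact nearest-neighbour search, not DocArray's or LangChain's implementation: `ToyDoc`, `toy_embed`, and `retrieve` are hypothetical names, the keyword-count embedding is a stand-in for a real embedding model, and ranking by cosine similarity is an assumption about the metric.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ToyDoc:
    # Hypothetical stand-in for the MyDoc schema in the example above.
    description: str
    description_embedding: List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of a few keywords, not a real model."""
    vocabulary = ["action", "romance", "space"]
    words = text.lower().split()
    return [float(words.count(word)) for word in vocabulary]


def retrieve(
    query: str,
    docs: Sequence[ToyDoc],
    embed: Callable[[str], List[float]],
    top_k: int = 1,
) -> List[ToyDoc]:
    """Embed the query, score every document exhaustively, return the top_k best."""
    query_vec = embed(query)
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query_vec, d.description_embedding),
        reverse=True,
    )
    return ranked[:top_k]


descriptions = ["an action movie set in space", "a quiet romance drama"]
toy_docs = [ToyDoc(d, toy_embed(d)) for d in descriptions]
results = retrieve("action movies", toy_docs, toy_embed, top_k=1)
print(results[0].description)
```

The sketch mirrors the roles of `search_field` (the vector compared against the query) and `content_field` (the text returned to the caller); a real backend replaces the exhaustive scan with its own index.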

Who can review?

@dev2049

jupyterjazz avatar Jun 12 '23 07:06 jupyterjazz

It would be nice to add Jina's annlite as a vector store option as well.

jpzhangvincent avatar Jun 12 '23 18:06 jpzhangvincent

Hey @jpzhangvincent, annlite is not yet compatible with the new DocArray version, but we might add support in the future. Thanks for the suggestion!

jupyterjazz avatar Jun 13 '23 11:06 jupyterjazz

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| langchain | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 16, 2023 7:45pm |

vercel[bot] avatar Jun 16 '23 08:06 vercel[bot]

@hwchase17 @vowelparrot @dev2049

I'm not sure why Vercel is failing; I think it is failing for all other recent PRs too.

jupyterjazz avatar Jun 16 '23 08:06 jupyterjazz

@jupyterjazz is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Jun 16 '23 19:06 vercel[bot]

hey @hwchase17 @vowelparrot @dev2049

I think Vercel needs approval from your side, and CI should be green afterwards. The comment about separate notebooks has been addressed!

jupyterjazz avatar Jun 16 '23 20:06 jupyterjazz