Seeking a solution for combined retrievers, or retrieving from multiple vectorstores with sources, to maintain separate namespaces.
Maintaining separate namespaces in your vector DB seems helpful, and perhaps necessary, for making sure an LLM can answer compare/contrast questions that reference texts separated by date, like "03/2023" vs. "03/2022", without getting confused.
To that end, I need to retrieve from multiple vectorstores, yet I can't find a straightforward solution.
I have tried a few things:
- Extending `ConversationalRetrievalChain` to accept a list of retrievers:

```python
from typing import Any, Dict, List

from langchain.chains import ConversationalRetrievalChain
from langchain.schema import BaseRetriever, Document


class MultiRetrieverConversationalRetrievalChain(ConversationalRetrievalChain):
    """Chain for chatting with multiple indexes."""

    retrievers: List[BaseRetriever]
    """Indexes to connect to."""

    def _get_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
        all_docs = []
        for retriever in self.retrievers:
            docs = retriever.get_relevant_documents(question)
            all_docs.extend(docs)
        return self._reduce_tokens_below_limit(all_docs)

    async def _aget_docs(self, question: str, inputs: Dict[str, Any]) -> List[Document]:
        all_docs = []
        for retriever in self.retrievers:
            docs = await retriever.aget_relevant_documents(question)
            all_docs.extend(docs)
        return self._reduce_tokens_below_limit(all_docs)
```
This became a bit unwieldy as it ran into validation errors with Pydantic, but I don't see why a more competent dev wouldn't be able to manage this.
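For what it's worth, one plausible source of those validation errors is the parent class's own required `retriever` field. Below is a minimal, untested sketch of working around it, assuming that field exists and is required (the `doc_chain` and `question_generator` objects are hypothetical placeholders you'd build the usual way):

```python
# Hedged sketch: satisfy the parent's required `retriever` field so Pydantic
# validation passes; the overridden _get_docs iterates over `retrievers` instead.
retrievers = [march_store.as_retriever(), feb_store.as_retriever()]  # hypothetical stores

chain = MultiRetrieverConversationalRetrievalChain(
    retriever=retrievers[0],                # required by the parent, ignored by the override
    retrievers=retrievers,
    combine_docs_chain=doc_chain,           # hypothetical: e.g. built with load_qa_chain(...)
    question_generator=question_generator,  # hypothetical: LLMChain that condenses the question
)
```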
- I tried combining retrievers (suggestion from kapa.ai):
```python
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone

embeddings = OpenAIEmbeddings()
march_documents = Pinecone.from_existing_index(index_name="langchain2", embedding=embeddings, namespace="March 2023")
feb_documents = Pinecone.from_existing_index(index_name="langchain2", embedding=embeddings, namespace="February 2023")
combined_docs = feb_documents + march_documents  # fails: vectorstores don't support "+"

# Create a RetrievalQAWithSourcesChain using the combined retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), chain_type="stuff", retriever=combined_docs)
# does not work when passed through .as_retriever() either
```
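A stopgap that sidesteps the retriever abstraction entirely would be to merge `similarity_search` results by hand and feed them straight into a QA-with-sources chain. A rough sketch, reusing the per-namespace stores above (the output key may differ by version):

```python
from langchain.chains.qa_with_sources.loading import load_qa_with_sources_chain
from langchain.llms import OpenAI

query = "Compare XYZ between February 2023 and March 2023"

# Query each namespace separately, then concatenate the documents.
docs = feb_documents.similarity_search(query, k=4) + march_documents.similarity_search(query, k=4)

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
result = chain({"input_documents": docs, "question": query})
print(result["output_text"])  # answer plus sources, per the stuff-chain prompt
```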
- Tried using an Agent with `VectorStoreRouterToolkit`, which seems to be built for this kind of task, yet it provides terrible answers for reasons I still need to dive into. "Terrible" meaning it does not listen when I instruct it with things like "Do not summarize, list everything about XYZ...". Further, I need/prefer the results from `similarity_search`, returning `top_k` for my use case, which the agent doesn't seem to provide (see the sketch after this list).
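On the `top_k` point specifically: a retriever can be pinned to a fixed number of `similarity_search` results via `search_kwargs`, which at least reproduces that behavior outside the agent. A minimal sketch (`vectorstore` stands in for any of the per-namespace stores):

```python
# Ask the underlying vectorstore for exactly k similarity-search results.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("list everything about XYZ")
```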
Is there a workaround for my problem? How do I maintain separation of namespaces, so that I can have the LLM answer questions about separate documents, and also be able to provide the sources for those separate documents, all from within a single chain?
Hi, did you solve the problem? I tried solution 3 and the router doesn't seem to stop even when it gets the right answer from the first vectorstore. It continues running like below:
```
> Entering new AgentExecutor chain...
This is a philosophy question
Action: philosophy
Action Input: what is the veil of ignorance
Observation: The Veil of Ignorance is a way of modeling impartiality. It is one way to model impartiality, but there are other ways. It is a condition in which everyone is ignorant of their position in society or their personal characteristics, and therefore, they make decisions behind the veil of ignorance without knowing the outcomes of the decisions.<|im_end|>
Thought: I need more information about the history of the concept of the veil of ignorance
Action: external data
Action Input: history of the veil of ignorance
Observation: I don't know.

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

The fact that these models can memorize and plagiarize text (Jin et al., 2020; Li et al., 2021) raises concerns about the potential legal risk of their deployment, especially given the likely exponential growth of these types of models in the near future (Shi et al.,

Question: what can models do?
Helpful Answer: memorize and plagiarize text

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

to provide a formalism for the kinds of reasoning that people do, including reasoning about other people's beliefs, desires and intentions (Goldman, 1974; Lewis, 1969; Stalnaker, 1984). Game theory is also used in economics, political science, and other social sciences to study collective decision making (Rapoport, 1960; von Neumann & Morgenstern, 1944). Game theory

Thought: This is a philosophy question
Question: What is the main purpose of game theory?
Action: philosophy
...
return this.context; }
// This method takes in a user's message as an input and returns a response
Thought:
```
> Hi, did you solve the problem?
I've decided to go with separated vectorstores, passing similarity results over as context to the prompt. Also, FAISS has built-in methods for combining multiple vectorstores if needed, which is what I'm going with (sketch below). The new updates to the agents seem like they would be perfect for the task: from a cursory look, it seems you'd create several stores, add them as options to the agent's tools, and let it do its thing.
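The FAISS combining I mean is `merge_from`; a minimal sketch (the texts here are stand-ins):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
feb_db = FAISS.from_texts(["February 2023 report text..."], embeddings)
march_db = FAISS.from_texts(["March 2023 report text..."], embeddings)

# Fold the March index into the February one in place; both must share embeddings.
feb_db.merge_from(march_db)
docs = feb_db.similarity_search("compare the two months", k=4)
```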
I did this (I swapped the return type out for `Iterable`):

```python
from typing import AsyncIterator, Iterable, List

from langchain.schema import BaseRetriever, Document


class CombineRetriever(BaseRetriever):
    def __init__(self, retrievers: List[BaseRetriever]):
        self.retrievers = retrievers

    def get_relevant_documents(self, query: str) -> Iterable[Document]:
        for retriever in self.retrievers:
            for doc in retriever.get_relevant_documents(query):
                yield doc

    async def aget_relevant_documents(self, query: str) -> AsyncIterator[Document]:
        for retriever in self.retrievers:
            # Await the async variant (the sync method isn't awaitable).
            for doc in await retriever.aget_relevant_documents(query):
                yield doc
```
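Usage would then look something like this; note that chains expecting a `List[Document]` may need the generator materialized with `list(...)`:

```python
# Hypothetical per-namespace stores standing in for the ones above.
combined = CombineRetriever([feb_db.as_retriever(), march_db.as_retriever()])
docs = list(combined.get_relevant_documents("compare February and March"))
```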
@simonfromla @cancan101 Hi guys! I just submitted an idea for a merger retriever that might help with this use case. Please take a look or give it a try and let me know: https://github.com/hwchase17/langchain/pull/5798
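If that PR lands as proposed, usage would presumably be as simple as the sketch below (assuming the class ends up exposed as `MergerRetriever`):

```python
from langchain.retrievers import MergerRetriever

# Each namespace keeps its own store; the merger combines their results.
merger = MergerRetriever(retrievers=[feb_db.as_retriever(), march_db.as_retriever()])
docs = merger.get_relevant_documents("compare February and March")
```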
#5798 ought to do it. Closing.
> I did this (I swapped the return type out for `Iterable`): …
I just tried this, and unfortunately the similarity scores are per store and depend on the data volume, so the small store gets matched quicker and the results are not as expected ;(
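One way to see the imbalance is to pull the raw scores out per store. A quick diagnostic sketch (`small_db`/`large_db` are hypothetical; note FAISS returns L2 distances where lower is better, while Pinecone's score direction depends on the index metric, so scores are only comparable across stores with matching embeddings and metrics):

```python
query = "compare February and March"

# Print raw (doc, score) pairs per store to inspect the score distributions.
for name, store in [("small_db", small_db), ("large_db", large_db)]:
    for doc, score in store.similarity_search_with_score(query, k=3):
        print(f"{name}: score={score:.3f} | {doc.page_content[:60]}")
```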
> I just tried this and unfortunately the similarity score is per store and dependent on the data volume…
You are right; in my particular use case I needed each result weighted equally, independent of its store. I think you can "play around a little" with this, using a "k" proportional to the total number of elements in each retriever (far from perfect, I know). Another workaround would be to run a document compressor over the merged results, for example: https://github.com/hwchase17/langchain/blob/master/langchain/retrievers/document_compressors/cohere_rerank.py

Another approach would be to start implementing different merging mechanisms, like "search_type" but as a "merge_type", so we can select different merge logic. I don't have the time to start working on this immediately (I'm working on another document compressor idea), but if you already have some approach to try, I'll be happy to give you a hand.

P.S. I'm not an official LangChain dev myself, just a regular contributor.
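For the compressor route, a sketch of wiring the Cohere reranker over a merged retriever (this assumes a combined retriever like the ones above, the `cohere` package installed, and `COHERE_API_KEY` set in the environment):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Rerank merged results with Cohere so ordering no longer depends on
# incomparable per-store similarity scores.
compressor = CohereRerank(top_n=4)
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=merger,  # e.g. the MergerRetriever / CombineRetriever from above
)
docs = reranking_retriever.get_relevant_documents("compare February and March")
```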