
MMR Search in Chroma not working, typo suspected

Open jppaolim opened this issue 1 year ago • 2 comments

System Info

Langchain v0.0.171 Mac OS

Who can help?

@jeffchuber

Information

  • [ ] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

If I initialise a Chroma database and then a retriever:

db = Chroma.from_documents(
    texts,
    embeddings_function(),
    metadatas=[{"source": str(i)} for i in range(len(texts))],
    persist_directory=PERSIST_DIRECTORY,
)

querybase = db.as_retriever(search_type="mmr", search_kwargs={"k":3, "lambda_mult":1})

the retrieved documents are then identical whether I pass 0.1 or 0.9 as the lambda_mult parameter.

Expected behavior

I expect different documents to be retrieved for different values of lambda_mult.
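To illustrate the expected behavior, here is a minimal pure-Python sketch of MMR selection with made-up similarity scores (this is not the actual LangChain implementation). With two near-duplicate relevant documents and one dissimilar document, a high lambda_mult should favor relevance while a low one should favor diversity, so the two settings should pick different documents:

```python
def mmr_select(query_sims, doc_sims, k=2, lambda_mult=0.5):
    """Pick k document indices by maximal marginal relevance.

    query_sims[i]  : similarity of doc i to the query
    doc_sims[i][j] : similarity between docs i and j
    lambda_mult=1  : pure relevance; lambda_mult=0 : pure diversity
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize similarity to anything already selected.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates and both relevant; doc 2 is different.
query_sims = [0.9, 0.85, 0.4]
doc_sims = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

high_lambda = mmr_select(query_sims, doc_sims, k=2, lambda_mult=0.9)  # favors relevance
low_lambda = mmr_select(query_sims, doc_sims, k=2, lambda_mult=0.1)   # favors diversity
```

If lambda_mult were actually reaching the algorithm, the two calls would return different selections; identical results regardless of the value suggest the parameter is being dropped.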

Digging into the code, I think there is a typo in langchain.vectorstores.chroma: in the last call of max_marginal_relevance_search below, the keyword should be lambda_mult, not lambda_mul:

As this is my first time contributing, I'm not sure how to properly suggest or test a fix :)

def max_marginal_relevance_search(
        self,
        query: str,
        k: int = 4,
        fetch_k: int = 20,
        lambda_mult: float = 0.5,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Return docs selected using the maximal marginal relevance.
        Maximal marginal relevance optimizes for similarity to query AND diversity
        among selected documents.
        Args:
            query: Text to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            fetch_k: Number of Documents to fetch to pass to MMR algorithm.
            lambda_mult: Number between 0 and 1 that determines the degree
                        of diversity among the results with 0 corresponding
                        to maximum diversity and 1 to minimum diversity.
                        Defaults to 0.5.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
        Returns:
            List of Documents selected by maximal marginal relevance.
        """
        if self._embedding_function is None:
            raise ValueError(
                "For MMR search, you must specify an embedding function on" "creation."
            )

        embedding = self._embedding_function.embed_query(query)
        docs = self.max_marginal_relevance_search_by_vector(
            embedding, k, fetch_k, lambda_mul=lambda_mult, filter=filter  # <-- suspected typo: should be lambda_mult=
        )
        return docs
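The misspelled keyword would fail silently here because the called method accepts `**kwargs`: the typo lands in the catch-all instead of raising a TypeError, so lambda_mult keeps its default of 0.5 no matter what the caller passes. A stand-in function (not the real LangChain code) demonstrates the mechanism:

```python
def search_by_vector(embedding, k=4, fetch_k=20, lambda_mult=0.5, **kwargs):
    # Unrecognized keywords are absorbed by **kwargs rather than raising
    # a TypeError, so a misspelled "lambda_mul" never reaches lambda_mult.
    return lambda_mult, kwargs

# The caller's typo: lambda_mul instead of lambda_mult.
effective, swallowed = search_by_vector([0.0], lambda_mul=0.9)
# effective is still the default 0.5; the 0.9 sits unused in swallowed.
```

This matches the reported symptom exactly: any value passed for lambda_mult is silently ignored, so 0.1 and 0.9 produce identical results.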

jppaolim avatar May 17 '23 14:05 jppaolim

@hwchase17 🤔 any thoughts here? I didn't write this but happy to help.

jeffchuber avatar May 17 '23 16:05 jeffchuber

Chroma.from_documents() takes Document objects as a parameter, not texts and metadatas separately. For your use case you have to use Chroma.from_texts(), because you are providing texts and metadatas separately.
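With separate text and metadata lists, the call described above would look roughly like this (a sketch assuming the signature in this LangChain version; the Chroma call itself is shown commented out since it needs a running embedding function):

```python
# Parallel lists: one metadata dict per raw text string.
texts = ["first chunk", "second chunk"]
metadatas = [{"source": str(i)} for i in range(len(texts))]

# db = Chroma.from_texts(
#     texts,
#     embeddings_function(),
#     metadatas=metadatas,
#     persist_directory=PERSIST_DIRECTORY,
# )
```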

Satyam-79 avatar May 21 '23 03:05 Satyam-79

In the end, I decided to use something else to work around it ... but I still think the last line of this Python code contains a typo :)

jppaolim avatar Jun 11 '23 22:06 jppaolim