Add BM25 Full Text Search algorithm for hybrid search ability to Chromadb

Open HOAZ2 opened this issue 1 year ago • 1 comments

Describe the problem

Please add full-text search with an algorithm like BM25 to support hybrid search, especially for RAG. Many advanced RAG solutions now depend on hybrid search, and ChromaDB is one of the most widely used vector databases for semantic search applications.

Describe the proposed solution

It would be great if the Chroma API supported/exposed a full-text search feature.

Alternatives considered

No response

Importance

i cannot use Chroma without it

Additional Information

No response

HOAZ2 avatar Jan 30 '24 16:01 HOAZ2

Chroma already supports full text search using the where_document feature: https://docs.trychroma.com/guides#filtering-by-document-contents

This should be much better named so people can find it more easily. We are looking into solutions for bm25 and similar.

atroyn avatar May 15 '24 23:05 atroyn

Really looking forward to Chroma supporting hybrid search!

nnnnwinder avatar Jun 21 '24 07:06 nnnnwinder

> Chroma already supports full text search using the where_document feature: https://docs.trychroma.com/guides#filtering-by-document-contents
>
> This should be much better named so people can find it more easily. We are looking into solutions for bm25 and similar.

where_document filtering is NOT full text search! I would also love to see support for BM25! :)
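
To make the distinction concrete, here is a minimal pure-Python sketch. It contrasts a plain substring check (standing in for a `$contains`-style document filter) with Okapi BM25 scoring. The function names, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample documents are all assumptions for illustration; this is not Chroma's implementation.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (real systems add stemming, stop words, etc.)."""
    return text.lower().split()


def contains_filter(docs: List[str], needle: str) -> List[int]:
    """Substring filter: returns the indices of matching documents, unranked."""
    return [i for i, doc in enumerate(docs) if needle in doc]


def bm25_scores(docs: List[str], query: str, k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "chroma is a vector database",
    "bm25 ranks documents by term frequency and rarity",
    "hybrid search combines bm25 with vector search",
]
print(contains_filter(docs, "bm25"))     # which documents match, with no ordering
print(bm25_scores(docs, "bm25 search"))  # one relevance score per document
```

The filter only answers whether a document matches, while BM25 produces a score per document that can be sorted and combined with vector similarity results.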

erikmargaronis avatar Aug 20 '24 11:08 erikmargaronis

closing in favor of https://github.com/chroma-core/chroma/issues/1330 - thanks for requesting this!

jeffchuber avatar Sep 16 '24 02:09 jeffchuber

The best solution I found: we moved everything to Qdrant.

derevyan avatar Oct 15 '24 17:10 derevyan

> closing in favor of #1330 - thanks for requesting this!

Is this feature updated?

zzw1123 avatar Dec 27 '24 08:12 zzw1123

I proposed a solution for those who are using ChromaDB with Langchain. I will repeat here the answer I gave in issue #1330 hoping that it will help you:

For those who have integrated the ChromaDB client with the Langchain framework, I used the following approach to implement Hybrid search (Vector Search + BM25Retriever):

from typing import List

import chromadb
from chromadb.config import Settings
from langchain.retrievers import EnsembleRetriever
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langgraph.graph import START, StateGraph  # used when wiring the graph (not shown here)
from typing_extensions import TypedDict


# Assuming that you have instantiated a Chroma client and integrated it into LangChain (below is an example)
persistent_client = chromadb.PersistentClient(path="./test", settings=Settings(allow_reset=True))
collection = persistent_client.get_or_create_collection(
    name="example",
    metadata={
        "hnsw:space": "cosine",
        # you can add other HNSW parameters if you want
    },
)

chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)


class VectorStore:
    """Hypothetical wrapper class; the original snippet showed only the method below."""

    def __init__(self, chroma: Chroma):
        self.chroma = chroma

    def hybrid_search(self, query: str, k: int = 5) -> List[Document]:
        """Perform a hybrid search (similarity search + BM25Retriever) in the collection."""
        # Get all raw documents from ChromaDB
        raw_docs = self.chroma.get(include=["documents", "metadatas"])
        # Convert them into Document objects (fall back to {} when a document has no metadata)
        documents = [
            Document(page_content=doc, metadata=meta or {})
            for doc, meta in zip(raw_docs["documents"], raw_docs["metadatas"])
        ]
        # Create a BM25Retriever from the documents
        bm25_retriever = BM25Retriever.from_documents(documents=documents, k=k)
        # Create a vector search retriever from the Chroma instance
        similarity_search_retriever = self.chroma.as_retriever(
            search_type="similarity",
            search_kwargs={"k": k},
        )
        # Combine the retrievers using LangChain's EnsembleRetriever
        ensemble_retriever = EnsembleRetriever(
            retrievers=[similarity_search_retriever, bm25_retriever],
            weights=[0.5, 0.5],
        )
        # Retrieve the fused results for the query
        # (if needed, use ainvoke(query) to retrieve the documents asynchronously)
        return ensemble_retriever.invoke(query)


vector_store = VectorStore(chroma)


# Call the hybrid_search() method
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# --- Define Graph Nodes (retrieve, generate, etc.) ---
def retrieve(state: State) -> dict:
    retrieved_docs = vector_store.hybrid_search(state["question"], 3)
    return {"context": retrieved_docs}


Note: The above code is only a fragment covering the retrieval component; it still has to be integrated into the application structure and RAG flow.
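
For readers wondering how the two result lists get merged: as far as I understand, EnsembleRetriever combines them with a weighted Reciprocal Rank Fusion. The sketch below illustrates that idea in plain Python. The function name `weighted_rrf`, the constant `c=60`, and the sample document ids are assumptions for illustration, not LangChain's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[List[str]], weights: Sequence[float], c: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks in each list
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["d1", "d2", "d3"]  # hypothetical ids from the similarity retriever
bm25_hits = ["d3", "d1", "d4"]    # hypothetical ids from the BM25 retriever
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Note that the fused list is the union of both inputs, so it can contain more than k documents; slice it if exactly k results are needed.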

I hope this example helps you implement hybrid search until Chroma supports it natively. If you know of a better approach, or if more context is needed, please let me know.

PatrickDiallo23 avatar Mar 01 '25 14:03 PatrickDiallo23