Add BM25 Full Text Search algorithm for hybrid search ability to Chromadb
Describe the problem
Please add full-text search with an algorithm like BM25 to enable hybrid search, especially for RAG solutions. Right now, many advanced RAG solutions depend on hybrid search, and ChromaDB is one of the most widely used vector databases for semantic search applications.
Describe the proposed solution
It would be great if the Chroma API supported/exposed a full-text search feature.
Alternatives considered
No response
Importance
i cannot use Chroma without it
Additional Information
No response
Chroma already supports full text search using the where_document feature: https://docs.trychroma.com/guides#filtering-by-document-contents
This should be much better named so people can find it more easily. We are looking into solutions for bm25 and similar.
Strongly looking forward to chroma joining hybrid search!
where_document filtering is NOT full text search! I would also love to see support for BM25! :)
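To make the distinction concrete, here is a minimal, dependency-free sketch contrasting a substring filter (in the spirit of a "contains" check) with BM25 ranking. It does not use Chroma at all; the helper names (`substring_filter`, `bm25_scores`), the toy documents, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; real engines use proper analyzers).
    return text.lower().split()


def substring_filter(docs, needle):
    # Boolean filter: a document either contains the exact substring or it does not.
    return [d for d in docs if needle in d]


def bm25_scores(docs, query, k1=1.5, b=0.75):
    # Score every document against the query with the Okapi BM25 formula.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "hybrid search combines vector search and keyword search",
    "chroma is a vector database",
    "bm25 ranks documents for keyword search using term frequency",
]
query = "bm25 keyword search"

filtered = substring_filter(docs, query)
scores = bm25_scores(docs, query)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print("substring matches:", filtered)
print("bm25 ranking (doc indices):", ranking)
```

With these toy documents the substring filter matches nothing, because no document contains the exact phrase, while BM25 still ranks the third document first and gives partial credit to the first one. That graded, term-weighted ranking is what people in this thread mean by full text search.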
closing in favor of https://github.com/chroma-core/chroma/issues/1330 - thanks for requesting this!
The best solution I found was to move everything to Qdrant.
Are there any updates on this feature?
I proposed a solution for those who are using ChromaDB with Langchain. I will repeat here the answer I gave in issue #1330 hoping that it will help you:
For those who have integrated the ChromaDB client with the Langchain framework, I used the following approach to implement Hybrid search (Vector Search + BM25Retriever):
from typing import List

from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from langchain_openai import OpenAIEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict

# Assuming that you have instantiated a Chroma client and integrated it into LangChain (below is an example)
"""
persistent_client = chromadb.PersistentClient(path="./test", settings=Settings(allow_reset=True))
collection = persistent_client.get_or_create_collection(
    name="example",
    metadata={
        "hnsw:space": "cosine",
        # you can add other HNSW parameters if you want
    }
)
chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"))
"""

class VectorStore:
    """Thin wrapper around the LangChain Chroma instance (class name chosen for this example)."""

    def __init__(self, chroma: Chroma):
        self.chroma = chroma

    def hybrid_search(self, query: str, k: int = 5):
        """Perform a Hybrid Search (similarity_search + BM25Retriever) in the collection."""
        # Get all raw documents from ChromaDB (this loads the whole collection on every call)
        raw_docs = self.chroma.get(include=["documents", "metadatas"])
        # Convert them into Document objects (metadata can be None, so fall back to an empty dict)
        documents = [
            Document(page_content=doc, metadata=meta or {})
            for doc, meta in zip(raw_docs["documents"], raw_docs["metadatas"])
        ]
        # Create BM25Retriever from the documents
        bm25_retriever = BM25Retriever.from_documents(documents=documents, k=k)
        # Create vector search retriever from the Chroma instance
        similarity_search_retriever = self.chroma.as_retriever(
            search_type="similarity",
            search_kwargs={"k": k}
        )
        # Ensemble the retrievers using LangChain's EnsembleRetriever object
        ensemble_retriever = EnsembleRetriever(
            retrievers=[similarity_search_retriever, bm25_retriever],
            weights=[0.5, 0.5]
        )
        # Retrieve the relevant documents for the query
        # If needed, use the ainvoke(query) method to retrieve the docs asynchronously
        return ensemble_retriever.invoke(query)

# Call hybrid_search() method
# vector_store = VectorStore(chroma)  # built from the Chroma instance in the example above

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# --- Define Graph Nodes (retrieve, generate, etc.) ---
def retrieve(state: State) -> dict:
    retrieved_docs = vector_store.hybrid_search(state["question"], 3)
    return {"context": retrieved_docs}
Note: The code above covers only the retrieval component; it still needs to be integrated into your application structure and RAG flow.
I hope this example helps you implement hybrid search until Chroma supports it natively. If you know of a better approach, or if more context is needed, please let me know.
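For intuition about how the two result lists get merged: as far as I know, LangChain's EnsembleRetriever combines rankings with a weighted Reciprocal Rank Fusion (RRF). The sketch below illustrates that idea in plain Python and is not the library's actual code; the function name `weighted_rrf`, the document IDs, and the constant `c=60` are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    # Each list contributes weight / (rank + c) to every document it returned.
    # Ranks start at 1; c dampens the gap between neighbouring ranks.
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)
```

Documents returned by both retrievers (here `doc_a` and `doc_b`) rise to the top, which is why the fusion works without having to normalize BM25 scores against cosine similarities. Changing the weights shifts the balance toward keyword or semantic matches.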