[Feature Request]: Hybrid Search with BM25
Describe the problem
Can you provide keyword search combined with semantic search, like other vector stores do?
Describe the proposed solution
keyword: BM25
Alternatives considered
No response
Importance
would make my life easier
Additional Information
No response
Agree! Hybrid search is the ultimate solution. Weaviate has hybrid search which combines BM25 and vector search. Hope Chroma can do it too.
Still no news on this?
We are waiting for it!
any updates?
Any plans to support this? This would be extremely useful for us.
+1
+1
+1
This is probably the biggest upgrade chroma can have
up
Hope this gets prioritized.
waiting for Hybrid search in ChromaDB
+1
+1
I found the best solution: we moved everything to Qdrant.
please bruh drop this feature my customer is yelling at me bruh
Hope it gets implemented soon, it will make our lives easier.
Looking forward to this feature!!!!!
Yes plz!
really important feature, hope it gets out there soon!
+1
any update?
+1
I had to write this on my own 1 year ago. I store my own BM25 index and implement hybrid with RRF. Would love to see ChromaDB do the same.
@trentniemeyer Is your implementation openly available?
Not exactly BM25, but everyone should check out these sections of the docs: https://docs.trychroma.com/docs/querying-collections/full-text-search https://docs.trychroma.com/reference/python/collection#query
Essentially you can pass required keywords yourself as filter. Again not exactly BM25 but should help most people. Thanks @jeffchuber for pointing this out!
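A minimal sketch of what passing a keyword as a filter could look like, assuming a Chroma `collection` object and the `where_document` / `$contains` filter described in the linked docs. The helper name `keyword_filtered_query` is made up for illustration. Note this only narrows the vector search to documents containing the keyword; it does not rank by BM25.

```python
def keyword_filtered_query(collection, query_text, keyword, n_results=5):
    """Semantic query restricted to documents whose text contains `keyword`.

    `collection` is assumed to be a chromadb Collection, e.g. obtained via
    chromadb.PersistentClient(path=...).get_or_create_collection(name=...).
    """
    return collection.query(
        query_texts=[query_text],               # embedded by the collection's embedding function
        n_results=n_results,
        where_document={"$contains": keyword},  # keyword filter on the document text
    )

# Usage (hypothetical names):
# results = keyword_filtered_query(collection, "who can reset passwords?", "Administrator")
# print(results["documents"])
```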
@tallesl it's not available (mostly because I write messy/hacky code), but I'm happy to share what I did. RRF is quite simple, as is BM25. I wrote my own BM25 index, but just heard about this: https://huggingface.co/blog/xhluca/bm25s
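To give a sense of how small BM25 is, here is a rough sketch of the scoring formula in plain Python. This is not the index described in the comment above. The whitespace tokenizer, the `k1` and `b` defaults, and the Lucene-style IDF variant are all assumptions picked for illustration, and the document list is assumed to be non-empty.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real indexes
    # usually also strip punctuation and apply stemming).
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # IDF variant that never goes negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the administrator reset the password",
    "vector search finds similar embeddings",
    "contact your administrator for access",
]
scores = bm25_scores("administrator password", docs)
# Document indices ordered from best to worst match
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The resulting ranking can then be fed into a fusion step such as RRF alongside the vector search ranking.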
Here is my RRF implementation (but I added date decay)
```python
from datetime import datetime


def __reciprocal_rank_fusion_date_decay(list1, list2, k=60, decay_factor=0.0, limit=10):
    """
    Apply Reciprocal Rank Fusion on two lists of tuples, where the rank is
    determined by the order in the list. The first item of each tuple is the
    key; the associated info for a key must match across both lists.

    :param list1: First ranking list of tuples (key, associated info, year)
    :param list2: Second ranking list of tuples (key, associated info, year)
    :param k: Constant used in RRF formula, typically set to 60
    :param decay_factor: Factor controlling the date decay, a value between 0 and 1 (0 means no decay)
    :param limit: Maximum number of fused results to return
    :return: List of (key, associated info) tuples sorted by RRF score with date decay
    """
    rrf_scores = {}
    info_dict = {}
    current_year = datetime.now().year

    # Process the first list of tuples
    for rank, (key, info, year) in enumerate(list1, start=1):
        delta_years = current_year - year
        rrf_scores[key] = rrf_scores.get(key, 0) + (1 / (k + rank)) * (1 - decay_factor) ** delta_years
        info_dict[key] = info

    # Process the second list of tuples
    for rank, (key, info, year) in enumerate(list2, start=1):
        delta_years = current_year - year
        rrf_scores[key] = rrf_scores.get(key, 0) + (1 / (k + rank)) * (1 - decay_factor) ** delta_years
        # Ensure the associated info matches across both lists for the same key
        if key in info_dict and info_dict[key] != info:
            raise ValueError(f"Associated information for key '{key}' does not match between lists.")
        info_dict[key] = info

    # Sort items based on RRF score, highest first
    sorted_items = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

    # Create sorted list of tuples with associated info, truncated to `limit`
    sorted_tuples = [(item, info_dict[item]) for item, _ in sorted_items[:limit]]
    return sorted_tuples
```
https://docs.trychroma.com/docs/querying-collections/full-text-search
How would I do this with full text:
I want to search my entire collection (single collection) for all documents with the keyword "Administrator"
I don't have a requirement to create more embedding, I just want to search the documents already available in the collection.
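One possible way to do this, sketched under the assumption that `collection` is a Chroma collection and that `get()` accepts the `where_document` filter from the linked full-text search docs. No query embedding is involved, since `get()` only filters the stored documents. The helper name `find_documents_containing` is hypothetical.

```python
def find_documents_containing(collection, keyword):
    """Return (id, document) pairs for every stored document containing `keyword`."""
    result = collection.get(
        where_document={"$contains": keyword},  # plain substring filter, no embeddings
        include=["documents", "metadatas"],
    )
    return list(zip(result["ids"], result["documents"]))

# Usage (hypothetical names):
# for doc_id, text in find_documents_containing(collection, "Administrator"):
#     print(doc_id, text[:80])
```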
Hi,
For those who have integrated the ChromaDB client with the LangChain framework, I used the following approach to implement hybrid search (vector search + BM25Retriever):
```python
from typing import List

import chromadb
from chromadb.config import Settings
from langchain.retrievers import EnsembleRetriever
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langgraph.graph import START, StateGraph  # used when wiring the nodes into a graph (not shown)
from typing_extensions import TypedDict

# Assuming that you have instantiated a Chroma client and integrated it into
# LangChain (the setup below is only an example; adjust it to your environment)
persistent_client = chromadb.PersistentClient(path="./test", settings=Settings(allow_reset=True))
collection = persistent_client.get_or_create_collection(
    name="example",
    metadata={
        "hnsw:space": "cosine",
        # you can add other HNSW parameters if you want
    },
)
chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)


class VectorStore:
    """Thin wrapper around the Chroma instance (the `self` in hybrid_search implies a class)."""

    def __init__(self, chroma: Chroma):
        self.chroma = chroma

    def hybrid_search(self, query: str, k: int = 5) -> List[Document]:
        """Perform a Hybrid Search (similarity_search + BM25Retriever) in the collection."""
        # Get all raw documents from ChromaDB
        # (note: this reloads everything and rebuilds the BM25 index on every call)
        raw_docs = self.chroma.get(include=["documents", "metadatas"])
        # Convert them into Document objects
        documents = [
            Document(page_content=doc, metadata=meta)
            for doc, meta in zip(raw_docs["documents"], raw_docs["metadatas"])
        ]
        # Create BM25Retriever from the documents
        bm25_retriever = BM25Retriever.from_documents(documents=documents, k=k)
        # Create vector search retriever from the Chroma instance
        similarity_search_retriever = self.chroma.as_retriever(
            search_type="similarity",
            search_kwargs={"k": k},
        )
        # Ensemble the retrievers using LangChain's EnsembleRetriever object
        ensemble_retriever = EnsembleRetriever(
            retrievers=[similarity_search_retriever, bm25_retriever],
            weights=[0.5, 0.5],
        )
        # Retrieve the relevant documents for the query
        # (if needed, use the ainvoke(query) method to retrieve the docs asynchronously)
        return ensemble_retriever.invoke(query)


vector_store = VectorStore(chroma)


# Call the hybrid_search() method
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# --- Define Graph Nodes (retrieve, generate, etc.) ---
def retrieve(state: State) -> dict:
    retrieved_docs = vector_store.hybrid_search(state["question"], 3)
    return {"context": retrieved_docs}
```
Note: The code above covers only the retrieval component; it still needs to be integrated into your application structure and RAG flow.
I hope this example helps you implement hybrid search until Chroma supports it natively. If you know of a better approach, or if more context is needed, please let me know.
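For anyone wondering what the `weights=[0.5, 0.5]` argument does: as far as I understand, EnsembleRetriever fuses the result lists with a weighted form of Reciprocal Rank Fusion. The sketch below is a standalone illustration of that idea, not LangChain's actual code. The function name `weighted_rrf`, the constant `c=60`, and the document ids are assumptions for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids; each list adds weight / (c + rank) per document."""
    if len(rankings) != len(weights):
        raise ValueError("Need exactly one weight per ranking.")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]  # hypothetical ids from the similarity search
bm25_hits = ["c", "a", "d"]    # hypothetical ids from the BM25 retriever
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") end up ahead of those found by only one retriever, which is the main point of fusing the two rankings.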