[Feature Request]: Hybrid Search with BM25

Open xinyuli1204 opened this issue 1 year ago • 19 comments

Describe the problem

Can you provide keyword search combined with semantic search, like other vector stores do?

Describe the proposed solution

keyword: BM25

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

xinyuli1204 avatar Nov 03 '23 22:11 xinyuli1204

Agree! Hybrid search is the ultimate solution. Weaviate has hybrid search which combines BM25 and vector search. Hope Chroma can do it too.

PTTrazavi avatar Nov 09 '23 13:11 PTTrazavi

Still no news on this?

Soremojinsen avatar Mar 20 '24 10:03 Soremojinsen

We are waiting for it!

Dramulas avatar Mar 26 '24 06:03 Dramulas

any updates?

RafikYH avatar May 16 '24 12:05 RafikYH

Any plans to support this? This would be extremely useful for us.

sanketsynthexailabs avatar Jul 30 '24 02:07 sanketsynthexailabs

+1

han508 avatar Aug 20 '24 18:08 han508

+1

Leeaandrob avatar Aug 21 '24 02:08 Leeaandrob

+1

GurjotTatras avatar Sep 11 '24 11:09 GurjotTatras

This is probably the biggest upgrade chroma can have

debkanchan avatar Sep 28 '24 19:09 debkanchan

up

benlyazid avatar Oct 07 '24 10:10 benlyazid

Hope this gets prioritized.

sakethram18 avatar Oct 07 '24 23:10 sakethram18

waiting for Hybrid search in ChromaDB

GurjotTatras avatar Oct 08 '24 04:10 GurjotTatras

+1

derevyan avatar Oct 15 '24 10:10 derevyan

+1

Einengutenmorgen avatar Oct 15 '24 13:10 Einengutenmorgen

The best solution I found: move everything to Qdrant.

derevyan avatar Oct 15 '24 17:10 derevyan

please bruh drop this feature my customer is yelling at me bruh

debkanchan avatar Nov 28 '24 09:11 debkanchan

Hope it gets implemented soon; it will make our lives easier.

mlrana avatar Nov 28 '24 10:11 mlrana

Looking forward to this feature!!!!!

zzw1123 avatar Dec 27 '24 08:12 zzw1123

Yes plz!

sriharshaguthikonda avatar Jan 05 '25 04:01 sriharshaguthikonda

really important feature, hope it gets out there soon!

maaganm-hub avatar Jan 17 '25 07:01 maaganm-hub

+1

joodaloop avatar Jan 24 '25 12:01 joodaloop

any update?

jpzhangvincent avatar Jan 24 '25 19:01 jpzhangvincent

+1

trapsidanadir avatar Jan 29 '25 23:01 trapsidanadir

I had to write this on my own 1 year ago. I store my own BM25 index and implement hybrid with RRF. Would love to see ChromaDB do the same.

trentniemeyer avatar Feb 02 '25 17:02 trentniemeyer

@trentniemeyer Is your implementation openly available?

tallesl avatar Feb 05 '25 03:02 tallesl

Not exactly BM25, but everyone should check out these sections of the docs: https://docs.trychroma.com/docs/querying-collections/full-text-search and https://docs.trychroma.com/reference/python/collection#query

Essentially, you can pass the required keywords yourself as a filter. Again, not exactly BM25, but it should help most people. Thanks @jeffchuber for pointing this out!

debkanchan avatar Feb 05 '25 17:02 debkanchan

@tallesl it's not available (mostly because I write messy/hacky code), but I'm happy to share with you what I did. RRF is quite simple, as is BM25. I wrote my own BM25 index, but just heard about this: https://huggingface.co/blog/xhluca/bm25s
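To give an idea of how small BM25 is, here is a minimal sketch of the standard Okapi BM25 scoring formula in plain Python. It is not the index described in this comment; the whitespace tokenizer, the `k1`/`b` defaults, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Term-frequency saturation with document-length normalization
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


docs = [
    "chroma supports vector search",
    "bm25 is a keyword ranking function",
    "hybrid search mixes bm25 and vector search",
]
scores = bm25_scores("bm25 keyword", docs)
# Document indices ordered from best to worst match
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The resulting ranking can then be fused with a vector-search ranking, for example with RRF.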

Here is my RRF implementation (but I added date decay)

`
from datetime import datetime

def __reciprocal_rank_fusion_date_decay(list1, list2, k=60, decay_factor=0.0, limit=10):
    """
    Apply Reciprocal Rank Fusion on two lists of tuples, where the rank is determined by the order in the list.
    The first item of each tuple is the key; the associated info (second item) must match across both lists for the same key.

    :param list1: First ranking list of tuples (key, associated info, year)
    :param list2: Second ranking list of tuples (key, associated info, year)
    :param k: Constant used in RRF formula, typically set to 60
    :param decay_factor: Factor controlling the date decay, a value between 0 and 1 (0 means no decay)
    :param limit: Maximum number of fused results to return
    :return: List of tuples sorted based on RRF score with date decay, maintaining associated info
    """
    rrf_scores = {}
    info_dict = {}

    current_year = datetime.now().year

    # Process the first list of tuples
    for rank, (key, info, year) in enumerate(list1, start=1):
        delta_years = current_year - year
        rrf_scores[key] = rrf_scores.get(key, 0) + (1 / (k + rank)) * (1 - decay_factor) ** delta_years
        info_dict[key] = info

    # Process the second list of tuples
    for rank, (key, info, year) in enumerate(list2, start=1):
        delta_years = current_year - year
        rrf_scores[key] = rrf_scores.get(key, 0) + (1 / (k + rank)) * (1 - decay_factor) ** delta_years
        # Ensure the associated info matches across both lists for the same key
        if key in info_dict and info_dict[key] != info:
            raise ValueError(f"Associated information for key '{key}' does not match between lists.")
        info_dict[key] = info

    # Sort items based on RRF score
    sorted_items = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

    # Create sorted list of tuples with associated info
    sorted_tuples = [(item, info_dict[item]) for item, _ in sorted_items[:limit]]

    return sorted_tuples`
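To see what the date decay does to a single term of the score, here is a small worked sketch. The helper name `decayed_rrf_term` is hypothetical; it only isolates the per-list contribution `(1 / (k + rank)) * (1 - decay_factor) ** delta_years` from the function above, and the ranks, ages, and decay factor are made-up values.

```python
def decayed_rrf_term(rank, age_years, k=60, decay_factor=0.0):
    """One list's contribution to a document's fused score, with date decay."""
    return (1 / (k + rank)) * (1 - decay_factor) ** age_years


# A 3-year-old document ranked first vs. a current-year document ranked fifth
old_top_hit = decayed_rrf_term(rank=1, age_years=3, decay_factor=0.1)
fresh_lower_hit = decayed_rrf_term(rank=5, age_years=0, decay_factor=0.1)

# With decay_factor=0.0 the term reduces to the plain RRF term 1 / (k + rank)
no_decay = decayed_rrf_term(rank=1, age_years=3)
```

With a 10% yearly decay, the fresh fifth-ranked document ends up with a larger term than the three-year-old top hit, which is the effect the decay is meant to have.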

trentniemeyer avatar Feb 05 '25 20:02 trentniemeyer

https://docs.trychroma.com/docs/querying-collections/full-text-search

How would I do this with full text: I want to search my entire collection (single collection) for all documents with the keyword "Administrator"

I don't have a requirement to create more embedding, I just want to search the documents already available in the collection.

duffybelfield avatar Feb 05 '25 21:02 duffybelfield

Hi,

For those who have integrated the ChromaDB client with the Langchain framework, I used the following approach to implement the Hybrid search (Vector Search + BM25Retriever):

from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from langchain_openai import OpenAIEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing import List
from typing_extensions import TypedDict
 
 
# Assuming that you have instantiated the Chroma client and integrated it into LangChain (below is an example)
"""
persistent_client = chromadb.PersistentClient(path="./test", settings=Settings(allow_reset=True))
collection = persistent_client.get_or_create_collection(
    name="example",
    metadata={
        "hnsw:space": "cosine",
        # you can add other HNSW parameters if you want
    }
)
 
chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)
"""
 
# Method of your own vector-store wrapper class, which holds the Chroma instance as self.chroma
def hybrid_search(self, query: str, k: int = 5):
    """Perform a Hybrid Search (similarity_search + BM25Retriever) in the collection."""
    # Get all raw documents from ChromaDB
    raw_docs = self.chroma.get(include=["documents", "metadatas"])
    # Convert them into Document objects
    documents = [
        Document(page_content=doc, metadata=meta)
        for doc, meta in zip(raw_docs["documents"], raw_docs["metadatas"])
    ]
    # Create a BM25Retriever from the documents (the BM25 index is rebuilt on every call)
    bm25_retriever = BM25Retriever.from_documents(documents=documents, k=k)
    # Create a vector search retriever from the ChromaDB instance
    similarity_search_retriever = self.chroma.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k},
    )
    # Ensemble the retrievers using LangChain's EnsembleRetriever object
    ensemble_retriever = EnsembleRetriever(
        retrievers=[similarity_search_retriever, bm25_retriever],
        weights=[0.5, 0.5],
    )
    # Retrieve the relevant documents for the query
    # If needed, use the ainvoke(query) method to retrieve the docs asynchronously
    return ensemble_retriever.invoke(query)
 
# Graph state used by the RAG flow
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
 
# --- Define Graph Nodes (retrieve, generate, etc.) ---
# Call hybrid_search(); `vector_store` is assumed to be an instance of the class that defines it
def retrieve(state: State) -> dict:
    retrieved_docs = vector_store.hybrid_search(state["question"], 3)
    return {"context": retrieved_docs}
 

Note: The above code is only a fragment containing the retrieval component; it still has to be integrated into the application structure and RAG flow.

Also, I hope this example will help you in the implementation of Hybrid Search until it is implemented by Chroma. If you know of a better approach or if a clearer context is needed, please let me know.
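For context on what the `weights=[0.5, 0.5]` argument controls: as far as I understand, the ensemble merges the two result lists with a weighted form of Reciprocal Rank Fusion. The sketch below is a plain-Python illustration of that idea under that assumption, not LangChain's actual code; the function name `weighted_rrf`, the constant `c=60`, and the document ids are made up.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids; each list contributes weight / (rank + c)."""
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["d1", "d2", "d3"]  # hypothetical vector-search ranking
bm25_hits = ["d3", "d1", "d4"]    # hypothetical BM25 ranking

equal = weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5])
keyword_heavy = weighted_rrf([vector_hits, bm25_hits], weights=[0.2, 0.8])
```

Shifting weight toward the BM25 list moves keyword matches such as `d3` and `d4` up the fused ranking, so the weights are worth tuning for your own data.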

PatrickDiallo23 avatar Mar 01 '25 14:03 PatrickDiallo23