[Bug]: The same DB content returns different results when queried in different environments

Open r17652001 opened this issue 4 months ago • 2 comments

What happened?

I query Chromadb through a Django website. The production environment runs in Docker, with Chromadb located in a predefined volume. The development environment runs outside Docker and uses the Chromadb data taken directly from the production environment. When I send the same question in both environments, production always returns Case1, but the non-Docker environment randomly alternates between Case1 and Case2. Below is the code I use for querying. I don't believe that changing the environment should change the results. Do you have any suggestions?

import os

import chromadb
from chromadb.utils import embedding_functions

chroma_client = chromadb.PersistentClient(path=os.getenv("CHROMADB_PATH"))
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_KEY"),
    api_base=os.getenv("OPENAI_ENDPOINT"),
    api_type=os.getenv("OPENAI_TYPE"),
    api_version=os.getenv("OPENAI_VERSION"),
    deployment_id=os.getenv("OPENAI_DEPLOYMENT_EMBEDDING")
)
collection = chroma_client.get_or_create_collection(
    name="FILE", embedding_function=openai_ef)
results = collection.query(query_texts=user_query, n_results=top_n)

The initial setup used to create this Chromadb is as follows:

chroma_client = chromadb.PersistentClient(path=db_path)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_KEY"),
    api_base=os.getenv("OPENAI_ENDPOINT"),
    api_type=os.getenv("OPENAI_TYPE"),
    api_version=os.getenv("OPENAI_VERSION"),
    deployment_id=os.getenv("OPENAI_DEPLOYMENT_EMBEDDING")
)
collection = chroma_client.get_or_create_collection(
    name="FILE",
    metadata={"hnsw:space": "cosine"},
    embedding_function=openai_ef
)

Versions

chromadb=0.4.22, Python 3.8.0, Ubuntu 18.04.5 LTS, Django=4.1.5

Relevant log output

Case1
                        ids  distances  documents
0  202403221109153231710023   0.099198  Summary Item 1: The ......
1  202403221004070146850023   0.107203  Summary Item 1: The ......
2  202403221109153231710024   0.120497  Summary Item 2: The ......
3  202403221004070146850024   0.128477  Summary Item 2: The ......
4  202403221109153231710022   0.143013  The XY Problem ......

Case2
                        ids  distances  documents
0  202403221706341333690284   0.275326  3. ......
1  202403221706341333690285   0.275660  4. ......
2  202403221706341333690313   0.279369  ......
3  202403221605332613050173   0.280373  ......
4  202403221706341333690173   0.280373  ......

r17652001, Apr 02 '24 04:04

@r17652001, one thing I notice is that the distance metric differs between the two environments: in one you use cosine, while in the other you rely on the default, which is L2. If the latter is intended, you should expect the returned results to differ.
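
To illustrate why the metric matters, here is a minimal pure-Python sketch, not Chroma code. It assumes the `l2` space means squared Euclidean distance and `cosine` means one minus cosine similarity, as Chroma's documentation describes. The vectors and document names are hypothetical.

```python
import math


def squared_l2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    # One minus cosine similarity; ignores vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query, docs, distance):
    # Return document names ordered from nearest to farthest.
    return sorted(docs, key=lambda name: distance(query, docs[name]))


# Hypothetical, deliberately unnormalized vectors.
query = [1.0, 0.0]
docs = {
    "same_direction_far": [10.0, 0.0],
    "other_direction_near": [1.0, 1.0],
}

cosine_order = rank(query, docs, cosine_distance)
l2_order = rank(query, docs, squared_l2)
print("cosine:", cosine_order)
print("l2:    ", l2_order)
```

Here the two metrics rank the documents in opposite order. Note that for unit-length vectors squared L2 equals twice the cosine distance, so in that case a metric mismatch would change the reported distances but not the ordering.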

tazarov, Apr 02 '24 04:04

The Chromadb created above is the initial setup. Both Environment A (Docker) and Environment B read the same Chromadb content, so in theory the results should be consistent across both. However, only the results queried in Environment B are unstable. P.S. Both Environment A and B were tested through Django websites.
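
One way to quantify the instability in each environment is to repeat the same query and count how many distinct result sets come back. The sketch below is only an illustration: `count_distinct_results` and `fake_query` are hypothetical helpers, and it assumes the query result is a dict whose `"ids"` entry holds one list of ids per query text.

```python
import itertools
from collections import Counter


def top_ids(result):
    # Take the id list for the first (and only) query text.
    return tuple(result["ids"][0])


def count_distinct_results(run_query, repeats=10):
    # Call run_query repeatedly and count each distinct id tuple.
    return Counter(top_ids(run_query()) for _ in range(repeats))


# Hypothetical stand-in that alternates between two result sets,
# mimicking the Case1/Case2 behaviour without touching a database.
_fake_results = itertools.cycle([
    {"ids": [["a", "b"]]},
    {"ids": [["c", "d"]]},
])


def fake_query():
    return next(_fake_results)


counts = count_distinct_results(fake_query, repeats=4)
for ids, n in counts.items():
    print(n, ids)
```

Against a real collection, `run_query` could be something like `lambda: collection.query(query_texts=user_query, n_results=top_n)`. A stable environment should yield a single entry in the counter.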

r17652001, Apr 02 '24 04:04