
[Bug]: Problem with data relevance when working with the database in different processes

Open yegor-matveyas opened this issue 1 year ago • 1 comment

What happened?

There is a web application that uses langchain and chroma. If a script that deletes some data from the database runs in parallel with this application, the deleted data remains partially available in the main application, notably in langchain's max_marginal_relevance_search_by_vector function, which runs its query through chroma's .query method. Interestingly, only the ids and embeddings of the deleted records are returned; their documents and metadatas are None.

The result of such a query looks something like this:

result = {
  "ids":  [['9232', '9133', '9392', '9132', '9233', '9037', '9006', '9394', '9134', '9236', '9234', '9395', '9131', '9007', '9396', '9393', '8952', '8954', '8953', '9235']],
  "documents": [[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'document #17', 'document #18', 'document #19', None]],
  "embeddings": [[0.0073, ...], [-0.0076, ...], [0.0086, ...], [-0.0077, ...], [-0.0007, ...], [-0.0008, ...], [0.0081, ...], [0.0047, ...], [-0.0078, ...], [0.0032, ...], [0.0028, ...], [0.0040, ...], [-0.0113, ...], [0.0113, ...], [0.0016, ...], [0.0088, ...], [0.0001, ...], [0.0025, ...], [-0.0012, ...], [0.0050, ...]],
  "metadatas": [[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, {"url": "https://example.com/"}, {"url": "https://example.com/"}, {"url": "https://example.com/"}, None]],
}
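
A minimal sketch of how such a result could be inspected on the client side, separating hits that still carry a document or metadata from hits that come back with both set to None. The helper name split_stale_hits and the shortened sample are hypothetical, and the check rests on an assumption: a record that was legitimately stored without a document and without metadata would be flagged as stale too.

```python
from typing import Any


def split_stale_hits(result: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (live_ids, stale_ids) for the first query in a Chroma-style result."""
    ids = result["ids"][0]
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]

    live, stale = [], []
    for id_, doc, meta in zip(ids, documents, metadatas):
        # Assumption: a hit with neither document nor metadata is a deleted
        # record that the index still returns.
        if doc is None and meta is None:
            stale.append(id_)
        else:
            live.append(id_)
    return live, stale


# Shortened sample in the same shape as the result shown above.
sample = {
    "ids": [["9232", "8952", "8954", "9235"]],
    "documents": [[None, "document #17", "document #18", None]],
    "metadatas": [[None, {"url": "https://example.com/"}, {"url": "https://example.com/"}, None]],
}

live_ids, stale_ids = split_stale_hits(sample)
print("live:", live_ids)
print("stale:", stale_ids)
```

This only makes the symptom visible and lets the application drop the stale hits; it does not explain why the deleted ids are still being returned.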

Versions

Chroma v0.4.24, Python 3.12.1, Windows 11

Relevant log output

No response

yegor-matveyas avatar Apr 12 '24 16:04 yegor-matveyas

@yegor-matveyas, do you have the application code that deletes the records from Chroma? Can you share it? Is there any chance that the application only does partial deletes?
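
A rough sketch of one way to check for partial deletes from inside the deleting script: delete by id, then ask the collection for the same ids again and report anything that is still there. It relies only on the collection's delete(ids=...) and get(ids=...) calls; the helper name delete_and_verify and the InMemoryCollection stand-in are hypothetical, the latter included only so the example runs without a database.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a Chroma collection, for illustration only."""

    def __init__(self, ids):
        self._ids = set(ids)

    def delete(self, ids):
        self._ids -= set(ids)

    def get(self, ids):
        # Mirrors the shape of a get() result: a flat list under "ids".
        return {"ids": [i for i in ids if i in self._ids]}


def delete_and_verify(collection, ids_to_delete):
    """Delete the given ids, then return any that can still be fetched."""
    collection.delete(ids=ids_to_delete)
    return collection.get(ids=ids_to_delete)["ids"]


collection = InMemoryCollection(["9232", "9133", "8952"])
leftover = delete_and_verify(collection, ["9232", "9133"])
print("still present after delete:", leftover)
```

An empty list here would suggest the delete itself is complete in the deleting process, which would point toward the other process serving stale state rather than a partial delete.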

tazarov avatar Apr 14 '24 05:04 tazarov