chroma icon indicating copy to clipboard operation
chroma copied to clipboard

[Bug]: Persistence within single program despite non-persistent clients

Open adrian-valente opened this issue 1 year ago • 10 comments

What happened?

When writing tests, I found that a Client instance set to be non-persistent will still remain in memory within the same program, and its collections will be recovered when a new instance is opened, breaking the expectations for a non-persistent in-memory instance. Here is a simple code to reproduce the issue:

import chromadb

setts = chromadb.config.Settings(is_persistent=False)
client = chromadb.Client(settings=setts)
coll = client.get_or_create_collection('test')
coll.add(ids=["1"], documents=["a"], embeddings=[[1]*128])
del client
del coll

client = chromadb.Client(settings=setts)
print(client.get_collection('test').get())

Versions

Chroma v0.4.24, langchain v0.1.11, python v3.11.8

Relevant log output

{'ids': ['1'], 'embeddings': None, 'metadatas': [None], 'documents': ['a'], 'uris': None, 'data': None}

adrian-valente avatar Apr 05 '24 17:04 adrian-valente

I would be happy to look into it more, but would first like to know if this type of behavior was expected or not.

adrian-valente avatar Apr 05 '24 17:04 adrian-valente

Just wanted to add that while reading the code, I noticed the presence of EphemeralClient, but which still seems to give a similar bug

adrian-valente avatar Apr 05 '24 22:04 adrian-valente

@adrian-valente, thanks for raising this. I can confirm that Chroma does ineed leave a reference to the client in cache, but also possibly in a thread local:

Screenshot 2024-04-06 at 7 23 46

While the cache can be cleaned up with SharedSystemClient.clear_system_cache(), the other is a bit tricker. We have a PR that is supposed to close and clean resources https://github.com/chroma-core/chroma/pull/1792. Please test your hypothesis with the PR and let me know if it might be possible for us to fix this problem there, too.

tazarov avatar Apr 06 '24 07:04 tazarov

@tazarov sounds fantastic, and exactly what is needed ! I have tested the PR, with the slighlty modified code:

import chromadb

setts = chromadb.config.Settings(is_persistent=False)
client = chromadb.EphemeralClient()
coll = client.get_or_create_collection('test')
coll.add(ids=["1"], documents=["a"], embeddings=[[1]*128])
client.close()
del client

client = chromadb.EphemeralClient()
print(client.get_collection('test').get())

Things however do not work as intended, and when recreating the client I get the error:

Traceback (most recent call last):
  File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 438, in _validate_tenant_database
    self._admin_client.get_tenant(name=tenant)
  File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 490, in get_tenant
    return self._server.get_tenant(name=name)
  File "/Users/avalente/Documents/repos/chroma/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
    return f(*args, **kwargs)
  File "/Users/avalente/Documents/repos/chroma/chromadb/api/segment.py", line 158, in get_tenant
    self._raise_for_running()
  File "/Users/avalente/Documents/repos/chroma/chromadb/api/segment.py", line 114, in _raise_for_running
    raise RuntimeError("Component not running or already closed")
RuntimeError: Component not running or already closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/avalente/Documents/repos/chroma/chromadb/__init__.py", line 119, in EphemeralClient
    return ClientCreator(settings=settings, tenant=tenant, database=database)
  File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 144, in __init__
    self._validate_tenant_database(tenant=tenant, database=database)
  File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 447, in _validate_tenant_database
    raise ValueError(
ValueError: Could not connect to tenant default_tenant. Are you sure it exists?

May I suggest however to add this bit of code as a test instance to the PR? I thing it would also make sense to add a call to close() in the del methods of the non-persistent client instances.

adrian-valente avatar Apr 08 '24 10:04 adrian-valente

Hi again, after looking it up I think it would require quite a bit more work to get it working right, and I don't know if it would deserve its own PR or not. My understanding is that a reference to the system, and hence the server instance, is kept in the class dictionary SharedSystemClient._identifier_to_system[]. The easiest would be to remove the corresponding entry in this dictionary once the system is stopped and corresponding clients are deleted, but I don't know if it risks breaking other things, or a design choice. An alternative would be to make sure that ephemeral clients get a new identifier each time a new one is created (which would also allow to have several unrelated ephemeral clients in the same program). In any case, this can be done in a follow-up PR if you want to move quickly with the current one.

adrian-valente avatar Apr 08 '24 14:04 adrian-valente

We have a method to clean up the cache - https://github.com/chroma-core/chroma/blob/dab637f1a7dbd448fe53b171382f55c7e642d816/chromadb/api/client.py#L105. So maybe it make sense to call the method upon closing the client to ensure the client has been cleaned up from the cache which will not cause this issue. let me give it a try

tazarov avatar Apr 08 '24 16:04 tazarov

Yes, I had overlooked this in your earlier message. Adding a call to this function leads to correct behavior indeed. So the following code works (as in correctly raises the "Collection test does not exist" error):

import chromadb

setts = chromadb.config.Settings(is_persistent=False)
client = chromadb.EphemeralClient()
coll = client.get_or_create_collection('test')
coll.add(ids=["1"], documents=["a"], embeddings=[[1]*128])
client.close()
chromadb.api.client.SharedSystemClient.clear_system_cache()
del client

client = chromadb.EphemeralClient()
print(client.get_collection('test').get())

Then I am confused about the design of SharedSystemClient: is it ok to just empty this lookup table when a client is closed or deleted? Otherwise how can the users know when to empty this cache?

adrian-valente avatar Apr 08 '24 21:04 adrian-valente

The cache is there as we need to:

  • Make sure that clients that need to be "unique" are indeed unique, e.g. HttpClient
  • Make sure clients that can repeat, like PeristentClient with different path can also be tracked

If you have a suggestion about refactoring the SharedSystemClient please submit a PR.

tazarov avatar Apr 09 '24 18:04 tazarov

Sent a PR to your feature branch, will be happy to rebase on main if you merge yours.

adrian-valente avatar Apr 10 '24 19:04 adrian-valente

@adrian-valente, thanks for this. I'll have a look today.

tazarov avatar Apr 11 '24 14:04 tazarov