chroma
chroma copied to clipboard
[Bug]: Persistence within single program despite non-persistent clients
What happened?
When writing tests, I found that a Client instance set to be non-persistent will still remain in memory within the same program, and its collections will be recovered when a new instance is opened, breaking the expectations for a non-persistent in-memory instance. Here is a simple code to reproduce the issue:
import chromadb
setts = chromadb.config.Settings(is_persistent=False)
client = chromadb.Client(settings=setts)
coll = client.get_or_create_collection('test')
coll.add(ids=["1"], documents=["a"], embeddings=[[1]*128])
del client
del coll
client = chromadb.Client(settings=setts)
print(client.get_collection('test').get())
Versions
Chroma v0.4.24, langchain v0.1.11, python v3.11.8
Relevant log output
{'ids': ['1'], 'embeddings': None, 'metadatas': [None], 'documents': ['a'], 'uris': None, 'data': None}
I would be happy to look into it more, but would first like to know if this type of behavior was expected or not.
Just wanted to add that while reading the code, I noticed the presence of EphemeralClient, but which still seems to give a similar bug
@adrian-valente, thanks for raising this. I can confirm that Chroma does ineed leave a reference to the client in cache, but also possibly in a thread local:
While the cache can be cleaned up with SharedSystemClient.clear_system_cache(), the other is a bit tricker. We have a PR that is supposed to close and clean resources https://github.com/chroma-core/chroma/pull/1792. Please test your hypothesis with the PR and let me know if it might be possible for us to fix this problem there, too.
@tazarov sounds fantastic, and exactly what is needed ! I have tested the PR, with the slighlty modified code:
import chromadb
setts = chromadb.config.Settings(is_persistent=False)
client = chromadb.EphemeralClient()
coll = client.get_or_create_collection('test')
coll.add(ids=["1"], documents=["a"], embeddings=[[1]*128])
client.close()
del client
client = chromadb.EphemeralClient()
print(client.get_collection('test').get())
Things however do not work as intended, and when recreating the client I get the error:
Traceback (most recent call last):
File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 438, in _validate_tenant_database
self._admin_client.get_tenant(name=tenant)
File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 490, in get_tenant
return self._server.get_tenant(name=name)
File "/Users/avalente/Documents/repos/chroma/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
return f(*args, **kwargs)
File "/Users/avalente/Documents/repos/chroma/chromadb/api/segment.py", line 158, in get_tenant
self._raise_for_running()
File "/Users/avalente/Documents/repos/chroma/chromadb/api/segment.py", line 114, in _raise_for_running
raise RuntimeError("Component not running or already closed")
RuntimeError: Component not running or already closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/avalente/Documents/repos/chroma/chromadb/__init__.py", line 119, in EphemeralClient
return ClientCreator(settings=settings, tenant=tenant, database=database)
File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 144, in __init__
self._validate_tenant_database(tenant=tenant, database=database)
File "/Users/avalente/Documents/repos/chroma/chromadb/api/client.py", line 447, in _validate_tenant_database
raise ValueError(
ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
May I suggest however to add this bit of code as a test instance to the PR? I thing it would also make sense to add a call to close() in the del methods of the non-persistent client instances.
Hi again, after looking it up I think it would require quite a bit more work to get it working right, and I don't know if it would deserve its own PR or not. My understanding is that a reference to the system, and hence the server instance, is kept in the class dictionary SharedSystemClient._identifier_to_system[]. The easiest would be to remove the corresponding entry in this dictionary once the system is stopped and corresponding clients are deleted, but I don't know if it risks breaking other things, or a design choice. An alternative would be to make sure that ephemeral clients get a new identifier each time a new one is created (which would also allow to have several unrelated ephemeral clients in the same program). In any case, this can be done in a follow-up PR if you want to move quickly with the current one.
We have a method to clean up the cache - https://github.com/chroma-core/chroma/blob/dab637f1a7dbd448fe53b171382f55c7e642d816/chromadb/api/client.py#L105. So maybe it make sense to call the method upon closing the client to ensure the client has been cleaned up from the cache which will not cause this issue. let me give it a try
Yes, I had overlooked this in your earlier message. Adding a call to this function leads to correct behavior indeed. So the following code works (as in correctly raises the "Collection test does not exist" error):
import chromadb
setts = chromadb.config.Settings(is_persistent=False)
client = chromadb.EphemeralClient()
coll = client.get_or_create_collection('test')
coll.add(ids=["1"], documents=["a"], embeddings=[[1]*128])
client.close()
chromadb.api.client.SharedSystemClient.clear_system_cache()
del client
client = chromadb.EphemeralClient()
print(client.get_collection('test').get())
Then I am confused about the design of SharedSystemClient: is it ok to just empty this lookup table when a client is closed or deleted? Otherwise how can the users know when to empty this cache?
The cache is there as we need to:
- Make sure that clients that need to be "unique" are indeed unique, e.g. HttpClient
- Make sure clients that can repeat, like PeristentClient with different
pathcan also be tracked
If you have a suggestion about refactoring the SharedSystemClient please submit a PR.
Sent a PR to your feature branch, will be happy to rebase on main if you merge yours.
@adrian-valente, thanks for this. I'll have a look today.