chroma
Clean up embedding functions
There are other issues that track this, but I'm opening a new one and will link those in.
The way we handle embedding functions is currently borked. Users have to pass a matching embedding function any time they call get_collection, and list_collections is even more broken.
However, we don't want to store embedding functions server-side.
One idea is an "embedding function registry": we store a string server-side, and the client uses it to automatically look up in the registry which embedding model should be used.
More discussion needed.
I think I have encountered a mismatched embedding function issue when using list_collections. I was running a Docker container of chromadb as the server. I then ran a simple Python script that interacted with the chromadb container by creating a new collection with the VertexAI embedding function and retrieving all collections. When I printed the retrieved collections, I realised that the collection's embedding function had somehow defaulted to ONNXMiniLM_L6_V2 when it should have been GoogleVertexEmbeddingFunction.
import os

import chromadb
from chromadb.utils import embedding_functions

CHROMA_CLIENT = chromadb.HttpClient(host="localhost", port="8000")
EMBEDDING_FUNCTION = embedding_functions.GoogleVertexEmbeddingFunction(
    api_key=os.getenv("API_KEY"),
    project_id=os.getenv("PROJECT_ID"),
)

# Create the collection with the Vertex AI embedding function.
new_collection_name = 'abc'
new_collection = CHROMA_CLIENT.get_or_create_collection(
    name=new_collection_name,
    embedding_function=EMBEDDING_FUNCTION,
)

# Retrieve every collection and record which embedding function it reports.
# Observed: the reported class is the default, not the one passed above.
result_collections = []
collections = CHROMA_CLIENT.list_collections()
for collection in collections:
    name = collection.name
    id = collection.id
    print(collection._embedding_function)
    embed_func = collection._embedding_function.__class__.__name__
    result_collections.append({
        "collection_name": name,
        "collection_id": id,
        "collection_embedding_function": embed_func,
    })
One idea is a "embedding function registry" where we can store a string server-side that can "automatically lookup" client-side from the registry which embedding model should be used.
I was doing something like this while prototyping: storing an embedding_function string in the collection metadata.
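A rough sketch of that metadata approach, for anyone following along. This is not the code from the prototype: the key name `EMBEDDING_FUNCTION_KEY` and both helpers are hypothetical, and the sketch only shows how the name could be written into and read back from a plain metadata dict.

```python
from typing import Any, Dict, Optional

# Hypothetical metadata key; an assumption, not a Chroma convention.
EMBEDDING_FUNCTION_KEY = "embedding_function"


def with_embedding_function_name(
    metadata: Optional[Dict[str, Any]], name: str
) -> Dict[str, Any]:
    """Return a copy of the metadata that records the embedding function name."""
    tagged = dict(metadata or {})
    tagged[EMBEDDING_FUNCTION_KEY] = name
    return tagged


def embedding_function_name(metadata: Optional[Dict[str, Any]]) -> Optional[str]:
    """Read the recorded name back, or None if the collection was never tagged."""
    if not metadata:
        return None
    return metadata.get(EMBEDDING_FUNCTION_KEY)
```

With a Chroma client, the tagged dict would be passed as `metadata=` to `get_or_create_collection`, and on retrieval the name read from `collection.metadata` would decide which embedding function to pass to `get_collection`. The client still has to map that name to an actual function itself.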
thanks @Dev317 and @Russell-Pollari - appreciate the added context and ideas!
@HammadB tagging you for now since this tracks with the CIP you've been working on
Just want to add to this issue that it's not well documented that the same embedding function has to be passed around.
Perhaps in the interim this could be referenced somewhere.