Clean up embedding functions

Open jeffchuber opened this issue 2 years ago • 5 comments

There are other issues that track this, but I'm opening a new one and will link those in.

The way we handle embedding functions is currently borked. Users have to pass a matching embedding function any time they call get_collection, and list_collections is even more broken.

We don't want to store embedding functions server-side, however.

One idea is an "embedding function registry": we store a string server-side, and the client uses it to automatically look up in the registry which embedding model should be used.

More discussion needed.

jeffchuber Jul 26 '23 05:07

I think I hit an embedding function mismatch when using list_collections. I was running chromadb in a Docker container as the server, then ran a simple Python script against it that created a new collection with the Vertex AI embedding function and retrieved all collections. When I printed the retrieved collections, I realised that the collection's embedding function had defaulted to ONNXMiniLM_L6_V2 when it should have been GoogleVertexEmbeddingFunction.

import os

import chromadb
from chromadb.utils import embedding_functions

# Connect to the Chroma server running in the Docker container.
CHROMA_CLIENT = chromadb.HttpClient(host="localhost", port=8000)

EMBEDDING_FUNCTION = embedding_functions.GoogleVertexEmbeddingFunction(
    api_key=os.getenv("API_KEY"),
    project_id=os.getenv("PROJECT_ID"),
)

# Create the collection with the Vertex AI embedding function.
new_collection_name = "abc"
new_collection = CHROMA_CLIENT.get_or_create_collection(
    name=new_collection_name,
    embedding_function=EMBEDDING_FUNCTION,
)

collections = CHROMA_CLIENT.list_collections()

result_collections = []
for collection in collections:
    # Observed: this prints the default ONNXMiniLM_L6_V2,
    # not GoogleVertexEmbeddingFunction.
    print(collection._embedding_function)
    embed_func = collection._embedding_function.__class__.__name__
    result_collections.append({
        "collection_name": collection.name,
        "collection_id": collection.id,
        "collection_embedding_function": embed_func,
    })

Dev317 Jul 26 '23 12:07

> One idea is an "embedding function registry" where we can store a string server-side that can "automatically lookup" client-side from the registry which embedding model should be used.

I was doing something like this while prototyping: storing an embedding_function string in the collection metadata.
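The metadata approach described above might look roughly like the sketch below. It is not the commenter's actual code. The metadata key `embedding_function`, the helper names, and the `registry` dict (name to zero-argument factory) are all assumptions made for illustration; the only Chroma calls used are `get_or_create_collection`, `get_collection`, and the collection's `metadata` attribute, and newer Chroma releases may handle embedding function mismatches differently.

```python
from typing import Any, Callable, Dict

# Hypothetical metadata key; Chroma assigns no special meaning to it.
EF_KEY = "embedding_function"


def create_collection_with_ef(client: Any, name: str, ef_name: str,
                              registry: Dict[str, Callable[[], Any]]) -> Any:
    """Create a collection and record which embedding function it uses."""
    return client.get_or_create_collection(
        name=name,
        metadata={EF_KEY: ef_name},
        embedding_function=registry[ef_name](),
    )


def get_collection_with_ef(client: Any, name: str,
                           registry: Dict[str, Callable[[], Any]]) -> Any:
    """Reopen a collection with the embedding function named in its metadata."""
    # First fetch is only to read the metadata; the embedding function
    # attached to this object is whatever the client defaults to.
    stored = client.get_collection(name=name)
    ef_name = (stored.metadata or {}).get(EF_KEY)
    if ef_name not in registry:
        raise LookupError(
            f"collection {name!r} names unknown embedding function {ef_name!r}"
        )
    # Second fetch attaches the matching embedding function.
    return client.get_collection(name=name,
                                 embedding_function=registry[ef_name]())
```

With a real client, `registry` could map a name such as `"vertex"` to a lambda that constructs `GoogleVertexEmbeddingFunction` with the caller's own credentials, so nothing but the name is stored on the server.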

Russell-Pollari Jul 26 '23 17:07

thanks @Dev317 and @Russell-Pollari - appreciate the added context and ideas!

jeffchuber Jul 28 '23 21:07

@HammadB tagging you for now since this tracks with the CIP you've been working on

jeffchuber Sep 06 '23 03:09

Just want to add to this issue that it's not well documented that the same embedding function has to be passed around.

Perhaps in the interim this could be documented somewhere.

alex-goswag Apr 09 '24 16:04