[Feature Request]: Check the response format from embedding functions and error if it's wrong
Describe the problem
I've seen a few users in Discord with mysterious-looking error messages deep in our code when they try to add -- most recently, KeyError: 0 in segment.py: https://discord.com/channels/1073293645303795742/1074711446589542552/1183352033902862366 . These are caused by an external embedding function not returning data in the format we expect, either because the embedding provider changed its output or because we changed what we expect.
We should have validation code which checks that externally provided embeddings conform to our expected format. We could give a much more helpful error message if not.
Describe the proposed solution
Code to make sure that embeddings are formatted how we expect. Probably in the base EmbeddingFunction class but I'm not attached to any particular implementation.
Alternatives considered
The current state of the world.
Importance
would make my life easier
Additional Information
No response
I would love to work on this, if possible can anyone guide me a little towards how I should go about this?
@hey-sagar start by looking at https://github.com/chroma-core/chroma/blob/main/DEVELOP.md if you haven't
@beggers How is a KeyError being raised when accessing an item on a List datatype in this problem?
Python type annotations are only hints. Nothing enforces them at runtime unless we check manually. In other words, we don't know that the data being accessed is actually a List.
Right now it's possible for someone to write an EmbeddingFunction which returns...anything. In our type hints we assume that EmbeddingFunctions return Lists because they're supposed to, but if a user writes one which doesn't they'll get the KeyError
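To illustrate the point above, here is a minimal sketch of how a return annotation of List can coexist with a dict being returned, and why indexing that dict with 0 raises KeyError rather than IndexError. The function name broken_embedding_function is hypothetical and is not part of Chroma.

```python
from typing import List


def broken_embedding_function(texts: List[str]) -> List[List[float]]:
    # Hypothetical embedding function: the annotation promises a list of
    # embeddings, but Python does not enforce it, so a dict can be returned.
    return {"error": "Model does not exist"}  # type: ignore[return-value]


result = broken_embedding_function(["hello"])

try:
    # Downstream code assumes a list and asks for the first embedding.
    # On a dict, [0] is a key lookup, so the failure is a KeyError.
    first = result[0]
except KeyError as exc:
    caught = repr(exc)
    print(caught)
```

A static type checker would flag the return statement, but nothing stops the code from running.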
Thanks @beggers for the reply. If I understand correctly, the embedding function in this particular instance, i.e. embedding_functions.HuggingFaceEmbeddingFunction, does return embeddings in the correct format. The problem was that an incorrect model_name was passed when instantiating the class, so the embedding function returned the following:
{
'error': 'Model BAAI_bge-base-en-v1.5 does not exist'
}
and hence caused the KeyError. Passing the correct model_name should give the correct result.
To that end, should we wrap the EmbeddingFunction.__call__ method in a decorator that validates the output of __call__ using the existing chromadb.api.types.validate_embeddings function? We can apply the decoration to subclasses either with a metaclass or by decorating the base class itself. What do you think?
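One way the decorator idea could look is sketched below. It makes several assumptions: check_embeddings is a hypothetical stand-in for a validator (it is not chromadb.api.types.validate_embeddings), BaseEmbeddingFunction is an illustrative class rather than Chroma's EmbeddingFunction, and the wrapping is done with __init_subclass__ instead of a metaclass, which is a substitution made here for brevity.

```python
import functools
from typing import Any, Callable, List


def check_embeddings(result: Any) -> List[List[float]]:
    # Hypothetical validator: require a list of lists of numbers.
    if not isinstance(result, list):
        raise ValueError(
            f"Expected embeddings to be a list, got {type(result).__name__}: {result!r}"
        )
    for embedding in result:
        if not isinstance(embedding, list) or not all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in embedding
        ):
            raise ValueError(f"Expected each embedding to be a list of numbers, got {embedding!r}")
    return result


def validated(call: Callable[..., Any]) -> Callable[..., Any]:
    # Decorator that validates whatever the wrapped __call__ returns.
    @functools.wraps(call)
    def wrapper(self: Any, texts: List[str]) -> List[List[float]]:
        return check_embeddings(call(self, texts))

    return wrapper


class BaseEmbeddingFunction:
    # Illustrative base class: every subclass that defines __call__
    # gets it wrapped automatically when the subclass is created.
    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        if "__call__" in cls.__dict__:
            cls.__call__ = validated(cls.__dict__["__call__"])


class GoodFunction(BaseEmbeddingFunction):
    def __call__(self, texts: List[str]) -> Any:
        return [[0.0, 1.0] for _ in texts]


class BadFunction(BaseEmbeddingFunction):
    def __call__(self, texts: List[str]) -> Any:
        return {"error": "Model does not exist"}
```

With this in place, GoodFunction()(["a"]) returns its embeddings unchanged, while BadFunction()(["a"]) raises a ValueError that names the unexpected type and includes the provider's error payload.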
@GauravWaghmare Thanks for your input. Even after changing the model name to the correct one, the KeyError persisted when the api_key was left empty, as it was in https://discord.com/channels/1073293645303795742/1074711446589542552/1183352785295654972. Is there a chance the api_key was left empty when the code was copied from the documentation? Providing my api_key in the same code worked as intended.
@hey-sagar The response of the embedding function when the API key is missing is
{
'error': "Authorization header is invalid, use 'Bearer API_TOKEN'"
}
which would raise the same KeyError. I have raised a pull request to validate the response and throw a more informative error: https://github.com/chroma-core/chroma/pull/1615 Please review.
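As a rough illustration of response-level validation (this is not the content of the pull request, and parse_embedding_response is a hypothetical helper name), the parsed JSON could be checked for an error payload before anything indexes into it:

```python
from typing import Any, List


def parse_embedding_response(payload: Any) -> List[List[float]]:
    # Hypothetical helper: surface the provider's own error message
    # instead of letting a later payload[0] fail with KeyError: 0.
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"Embedding provider returned an error: {payload['error']}")
    if not isinstance(payload, list):
        raise RuntimeError(
            f"Unexpected embedding response of type {type(payload).__name__}"
        )
    return payload


# The payload quoted above for a missing API key.
missing_key = {"error": "Authorization header is invalid, use 'Bearer API_TOKEN'"}

try:
    parse_embedding_response(missing_key)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

The same check would also catch the wrong model_name case, since that response is an error dict as well.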