chroma icon indicating copy to clipboard operation
chroma copied to clipboard

[Feature Request]: Split up `embedding_functions.py`

Open atroyn opened this issue 1 year ago • 5 comments

Describe the problem

The embedding functions module is a single file which is getting really unwieldy. https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions.py

For example, it's difficult to land PR's like https://github.com/chroma-core/chroma/pull/1447 because they need auxiliary utilities, which probably shouldn't live in a global module.

Describe the proposed solution

Split up the file into one file per EF, which allows us to bundle the required utilities / helpers with each function separately.

Alternatives considered

Not doing this, which I think we will eventually have to do.

Importance

would make my life easier

Additional Information

No response

atroyn avatar Apr 03 '24 02:04 atroyn

Hi @atroyn hope you are doing great. I was looking for something else and suddenly landed in this issue. It will be my first contribution to this repo, so I'm fairly new to the code base. However, the changes feel pretty straightforward, are you happy for me to pick this up?

nablabits avatar Apr 06 '24 08:04 nablabits

Hi @nablabits, happy for you to take a crack at it. Please add me as a reviewer on your PR when it's ready.

atroyn avatar Apr 09 '24 00:04 atroyn

Hey @atroyn, just a quick update on this as I'm halfway through the changes to make sure we are on the same page:

Key Bits

  • I'm creating a module called embedding_functions whose __init__ file will import the classes contained in the individual files. This way we won't break the existing implementations of the users. Something like this:
chromadb/utils
├── embedding_functions
│   ├── __init__.py
│   ├── amazon_bedrock_embedding_function.py
│   ├── cohere_embedding_function.py
│   ├── google_embedding_function.py
│   ├── huggingface_embedding_function.py
  • I've placed the classes of google —GooglePalmEmbeddingFunction, GoogleGenerativeAiEmbeddingFunction, GoogleVertexEmbeddingFunction— and, HF —HuggingFaceEmbeddingFunction, HuggingFaceEmbeddingServer— in the same file
  • It feels to me that DefaultEmbeddingFunction should live in __init__ but I don't have a strong opinion on that.
  • I see that there are few tests for the embeddings. I appreciate that it's hard because they involve imports that are not required by requirements.txt but we can still test things like these exceptions. I'm happy to address them unless you tell me otherwise.

Other Nits

  • To sort the imports I will use isort (which is not a requirement) unless you tell me otherwise.
  • I saw that there are docstrings and strings over the 88 chars defined for black which is expected because black makes no effort in reflowing them. I'm a bit paranoid about that, so unless you tell me otherwise I will reflow them to fit the limitation.
  • I have spotted a few other super minor things like this useless try block as requests is a requirement or this None assignment to a str

PR approach

This PR will contain a great deal of changes so to facilitate it what I have in mind is:

  • Create a commit with the new module (empty)
  • Create a commit for each moved function so it will be easy to check that they contain exactly the same content
  • I will do all above skipping pre-commit hooks as they were complaining about something, so next I will create a commit with the pre-commit cleanings
  • Then I will address the extra tests, and the other nits

Sorry for this bible up here :sweat_smile:

nablabits avatar Apr 13 '24 08:04 nablabits

Hi @nablabits - this seems like the right approach to me. Another approach to producing clean self-contained changes might be to use stacked PRs with something like https://graphite.dev/ (we use this at Chroma), but I am also happy to follow along with your approach for stacking commits.

I agree with default living in __init__ - all it should do however is import the actual default EF from its own file, to make sure this stays decoupled and we're not doing any actual EF stuff in __init__

Regarding complaining from the pre-commit hooks, it's likely that a lot of those complaints are due to type errors. If we can also clean up the type errors in the EFs as part of these changes that would be a huge W.

atroyn avatar Apr 15 '24 17:04 atroyn

Oh wow, thanks for sharing, this graphite thing looks really neat and the learning curve seems not very steep. I'm not quite sure, though, whether this issue is the best place for to use this tool for the first time as I may end up messing up everything :grimacing: so, I'd say we are better off if I try this tool on some personal project and with a smaller piece of work and then on future contributions :crossed_fingers: , I can open those stacked PRs as needed.

Sure thing, each EF will live in their own separate file :+1:

nablabits avatar Apr 16 '24 05:04 nablabits