chroma
chroma copied to clipboard
[ENH] 1965 Split up embedding functions
Description of changes
- It addresses this issue: https://github.com/chroma-core/chroma/issues/1965
There are a number of things going on in this PR. I've split the changes in meaningful commits to facilitate the review.
- The first commit creates the empty module and renames the old
embedding_functions.pyffc5e9135e1e5cc471849a8cf4fc7607d287a4ff - Then, there's one commit per embedding function moved so the reviewer can check side by side that the contents were moved word by word.
- Above was made skipping the linters to avoid noise, so after them there's a commit that lints the files which should be pretty innocuous . 6ad759829df125d17e6b26d20ebb987b70861705
- However, the linting over the onnx embedding function felt quite sensible to me, so I decided to put it in a separate commit 8f08d60ebdfbe8aa7f6ef37c03a6ca1ffceb6a0e
Besides, there are a few docstrings and some tests that I'm working on as discussed here for which I will open a follow up PR to avoid noise here.
@atroyn, can you please take a look?
Test plan
How are these changes tested? I launched a couple of times the whole test suite finding that they took a lot of time to complete (in some 1h I had only covered 35%). The second time I launched them, the computer even died, but not really sure if it was because of these tests only or because something else happening at that time.
I didn't know that there were tests for JS or rust, I will run them next and tick the checkboxes as appropriate. I've put the outcomes in this comment
- [x] Tests for this feature pass locally with
pytestfor python, - [ ] The whole test suite pass locally.
- [ ]
yarn testfor js, - [ ]
cargo testfor rust
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
In principle all the docstrings are as they were, and users' implementations won't be affected. However, as said before, I'm working on a follow up PR that will add a few more tests and edit docstrings where appropriate.
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving
Testing, Bugs, Errors, Logs, Documentation
- [ ] Can you think of any use case in which the code does not behave as intended? Have they been tested?
- [ ] Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
- [ ] If appropriate, are there adequate property based tests?
- [ ] If appropriate, are there adequate unit tests?
- [ ] Should any logging, debugging, tracing information be added or removed?
- [ ] Are error messages user-friendly?
- [ ] Have all documentation changes needed been made?
- [ ] Have all non-obvious changes been commented?
System Compatibility
- [ ] Are there any potential impacts on other parts of the system or backward compatibility?
- [ ] Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?
Quality
- [ ] Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)
I've been running some tests with some disappointing outcomes :confused:
- My local version of node is complaining with yarn.
cargoseems not fully compiling on my ubuntu:
Error: Custom { kind: Other, error: "protoc failed: chromadb/proto/chroma.proto: This file contains proto3 optional fields, but --experimental_allow_proto3_optional was not set.\n" }
- Finally,
pytestmanaged to run in some 2.5h but it threw several failures:
FAILED chromadb/test/property/test_collections_with_database_tenant_overwrite.py::test_collections_with_tenant_database_overwrite - exceptiongroup.BaseExceptionGroup: Hypothesis found 2 distinct failures. (2 sub-exceptions)
FAILED chromadb/test/property/test_collections_with_database_tenant_overwrite.py::test_repeat_failure - chromadb.errors.InvalidUUIDError: Could not parse A00 as a UUID
FAILED chromadb/test/stress/test_many_collections.py::test_many_collections[sqlite_persistent] - OSError: [Errno 24] Too many open files: '/tmp/tmp_g4jh1g1/07f976f6-1d15-47b6-ab12-e1e7ad2def5d/index_metadata.pickle'
FAILED chromadb/test/test_api.py::test_persist_index_loading[local_persist_api] - sqlite3.OperationalError: unable to open database file
FAILED chromadb/test/test_api.py::test_persist_index_loading_embedding_function[local_persist_api] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_get_or_create_embedding_function[local_persist_api] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist[local_persist_api] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[fastapi] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[fastapi_persistent] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[sqlite] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[sqlite_persistent] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
For the first two I prefer not dealing with those things for the time being, unless someone tells me so. Regarding the second, I will take a look this week but it does not seem particulary related to my changes although after those failures there could be some assertion pointing to them :thinking:
Update: 2024-04-24 Today I spent some time checking the python failing tests:
- Weirdly, the
test_apiones passed without issues which make me think of either a one-off failure or some persistence in the pytest session: - Regarding
test_many_collectionsI had to stop it when it arrived to 11Gb of RAM :sweat_smile: which may explain why the first time my computer died. I'm keen to put some warning in Develop.md if we agree that's a sensible thing to do. - In terms of the
property/test_collections_*they seem related to thischromadb.errors.InvalidUUIDError: Could not parse A00 as a UUID. One of them takes 36" to run which is a bit of a pain to debug, so if anyone can shed some light it would be greatly appreciated.
I think this is on the right path. There is a way to go, however.
1. Move the embedding_function dir at the top level, and maybe for conciseness and alignment with the rest of the ecosystem, call it `embeddings`. 2. Making such a significant change without the requisite testing is inappropriate. It is very difficult to compare each EF to its previous implementation in this way. I also see that you made some decisions to change things around, which makes reasoning about the changes in this PR even harder. 3. Stacking these changes might have made it a little easier to review.
Hi @tazarov, thanks for your feedback, greatly appreciated 🙂
For clarity, I'm going to put my thoughts here instead of replying the other comment you left, hope it makes sense.
First of all, I'm deeply sorry that you found the PR hard to make sense as I made every effort to make it as clear as possible 😞 splitting each EF into its own commit, so in my mind, the reviewer could put side by side two github windows with, say, this commit and easily check that the functions were moved word by word. I appreciate that if you use graphite —or gitkraken like me— you may not even go to github, sorry about that. I was advised that I could use this stacking process, but I dismissed the idea as this PR might not be the best place to test a new tool and stuck to a process I know. I could have open, though, different PRs manually as this is something that I've done in the past, but for this PR I find that a bit overkilling if you don't have a tool like graphite that, appears to me, does the heavy lifting for you.
I'm a bit confused as to why do you think the __init__.py will break existing implementations. In my experience this is what you do when you have to move a file to a module but the alternative you propose will also work. You can check that this approach works as this test file wasn't touched by this PR and it still holds. I'm happy with your approach but maybe you guys want to come to grips as to how do you want the code look as here we agreed the __init__.py approach.
Regarding the tests I totally agree with you, I found this whole thing very poorly tested, e.g. this OpenAIEmbeddingFunction does not have a single test, that's why I was planning to open a separate PR with all the missing tests even when, arguably, they are out of the scope of the issue itself. Still, I personally checked that all the tests in chromadb/test/ef/ are passing. The only ones failing locally for me are these but I didn't bother very much with passing all the test suite locally as I saw that you have a bunch of Github actions doing the full chore for you and for more platforms than my specific Ubuntu. Now I realise that the workflows need an a approval of a maintainer but in principle they should tell us what the failures are. Regarding the rust and js ones, tbh, is not in my short term plans to learn those things, that's why I prefer not to invest time on them if possible as they are ad hocs for me, hope this makes sense.
Sorry for all this lengthy reply :sweat_smile:, let me know what you think. Meanwhile, I will address these two bits:
@tazarov I'll take over review of this one, @nablabits and I have been working on it on https://github.com/chroma-core/chroma/issues/1965 and I've agreed to his approach (as you can see on the discussion on the issue)
@nablabits I'll allocate time to review this in depth in the next day or two, thanks for the changes here!
Hi @atroyn thanks for your input!
Before you jump onto a deep review, let me add something: I certainly see a lot of value on what @tazarov says about the tests, would we be better off writing them upfront at least to ensure that some broad assumptions hold? I'm meaning something like this where we can ensure that:
- One can call a given embedding function
- It does return a list of floats
This is how this whole thing looks in my mind:
- We create a PR with tests like that of the example out of current
main(now they are branching off this branch) - Potentially we can address the linter issues in that PR.
- We merge them onto
mainonce everything looks good. - We merge
mainonto this branch and pull out our best prays in order to have them working
Does this seem reasonable in your end?
@nablabits How would we test the EFs which require API keys? How about the ones which require packages which are not installed? And what would this test beyond the implementation of the EmbeddingFunction protocol?
I don't think it's realistic for us to write these tests, and may be up to their contributors to do so.
@atroyn Yeah, totally agree, ideally the contributors should have provided the tests or provide them in the future, by having their dependencies locally and, for example, recording a cassette with pytest-vcr with the relevant call to the API. Then all that can be put behind a skip like this folk so local implementations won't require the external lib although having them on the workflows may well be desirable.
Yet, at the very least, we might want to have a simple test that ensures that the given embedding function does exist, and its __call__ method can be called with some specific set of args. Then, everything else can be mocked as you can see in the test I wrote. This is what I meant by broad assumptions.
But not pressing too hard, if you are happy with moving all these functions, then I'm happy :smiley:
I think we can let it rip. Let's do the move.
@nablabits do you still plan to wrap this up? Is it currently ready for review?
The latest updates on your projects. Learn more about Vercel for Git ↗︎
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| chroma | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 20, 2024 9:09pm |
@nablabits do you still plan to wrap this up? Is it currently ready for review?
Hi @atroyn I actually had a task for myself today to checkin with this. Sure, the work is ready to be reviewed, it appeared to me that the work on the tests was not really needed by this comment.
I've just ffd the branch to resolve the conflicts as this folk had linted the old embedding_functions.py bringing it back to the history.
Great, thank you! Let me review changes from here and we can get this landed.
In the meantime if you have the stomach for it, this might be a great follow-up https://github.com/chroma-core/chroma/issues/2128
In the meantime if you have the stomach for it, this might be a great follow-up #2128
Well, let's see what @BalasubramanyamEvani says as he seemed interested on the issue. If he has moved on, I'm happy to handle it.
@nablabits @atroyn,
We have a large backlog of new EFs and bugfixes:
- https://github.com/chroma-core/chroma/pull/1271
- https://github.com/chroma-core/chroma/pull/1327
- https://github.com/chroma-core/chroma/pull/1625
- https://github.com/chroma-core/chroma/pull/1675
- https://github.com/chroma-core/chroma/pull/2000
- https://github.com/chroma-core/chroma/pull/1977
- https://github.com/chroma-core/chroma/pull/2049
- https://github.com/chroma-core/chroma/pull/2182
- https://github.com/chroma-core/chroma/pull/1723
- https://github.com/chroma-core/chroma/pull/1986
- https://github.com/chroma-core/chroma/pull/1915
- https://github.com/chroma-core/chroma/pull/1791
- https://github.com/chroma-core/chroma/pull/1892
Note: There may be a few others that I have missied
This is a solid piece of work that I think we must not ignore, as community members have contributed time and effort to making these PRs. I think we should first resolve all EF-related PRs up to now and then proceed with this refactoring without triggering a relatively large rework for all contributors and reviewers involved.
@tazarov this PR is intended to make landing those other ones (and future ones) easier, and since they are purely additive it's a simple change which we can do ourselves to for the outstanding new EFs. Fine to land this first then take a look over those others.
Hi @atroyn, just a gently reminder that this folk is dangling 🙂. Let me know if you want me to do something else to land it.
@nablabits will be landing this week!
@atroyn is attempting to deploy a commit to the chromacore Team on Vercel.
A member of the Team first needs to authorize it.
We are good to merge! I made a few more changes:
-
the top-level
embedding_functionsmodule now dynamically imports anything that's a sub-class ofEmbeddingFunctionfrom any file in the same directory, and also adds it to the builtins list. This means future additions will be automatic here and won't need to be edited by contributors. -
merged in the
httpxoverrequestchanges from main -
Made sure
ChromaLangchainEmbeddingFunctionis in the builtins list; it previously wasn't because of the way we use a factory to construct them. Here we just hardcode adding them. -
Added tests to make sure that the right things were being imported, and all the things which were imported were the right type.
Thank you @nablabits for your hard work and patience!
My changes broke on python versions 3.9, 3.10 - reverting and fixing.
We are good to merge! I made a few more changes:
* the top-level `embedding_functions` module now dynamically imports anything that's a sub-class of `EmbeddingFunction` from any file in the same directory, and also adds it to the builtins list. This means future additions will be automatic here and won't need to be edited by contributors. * merged in the `httpx` over `request` changes from main * Made sure `ChromaLangchainEmbeddingFunction` is in the builtins list; it previously wasn't because of the way we use a factory to construct them. Here we just hardcode adding them. * Added tests to make sure that the right things were being imported, and all the things which were imported were the right type.Thank you @nablabits for your hard work and patience!
@atroyn
Yay! Finally :rocket: To be honest it has been a great learning experience for me, thank you for the opportunity and the trust, greatly appreciated!! BTW, that dynamic importing looks really cool —despite of the issues on the IDEs—
We will find a way to make it work on IDEs, as a first cut @tazarov will take a look!