Python: feat: add chroma memory store
Motivation and Context
solve issue #403 this is reopened PR #426
Description
I've added one quick E2E test example without adding any unit tests.
I tried to make it as identical to the original classes(VolatileDataStore & VolatileMemoryStore) as possible.
This PR has 3 issues.
- collection name does not allow upper cases(https://github.com/chroma-core/chroma/issues/237). I've written a snake case to replace it inside the current class.
- get_nearest_matches_async function can be more complicated if your team wants to calculate relevance score like langchain.
- I didn't add requirements (
pip install chromadb)
Contribution Checklist
- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [ ] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with
dotnet format - [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone :smile:
Hi @joowon-dm-snu
Tried this out with the memory notebook and I have a couple questions. By default, if you replace the VolatileMemoryStore() with ChromaMemoryStore(), the memory store uses embedded DuckDB and I'm pretty confident that it's downloading an embedding model from sentence-transformers - ie the embeddings are not coming from the Kernel, but being generated by chroma infrastructure. Presumably this behavior can be changed using a chromadb settings file?
At minimum, we should make sure to document the default behavior, but ideally the base ChromaMemoryStore() without additional settings should use embeddings from the kernel.
@awharrison-28
At minimum, we should make sure to document the default behavior, but ideally the base ChromaMemoryStore() without additional settings should use embeddings from the kernel.
Agree with documentation! However, I didn't write the documentation because the semantic-kernel is a fully managed by MS, and I was worried that it wouldn't be adopted even if I wrote the official documentation. this is same for unit-tests
If you don't mind me writing it, and tell me what you want it to contain, I can write it :)
@awharrison-28
At minimum, we should make sure to document the default behavior, but ideally the base ChromaMemoryStore() without additional settings should use embeddings from the kernel.
Agree with documentation! However, I didn't write the documentation because the semantic-kernel is a fully managed by MS, and I was worried that it wouldn't be adopted even if I wrote the official documentation. this is same for unit-tests
If you don't mind me writing it, and tell me what you want it to contain, I can write it :)
We are pro-documentation!
Can you add a description of how ChromaMemoryStore works at the top of ChromaMemoryStore.py. Mentioning things like the kernel generates the embeddings and the decision to custom calculate similarity instead of using ChromaDB nearest neighbor implementation are great things to add. @mkarle @alexchaomander would you agree that this is a good place to add implementation detail comments?
Yup +1 to putting directly in the comments the implementation details if we have to make certain choices on how to use Chroma with the SK.
@joowon-dm-snu @awharrison-28 Do we need to add a line to the requirements to pip install Chroma?
If there are concerns for not putting too much bloat into the SK package, maybe put an ImportError message when using the Chroma memory store saying to run pip install chromadb?
@joowon-dm-snu I take it back. chromadb should be added to the requirements.txt and the poetry dependencies, otherwise the unit tests will always fail.
You add it to poetry with poetry add chromadb
@joowon-dm-snu I take it back. chromadb should be added to the requirements.txt and the poetry dependencies, otherwise the unit tests will always fail.
You add it to poetry with
poetry add chromadb
I think it will be better to make test_integration dependencies group in poetry. hope these integration dependencies will be decoupled later :)
@joowon-dm-snu
Would you be willing to start the test_integration dependencies group in poetry?
@joowon-dm-snu heads up, Python SK memory is getting a refactor -> major simplification of classes. If this PR goes through first, I'll take point on updating Chroma memory store.
https://github.com/microsoft/semantic-kernel/pull/684
@joowon-dm-snu
Would you be willing to start the test_integration dependencies group in poetry?
okay, i think splitting dependencies should be merged first. I'll take a look when I get to work
@joowon-dm-snu heads up, Python SK memory is getting a refactor -> major simplification of classes. If this PR goes through first, I'll take point on updating Chroma memory store.
No problem!
@awharrison-28 i do some searches to fix installation of chromadb for macos-latest, python3.11 but couldn't find good solution
changing semantic-kernel/.github/workflows/python-test.yml like below raise other clang errors
- name: Setup envs
if: matrix.os == 'macos-latest' && matrix.python-version == '3.11'
run: |
export POETRY_HNSWLIB_NO_NATIVE=1
python -m pip wheel --use-pep517 "hnswlib (==0.7.0)"
cd python && poetry install --with test_integration
Currently reviewing to ensure in line with both SK versions.
@tawalke
Currently reviewing to ensure in line with both SK versions.
this PR might need change as memory store changed by #684.
If your team wants to me to done with this PR, i can take it.
But installation of chromadb for macos-latest, python3.11 should be solved first to pass Python tests / build (3.11, macos-latest) (pull_request)
@tawalke
Currently reviewing to ensure in line with both SK versions.
this PR might need change as memory store changed by #684. If your team wants to me to done with this PR, i can take it. But installation of
chromadbformacos-latest, python3.11should be solved first to passPython tests / build (3.11, macos-latest) (pull_request)
@awharrison-28 Per your last comment re; your update of chroma memory based upon memory_base update.; should the current review be paused?
@joowon-dm-snu would you be open to me updating this PR directly to update it to be consistent with https://github.com/microsoft/semantic-kernel/pull/684.
I'll also have a PR out shortly to solidify how we handle conditional dependencies.
@joowon-dm-snu would you be open to me updating this PR directly to update it to be consistent with #684.
I'll also have a PR out shortly to solidify how we handle conditional dependencies.
Okay I get it, I'll handle it this week :)
reopening and updating
@awharrison-28 ah my forked branch was so twisted so I rebranched it.