semantic-kernel icon indicating copy to clipboard operation
semantic-kernel copied to clipboard

Python: feat: add chroma memory store

Open joowon-dm-snu opened this issue 2 years ago • 16 comments

Motivation and Context

solve issue #403 this is reopened PR #426

Description

I've added one quick E2E test example without adding any unit tests. I tried to make it as identical to the original classes(VolatileDataStore & VolatileMemoryStore) as possible.

This PR has 3 issues.

  • collection name does not allow upper cases(https://github.com/chroma-core/chroma/issues/237). I've written a snake case to replace it inside the current class.
  • get_nearest_matches_async function can be more complicated if your team wants to calculate relevance score like langchain.
  • I didn't add requirements (pip install chromadb)

Contribution Checklist

  • [ ] The code builds clean without any errors or warnings
  • [ ] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
  • [ ] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
  • [ ] All unit tests pass, and I have added new tests where possible
  • [ ] I didn't break anyone :smile:

joowon-dm-snu avatar Apr 14 '23 06:04 joowon-dm-snu

Hi @joowon-dm-snu

Tried this out with the memory notebook and I have a couple questions. By default, if you replace the VolatileMemoryStore() with ChromaMemoryStore(), the memory store uses embedded DuckDB and I'm pretty confident that it's downloading an embedding model from sentence-transformers - ie the embeddings are not coming from the Kernel, but being generated by chroma infrastructure. Presumably this behavior can be changed using a chromadb settings file?

At minimum, we should make sure to document the default behavior, but ideally the base ChromaMemoryStore() without additional settings should use embeddings from the kernel.

awharrison-28 avatar Apr 17 '23 18:04 awharrison-28

@awharrison-28

At minimum, we should make sure to document the default behavior, but ideally the base ChromaMemoryStore() without additional settings should use embeddings from the kernel.

Agree with documentation! However, I didn't write the documentation because the semantic-kernel is a fully managed by MS, and I was worried that it wouldn't be adopted even if I wrote the official documentation. this is same for unit-tests

If you don't mind me writing it, and tell me what you want it to contain, I can write it :)

joowon-dm-snu avatar Apr 18 '23 06:04 joowon-dm-snu

@awharrison-28

At minimum, we should make sure to document the default behavior, but ideally the base ChromaMemoryStore() without additional settings should use embeddings from the kernel.

Agree with documentation! However, I didn't write the documentation because the semantic-kernel is a fully managed by MS, and I was worried that it wouldn't be adopted even if I wrote the official documentation. this is same for unit-tests

If you don't mind me writing it, and tell me what you want it to contain, I can write it :)

We are pro-documentation!

Can you add a description of how ChromaMemoryStore works at the top of ChromaMemoryStore.py. Mentioning things like the kernel generates the embeddings and the decision to custom calculate similarity instead of using ChromaDB nearest neighbor implementation are great things to add. @mkarle @alexchaomander would you agree that this is a good place to add implementation detail comments?

awharrison-28 avatar Apr 18 '23 21:04 awharrison-28

Yup +1 to putting directly in the comments the implementation details if we have to make certain choices on how to use Chroma with the SK.

alexchaomander avatar Apr 18 '23 22:04 alexchaomander

@joowon-dm-snu @awharrison-28 Do we need to add a line to the requirements to pip install Chroma?

If there are concerns for not putting too much bloat into the SK package, maybe put an ImportError message when using the Chroma memory store saying to run pip install chromadb?

alexchaomander avatar Apr 24 '23 17:04 alexchaomander

@joowon-dm-snu I take it back. chromadb should be added to the requirements.txt and the poetry dependencies, otherwise the unit tests will always fail.

You add it to poetry with poetry add chromadb

awharrison-28 avatar Apr 26 '23 00:04 awharrison-28

@joowon-dm-snu I take it back. chromadb should be added to the requirements.txt and the poetry dependencies, otherwise the unit tests will always fail.

You add it to poetry with poetry add chromadb

I think it will be better to make test_integration dependencies group in poetry. hope these integration dependencies will be decoupled later :)

joowon-dm-snu avatar Apr 26 '23 04:04 joowon-dm-snu

@joowon-dm-snu

Would you be willing to start the test_integration dependencies group in poetry?

awharrison-28 avatar Apr 26 '23 21:04 awharrison-28

@joowon-dm-snu heads up, Python SK memory is getting a refactor -> major simplification of classes. If this PR goes through first, I'll take point on updating Chroma memory store.

https://github.com/microsoft/semantic-kernel/pull/684

awharrison-28 avatar Apr 26 '23 23:04 awharrison-28

@joowon-dm-snu

Would you be willing to start the test_integration dependencies group in poetry?

okay, i think splitting dependencies should be merged first. I'll take a look when I get to work

@joowon-dm-snu heads up, Python SK memory is getting a refactor -> major simplification of classes. If this PR goes through first, I'll take point on updating Chroma memory store.

No problem!

joowon-dm-snu avatar Apr 27 '23 01:04 joowon-dm-snu

@awharrison-28 i do some searches to fix installation of chromadb for macos-latest, python3.11 but couldn't find good solution

changing semantic-kernel/.github/workflows/python-test.yml like below raise other clang errors

- name: Setup envs
  if: matrix.os == 'macos-latest' && matrix.python-version == '3.11'
  run: |
    export POETRY_HNSWLIB_NO_NATIVE=1
    python -m pip wheel --use-pep517 "hnswlib (==0.7.0)"
    cd python && poetry install --with test_integration

joowon-dm-snu avatar Apr 27 '23 09:04 joowon-dm-snu

Currently reviewing to ensure in line with both SK versions.

tawalke avatar May 01 '23 15:05 tawalke

@tawalke

Currently reviewing to ensure in line with both SK versions.

this PR might need change as memory store changed by #684. If your team wants to me to done with this PR, i can take it. But installation of chromadb for macos-latest, python3.11 should be solved first to pass Python tests / build (3.11, macos-latest) (pull_request)

joowon-dm-snu avatar May 01 '23 16:05 joowon-dm-snu

@tawalke

Currently reviewing to ensure in line with both SK versions.

this PR might need change as memory store changed by #684. If your team wants to me to done with this PR, i can take it. But installation of chromadb for macos-latest, python3.11 should be solved first to pass Python tests / build (3.11, macos-latest) (pull_request)

@awharrison-28 Per your last comment re; your update of chroma memory based upon memory_base update.; should the current review be paused?

tawalke avatar May 01 '23 17:05 tawalke

@joowon-dm-snu would you be open to me updating this PR directly to update it to be consistent with https://github.com/microsoft/semantic-kernel/pull/684.

I'll also have a PR out shortly to solidify how we handle conditional dependencies.

awharrison-28 avatar May 04 '23 00:05 awharrison-28

@joowon-dm-snu would you be open to me updating this PR directly to update it to be consistent with #684.

I'll also have a PR out shortly to solidify how we handle conditional dependencies.

Okay I get it, I'll handle it this week :)

joowon-dm-snu avatar May 07 '23 17:05 joowon-dm-snu

reopening and updating

awharrison-28 avatar May 11 '23 18:05 awharrison-28

@awharrison-28 ah my forked branch was so twisted so I rebranched it.

joowon-dm-snu avatar May 11 '23 19:05 joowon-dm-snu