chroma
Decoupled Architecture refactor
Description of changes
Begin preparing the codebase for a model that supports decoupled reads and writes.
- [x] Update config system to allow dependency injection via class name
- [x] Create new interfaces for SysDB, segment, and segment management code.
- [x] Update DB method signatures to support append-only and soft delete
- [x] Migration system for SQL-based databases
- [ ] Build new implementation of chromadb.api.API to utilize new interfaces
- [ ] Provide concrete implementations for all interfaces as needed to replicate current library-only and single server deployment models
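As a rough illustration of the first checklist item, dependency injection via class name can be sketched as below. This is a minimal sketch, not the code in this PR: `resolve_class`, `Settings`, and the `sysdb` key are hypothetical names, and a standard-library class stands in for a real implementation.

```python
import importlib
from typing import Any, Dict


def resolve_class(fqn: str) -> type:
    """Import and return the class named by a dotted path like 'pkg.mod.Class'."""
    module_name, _, class_name = fqn.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a fully qualified class name, got {fqn!r}")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    if not isinstance(cls, type):
        raise TypeError(f"{fqn!r} does not name a class")
    return cls


class Settings:
    """Hypothetical settings object: maps a component key to a class name."""

    def __init__(self, **impls: str) -> None:
        self._impls: Dict[str, str] = dict(impls)
        self._instances: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        """Instantiate the configured class once and cache the instance."""
        if key not in self._instances:
            self._instances[key] = resolve_class(self._impls[key])()
        return self._instances[key]


# A standard-library class stands in for a real SysDB implementation here.
settings = Settings(sysdb="collections.OrderedDict")
sysdb = settings.get("sysdb")
```

Swapping the implementation then only requires changing the string in the configuration, which is what makes the library-only and single-server deployments selectable without code changes.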
Interfaces & Data Flow

Test plan
Unit and integration tests continue to pass.
Documentation Changes
TBD on how much we should try to mitigate breaking changes. The user-facing API can remain 100% consistent, however the schema for the DuckDB and Clickhouse databases may need to change in breaking ways. We do not yet have a system for DB migrations or versioned schema.
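For discussion, one common shape for versioned schema is a table recording which migrations have been applied. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for DuckDB or Clickhouse; the `schema_migrations` table, the `embeddings` table, and its columns are all hypothetical.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical ordered migrations: (version, SQL). All names are illustrative.
MIGRATIONS: List[Tuple[int, str]] = [
    (1, "CREATE TABLE embeddings (id TEXT PRIMARY KEY, vector BLOB)"),
    (2, "ALTER TABLE embeddings ADD COLUMN is_deleted INTEGER NOT NULL DEFAULT 0"),
]


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied migration version, or 0 for a fresh database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection) -> List[int]:
    """Apply every migration newer than the recorded version, in order."""
    applied = []
    start = current_version(conn)
    for version, sql in MIGRATIONS:
        if version <= start:
            continue
        with conn:  # commit the schema change together with its version record
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )
        applied.append(version)
    return applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies versions 1 and 2
second_run = migrate(conn)  # nothing left to apply
```

Running `migrate` twice is safe because the second call finds every version already recorded.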
@HammadB just bouncing something off you. Most of the code up until now presumes either UUIDs or strings for embedding IDs.
Managing multiple types across multiple databases was super cumbersome, so I went ahead and switched everything to the lowest common denominator (strings).
If we really want to save space, we can still store UUIDs as base85 strings: 20 characters (160 bits), which isn't too much worse than their native 128 bits.
Given that embedding IDs only truly need to be unique within a collection/segment, I think strings are an OK choice.
LMK what you think.
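To make the base85 arithmetic concrete, here is a small sketch using Python's standard `base64` and `uuid` modules. The helper names `uuid_to_b85` and `b85_to_uuid` are made up for illustration and are not part of the codebase.

```python
import base64
import uuid


def uuid_to_b85(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as a 20-character base85 string."""
    return base64.b85encode(u.bytes).decode("ascii")


def b85_to_uuid(s: str) -> uuid.UUID:
    """Decode a base85 string produced by uuid_to_b85 back into a UUID."""
    return uuid.UUID(bytes=base64.b85decode(s))


example = uuid.uuid4()
encoded = uuid_to_b85(example)

# 16 bytes -> 20 ASCII characters, i.e. 160 bits when stored one byte per
# character, versus 36 characters for the canonical hyphenated form.
print(len(encoded), len(str(example)))
```

One caveat: the base85 alphabet used by `b85encode` includes punctuation, so these strings need normal parameter binding rather than string interpolation when used in SQL.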
I'd still prefer to use the underlying numeric representation in the places that support it, though that may be a needless optimization. If we ever need to do seq-scan equivalents over the ID, we would also benefit from vectorized comparisons.
I'll leave it up to you, as I don't know whether it will matter much at all; it probably won't.
We can leave it as strings if that's easier for now. Out of curiosity, what does support for storing UUIDs look like across the various DBs?
(Sent by email in reply to Luke VanderHart's comment of Mon, Mar 13, 2023, 6:46 AM.)
-- Hammad Bashir EECS | Cal
@HammadB most DBs have a dedicated UUID type or at least fixed-length binary data (which is effectively the same.) Strangely though, CrateDB is not one of them: we'd have to encode a UUID as a string to store it there anyway.
Cool, so it's CrateDB and DuckDB that don't support it. Got it!
(Sent by email in reply to Luke VanderHart's comment of Mon, Mar 13, 2023, 7:40 AM.)
Top-level question: how are segment types assigned to collections? Are you imagining it as a data-responsive process in the placement engine? That's what I am assuming, so I want to confirm.
I see this as the primary axis along which the system will evolve.
We'll start very simple: one segment per collection, assigned locally in one of the impls (probably the stream reader).
Eventually we'll have the full distributed architecture with a dedicated supervisor process creating segments for different partitions/shards dynamically in response to load, tuning parameters and user hints.
There will probably be some steps in between too.
In all cases the SysDB will be the source of truth about what segments exist and where reads and writes should be directed.
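A minimal sketch of that starting point, assuming an in-memory SysDB and a one-segment-per-collection policy. Every name here (`Segment`, `InMemorySysDB`, `segment_for_write`, the `"local-vector"` type) is hypothetical and only illustrates the idea that callers ask the SysDB where to go rather than deciding themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Segment:
    id: UUID
    collection: str
    segment_type: str  # hypothetical label, e.g. "local-vector"


@dataclass
class InMemorySysDB:
    """Stand-in SysDB: the only place that records which segments exist."""

    _segments: Dict[str, List[Segment]] = field(default_factory=dict)

    def create_segment(self, collection: str, segment_type: str) -> Segment:
        segment = Segment(uuid4(), collection, segment_type)
        self._segments.setdefault(collection, []).append(segment)
        return segment

    def get_segments(self, collection: str) -> List[Segment]:
        return list(self._segments.get(collection, []))


def segment_for_write(
    sysdb: InMemorySysDB, collection: str, default_type: str = "local-vector"
) -> Segment:
    """Simplest policy: one segment per collection, created on first use."""
    existing = sysdb.get_segments(collection)
    if existing:
        return existing[0]
    return sysdb.create_segment(collection, default_type)


sysdb = InMemorySysDB()
first = segment_for_write(sysdb, "docs")
again = segment_for_write(sysdb, "docs")    # same segment as `first`
other = segment_for_write(sysdb, "images")  # a different collection
```

A later supervisor process could replace `segment_for_write` with a policy that creates segments per partition or shard, while readers and writers keep consulting the SysDB in the same way.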
Closing because this is too big and too far out of date. Going to re-create using smaller PRs.