chroma Decoupled Architecture refactor

Description of changes

Begin preparing the codebase for a model that supports decoupled reads and writes.

[x] Update config system to allow dependency injection via class name
[x] Create new interfaces for SysDB, segment, and segment management code.
[x] Update DB method signatures to support append-only and soft delete
[x] Migration system for SQL-based databases
[ ] Build new implementation of chromadb.api.API to utilize new interfaces
[ ] Provide concrete implementations for all interfaces as needed to replicate current library-only and single server deployment models
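
As a rough illustration of the first checklist item, dependency injection via class name could look something like the sketch below. The `get_class` and `instantiate` helpers and the `sysdb_impl` setting key are hypothetical names, not Chroma's actual config API, and a standard-library class stands in for a real implementation.

```python
import importlib
from typing import Any, Dict, Type


def get_class(fqn: str) -> Type[Any]:
    """Resolve a fully qualified class name such as 'collections.OrderedDict'."""
    module_name, _, class_name = fqn.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {fqn!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def instantiate(settings: Dict[str, str], key: str, *args: Any, **kwargs: Any) -> Any:
    """Build whichever implementation the settings name for `key`."""
    return get_class(settings[key])(*args, **kwargs)


# Hypothetical setting: the key and the stdlib class are placeholders for a
# real interface/implementation pair.
settings = {"sysdb_impl": "collections.OrderedDict"}
sysdb = instantiate(settings, "sysdb_impl")
```

Swapping implementations then only requires changing the string in the settings, which is what makes library-only and server deployments selectable from config.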

Interfaces & Data Flow

Test plan

Unit and integration tests continue to pass.

Documentation Changes

TBD on how much we should try to mitigate breaking changes. The user-facing API can remain 100% consistent; however, the schema for the DuckDB and ClickHouse databases may need to change in breaking ways. We do not yet have a system for DB migrations or versioned schema.

Mar 12 '23 23:03 levand

@HammadB just bouncing something off you. Most of the code up until now presumes either UUIDs or strings for embedding IDs.

Managing multiple types across multiple databases was super cumbersome, so I went ahead and switched everything to the lowest common denominator (strings).

If we really want to save space, we can still store UUIDs as base85 strings, meaning they can be saved as 160 bits which isn't too much worse than their native 128.
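
To make the size arithmetic concrete, here is a small sketch using Python's standard `base64` and `uuid` modules. A UUID is 16 bytes, and base85 maps every 4 bytes to 5 characters, so the encoded form is 20 characters (160 bits), versus 36 characters for the canonical hex form. The helper names are made up for illustration.

```python
import base64
import uuid


def uuid_to_b85(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as 20 base85 characters."""
    return base64.b85encode(u.bytes).decode("ascii")


def b85_to_uuid(s: str) -> uuid.UUID:
    """Invert uuid_to_b85."""
    return uuid.UUID(bytes=base64.b85decode(s))


original = uuid.uuid4()
encoded = uuid_to_b85(original)
```

Note that the base85 alphabet used by `b85encode` includes punctuation characters, so values should be passed to the database as bound parameters rather than interpolated into SQL.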

Given that embedding IDs only truly need to be unique within a collection/segment, I think strings are an OK choice.

LMK what you think.

Mar 13 '23 13:03 levand

I'd still prefer that we use the underlying numeric representation in the places that support it, but that is perhaps a needless optimization. If we need to do seq-scan equivalents over the ID, we would also benefit from vectorized comparisons.

I'll leave it up to you, as I don't know whether it will matter much at all, and it probably won't.

We can leave it as strings if it's easier now. Out of curiosity, what's the support for storing UUIDs like across the various DBs?

Mar 13 '23 13:03 HammadB

@HammadB most DBs have a dedicated UUID type or at least fixed-length binary data (which is effectively the same). Strangely though, CrateDB is not one of them: we'd have to encode a UUID as a string to store it there anyway.

Mar 13 '23 14:03 levand

Cool, so it's CrateDB and DuckDB that aren't supporting it. Got it!

Mar 13 '23 14:03 HammadB

Top-level question: how are segment types assigned to collections? Are you imagining it as a data-responsive process in the placement engine? That's what I am assuming, so I want to confirm.

Mar 13 '23 16:03 HammadB

> Top-level question: how are segment types assigned to collections? Are you imagining it as a data-responsive process in the placement engine? That's what I am assuming, so I want to confirm.

I see this as the primary axis along which the system will evolve.

We'll start very simple: one segment per collection, assigned locally in one of the impls (probably the stream reader).

Eventually we'll have the full distributed architecture with a dedicated supervisor process creating segments for different partitions/shards dynamically in response to load, tuning parameters and user hints.

There will probably be some steps in between too.

In all cases the SysDB will be the source of truth about what segments exist and where reads and writes should be directed.
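
A minimal sketch of the "very simple" starting point, assuming an in-memory registry: every name here (`Segment`, `InMemorySysDB`, `create_segment`, `get_segments`, the `"vector"` type string) is hypothetical and only illustrates the idea of the SysDB recording which segments exist for a collection.

```python
from dataclasses import dataclass
from typing import Dict, List
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Segment:
    id: UUID
    collection: str
    type: str  # segment type label; the values used here are placeholders


class InMemorySysDB:
    """Hypothetical stand-in that records which segments exist per collection."""

    def __init__(self) -> None:
        self._segments: Dict[str, List[Segment]] = {}

    def create_segment(self, collection: str, type: str) -> Segment:
        segment = Segment(id=uuid4(), collection=collection, type=type)
        self._segments.setdefault(collection, []).append(segment)
        return segment

    def get_segments(self, collection: str) -> List[Segment]:
        # Readers and writers would consult this to decide where to route.
        return list(self._segments.get(collection, []))


sysdb = InMemorySysDB()
seg = sysdb.create_segment("docs", type="vector")
```

In the distributed version described above, the supervisor process would be the caller of something like `create_segment`, while readers and writers would only query.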

Mar 13 '23 17:03 levand

Closing because this is too big and too far out of date. Going to re-create using smaller PRs.

Apr 27 '23 01:04 levand
