Row Oriented `add`
Chroma is currently column-oriented for storage and retrieval across the Python and JS APIs. For example:
collection.add(
documents=["This is a document", "This is another document"],
metadatas=[{"source": "my_source"}, {"source": "my_source"}],
ids=["id1", "id2"]
)
await collection.add(
["id1", "id2"],
undefined,
[{"source": "my_source"}, {"source": "my_source"}],
["This is a document", "This is another document"],
)
This works well for data-science workflows, for example where the user already stores data in dataframes and dumping the columns is trivial. However, it can be confusing and hard to verify, since each item is split across several parallel lists; see https://github.com/chroma-core/chroma/issues/254. The behavior in JS is especially awkward, because we end up forcing the user to pass `undefined` for positional arguments.
Proposal - something like this:
collection.add([
Embedding(document="This is a document", metadata={"source": "my_source"}, id="id1"),
Embedding(document="This is another document", metadata={"source": "my_source"}, id="id2"),
])
await collection.add([
{
"document":"This is a document",
"metadata": {"source": "my_source"},
"id": "id1"
},
{
"document":"This is another document",
"metadata": {"source": "my_source"},
"id": "id2"
}
])
I think we should probably accept both row-oriented and column-oriented input, and I think we can pretty easily internalize the logic to handle that.
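As a rough illustration of what internalizing that logic could look like, here is a minimal sketch that pivots row-oriented records into the existing column-oriented keyword arguments. The `Row` dataclass and `rows_to_columns` helper are hypothetical names invented for this example, not part of the Chroma API, and embeddings are left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Row:
    """Hypothetical row-oriented record (illustrative only, not a Chroma type)."""
    id: str
    document: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def rows_to_columns(rows: List[Row]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into parallel, column-oriented lists."""
    return {
        "ids": [row.id for row in rows],
        "documents": [row.document for row in rows],
        "metadatas": [row.metadata for row in rows],
    }


columns = rows_to_columns([
    Row(id="id1", document="This is a document", metadata={"source": "my_source"}),
    Row(id="id2", document="This is another document", metadata={"source": "my_source"}),
])
# The result could then be forwarded to the existing column-oriented call,
# e.g. collection.add(**columns).
```

Because every field of an item lives on one `Row`, the parallel lists are aligned by construction, which is the property that is hard to verify by eye in the column-oriented form.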
This write-up is specific to `add`, but `upsert`/`update` should also support this.
`query` results are a trickier case; perhaps we would need to let the user choose whether they want the results column-oriented or row-oriented. I think it would make sense for row to be the default, but it's worth noting that this would be a breaking change.
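To make the row-oriented option concrete, here is a minimal sketch of pivoting a column-oriented query result into rows. It assumes the result is a dict of parallel lists nested one level per query (`ids`, `documents`, `metadatas`, `distances`), with omitted columns set to `None`. That shape and the `query_result_to_rows` helper are assumptions made for illustration, not a confirmed API.

```python
from typing import Any, Dict, List, Optional

# Map plural column names to singular row keys (assumed column names).
SINGULAR = {
    "ids": "id",
    "documents": "document",
    "metadatas": "metadata",
    "distances": "distance",
}


def query_result_to_rows(
    result: Dict[str, Optional[List[List[Any]]]]
) -> List[List[Dict[str, Any]]]:
    """Return one list of row dicts per query, skipping columns that are None."""
    present = {k: v for k, v in result.items() if k in SINGULAR and v is not None}
    rows_per_query = []
    for q, hit_ids in enumerate(present["ids"]):
        rows_per_query.append([
            {SINGULAR[key]: column[q][i] for key, column in present.items()}
            for i in range(len(hit_ids))
        ])
    return rows_per_query


result = {
    "ids": [["id1", "id2"]],
    "documents": [["This is a document", "This is another document"]],
    "metadatas": [[{"source": "my_source"}, {"source": "my_source"}]],
    "distances": None,  # column not requested in this example
}
rows = query_result_to_rows(result)
```

Keeping the pivot as a thin layer over the column-oriented result would let an opt-in flag select the orientation without changing storage, which might soften the breaking change.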
For the JS API, would destructuring object parameters also work to solve the problem mentioned in #254? See https://simonsmith.io/destructuring-objects-as-function-parameters-in-es6
Yes! I agree that the column-oriented JS API should use destructuring like that - it would make the API much cleaner.
The other case to consider here is `bulk_add`, where users want to load possibly millions of items at a time. We may not want to support both column and row orientation for that use case, for performance reasons.
Here is a proof of concept for a simple functional wrapper to enable this interface: https://github.com/chroma-core/chroma/pull/719
Closing in favor of https://github.com/chroma-core/chroma/issues/420