Row Oriented `add`

Open jeffchuber opened this issue 1 year ago • 2 comments

Chroma is currently column-oriented for storage and retrieval across the Python and JS APIs. For example:

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

await collection.add(
    ["id1", "id2"],
    undefined,
    [{"source": "my_source"}, {"source": "my_source"}],
    ["This is a document", "This is another document"],
) 

This is good for data-science workflows, for example where the user already stores data in dataframes and dumping the columns is trivial. However, it can be confusing and hard to verify, because the values for each item must line up by position across separate lists (see https://github.com/chroma-core/chroma/issues/254). The JS behavior is especially awkward, since we end up forcing the user to pass undefined for positional arguments they want to skip.

Proposal - something like this:

collection.add([
  Embedding(document="This is a document", metadata={"source": "my_source"}, id="id1"),
  Embedding(document="This is another document", metadata={"source": "my_source"}, id="id2"),
])

await collection.add([
  {
    "document":"This is a document", 
    "metadata": {"source": "my_source"}, 
    "id": "id1"
  },
  {
    "document":"This is another document", 
    "metadata": {"source": "my_source"}, 
    "id": "id2"
  }
]) 

I think we should probably accept both row-oriented and column-oriented input, and I think we can pretty easily handle the conversion between them internally.
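As a rough illustration of handling that internally, here is a minimal sketch that normalizes row-oriented input into the column-oriented lists that add takes today. The Embedding dataclass mirrors the proposal above and rows_to_columns is a hypothetical helper name; neither is an existing part of Chroma's API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence, Union


@dataclass
class Embedding:
    """Hypothetical row type mirroring the proposal; not an existing Chroma class."""
    id: str
    document: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None
    embedding: Optional[List[float]] = None


def rows_to_columns(rows: Sequence[Union[Embedding, Dict[str, Any]]]) -> Dict[str, List[Any]]:
    """Turn a list of rows into keyword arguments shaped like the column-oriented add()."""
    columns: Dict[str, List[Any]] = {
        "ids": [], "documents": [], "metadatas": [], "embeddings": [],
    }
    for row in rows:
        # Accept either the dataclass or a plain dict with the same keys.
        data = row if isinstance(row, dict) else vars(row)
        columns["ids"].append(data["id"])  # id is required on every row
        columns["documents"].append(data.get("document"))
        columns["metadatas"].append(data.get("metadata"))
        columns["embeddings"].append(data.get("embedding"))
    # Drop optional columns that no row supplied, so they are simply omitted.
    return {
        name: values
        for name, values in columns.items()
        if name == "ids" or any(v is not None for v in values)
    }


kwargs = rows_to_columns([
    Embedding(id="id1", document="This is a document", metadata={"source": "my_source"}),
    {"id": "id2", "document": "This is another document", "metadata": {"source": "my_source"}},
])
# kwargs could then be forwarded as collection.add(**kwargs)
```

With a helper like this, add could check whether it received a list of rows or the existing keyword columns and route both through the same column-oriented path.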

This write-up is specific to add, but upsert/update should also support this.

Query results are a trickier case; perhaps we would need to let the user choose whether they want the results column-oriented or row-oriented. I think it would make sense for row to be the default, but it's worth noting that this would be a breaking change.
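To make the two result shapes concrete, here is a minimal sketch of converting a column-oriented result into rows. It assumes a flat result dict with parallel ids, documents, and metadatas lists; columns_to_rows is a hypothetical helper, and real query results may be nested per query and would need an extra loop.

```python
from typing import Any, Dict, List


def columns_to_rows(result: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Zip parallel columns into one dict per item (hypothetical helper)."""
    # Assumed mapping from plural column names to singular row keys.
    singular = {"ids": "id", "documents": "document", "metadatas": "metadata"}
    # Ignore unknown columns and columns that were not returned (None).
    present = {
        singular[name]: values
        for name, values in result.items()
        if name in singular and values is not None
    }
    count = len(result["ids"])
    for key, values in present.items():
        # Misaligned columns are exactly the error row orientation should surface.
        if len(values) != count:
            raise ValueError(f"column {key!r} has {len(values)} items, expected {count}")
    return [{key: values[i] for key, values in present.items()} for i in range(count)]


rows = columns_to_rows({
    "ids": ["id1", "id2"],
    "documents": ["This is a document", "This is another document"],
    "metadatas": [{"source": "my_source"}, {"source": "my_source"}],
})
```

A flag on the query call could then choose between returning the column dict unchanged and passing it through a helper like this; the flag's name and default are left open here.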

jeffchuber • Apr 03 '23 01:04

For the JS API, would destructuring objects as function parameters also work to solve the problem mentioned in #254? See https://simonsmith.io/destructuring-objects-as-function-parameters-in-es6

HammadB • Apr 03 '23 16:04

Yes! I agree that the column-oriented JS API should use destructuring like that; it would make the API much cleaner.

The other case to consider here is bulk_add, where users want to load possibly millions of items at a time. For performance reasons, we may not want to support both column and row orientation for that use case.

jeffchuber • Apr 03 '23 17:04

Here is a proof of concept for a simple functional wrapper that enables this interface: https://github.com/chroma-core/chroma/pull/719

jeffchuber • Jun 23 '23 17:06

Closing in favor of https://github.com/chroma-core/chroma/issues/420

jeffchuber • Jun 23 '23 17:06