
[DOC]: Add formal definitions for `where` and `where_document` filter syntax

Open hesreallyhim opened this issue 8 months ago • 2 comments

Description

The syntax for where and where_document filters is not formally defined anywhere. The Docs site ("Querying Collections") and the Reference pages for the client > Collections cover most of it, but a few details are not documented anywhere. (E.g., can the where dictionary have multiple entries? No.) To make things worse, the documentation has/had examples that were not well-formed (a fix for this was merged in #4096, but I don't think it has been deployed yet).

What's the problem?

  • No source of truth for the where(_document) grammar - therefore, possibility of drift/inconsistency.
  • Increases difficulty for partners to add integrations if they don't know the exact syntax/grammar. (E.g. langchain-chroma has problems on this point.)

Fix

Would like to hear from maintainers. Could be a new page/section of the docs site, could be a more rigorous/strict/thorough type declaration for Where and WhereDocument types. (Currently, part of the "grammar" is not codified in the types, since it's not totally straightforward to declare a type of "Dict of length 1", so this is enforced elsewhere during a validation phase.) I would probably say a slight mixture of both, but I think this is important to fix for internal and external documentation purposes, as well as for better type-checking.

hesreallyhim avatar Apr 15 '25 14:04 hesreallyhim

@hesreallyhim I think that's a really great idea. Are you interested in giving it a shot?

jeffchuber avatar Apr 15 '25 23:04 jeffchuber

Yeah, for sure! I think there are a few angles of attack:

  • The current documentation site is pretty comprehensive and readable (although often relies on examples and we have to make sure those are correct/stay up to date), but there's no formal definition. So we should at least document the little edge case stuff that I mentioned, like:

    • filter dictionaries must have exactly one top-level key. This is not well-formed: { "foo": "bar", "baz": "qux" } and neither is {}

    • {"$contains": [1, "hello", True]} is not well-formed (the contains array must be single-typed)

  • Do we want to have a reference page with a formal definition of the grammar? (In addition to the existing more-readable style of the "Querying Collections" docs.) Something like JSON Schema could be good, maybe if we end up defining a schema in the code we can just reference it.

  • Static type-checking of filters

  • Run-time validation of filters
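To make the two edge cases above concrete, here is a minimal sketch of a run-time check in plain Python. The function name `check_filter_shape` is hypothetical and this is not Chroma's actual validation code; it only encodes the two rules as stated in this comment (exactly one top-level key, and a list operand must be single-typed).

```python
from typing import Any, Mapping


def check_filter_shape(filter_dict: Mapping[str, Any]) -> None:
    """Raise ValueError if the filter breaks either edge-case rule.

    Hypothetical helper for illustration only, not part of Chroma.
    """
    # Rule 1: exactly one top-level key, so {} and two-key dicts are rejected.
    if len(filter_dict) != 1:
        raise ValueError(
            f"expected exactly one top-level key, got {len(filter_dict)}"
        )

    # Rule 2: a list operand must hold values of a single type.
    (operand,) = filter_dict.values()
    if isinstance(operand, list):
        # Compare exact types so that True (a bool) is not counted as an int.
        kinds = {type(item) for item in operand}
        if len(kinds) > 1:
            names = sorted(kind.__name__ for kind in kinds)
            raise ValueError(f"list operand mixes types: {names}")


# The examples from this thread:
check_filter_shape({"foo": "bar"})  # passes silently
for bad in ({}, {"foo": "bar", "baz": "qux"}, {"$contains": [1, "hello", True]}):
    try:
        check_filter_shape(bad)
    except ValueError as err:
        print(f"rejected {bad!r}: {err}")
```

The sketch only inspects the top level; a real validator would presumably also need to walk nested operator expressions.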

I haven't dug that deep into it, but it seems like the best options are jsonschema or pydantic. I think we have both of these as dependencies already. It seems to me like pydantic is the best option: define Where and WhereDocument as Pydantic models, export them to JSON Schema for API documentation, and then add a few comments to the existing docs site pages.

Sounds good?

I don't know of a good solution for the len(filter_dict) == 1 constraint, but I think maybe with Pydantic and Annotated it can be done for static checking and runtime validation.
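As a rough illustration of the Annotated idea, here is a standard-library-only sketch (no Pydantic) that attaches a length constraint to a type alias and enforces it at run time. `ExactLen`, the `Where` alias, `validate_args`, and `query` are all hypothetical names invented for this example; Chroma's real `Where` type is defined differently. Note that static type checkers ignore custom Annotated metadata, so this only covers the run-time half.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, get_type_hints


@dataclass(frozen=True)
class ExactLen:
    """Hypothetical marker: the annotated container must have exactly n items."""

    n: int


# Hypothetical alias for illustration; not Chroma's actual Where type.
Where = Annotated[Dict[str, Any], ExactLen(1)]


def validate_args(func, **kwargs):
    """Check ExactLen markers on func's annotations, then call func."""
    hints = get_type_hints(func, include_extras=True)
    for name, value in kwargs.items():
        # Annotated types expose their extra metadata as __metadata__.
        for meta in getattr(hints.get(name), "__metadata__", ()):
            if isinstance(meta, ExactLen) and len(value) != meta.n:
                raise ValueError(
                    f"{name} must have exactly {meta.n} key(s), got {len(value)}"
                )
    return func(**kwargs)


def query(where: Where) -> str:
    # Stand-in for a real query; returns the single top-level key.
    return next(iter(where))


print(validate_args(query, where={"foo": "bar"}))
try:
    validate_args(query, where={"foo": "bar", "baz": "qux"})
except ValueError as err:
    print(f"rejected: {err}")
```

A Pydantic-based version would presumably declare the constraint in a similar Annotated style and let the library do the checking, which would also make the JSON Schema export mentioned above possible.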

hesreallyhim avatar Apr 16 '25 02:04 hesreallyhim