chroma
[Feature Request]: Querying by metadata fields is not flexible enough
Describe the problem
I am weighing up the trade-off between creating thousands of chroma collections and having few collections with more complex metadata objects so that I will be able to achieve filtering/querying based on different data type operations.
Do you plan to support, in the near future, more operations and data types in metadata (mainly custom objects such as JSON objects)?
Example:
collection.add(
    documents=[doc1],
    metadatas=[{"metadata1": [{"k1": "v1"}, {"k2": "v2"}]}],
    ids=["id1"]
)
Also, do you happen to have any plan to support string $contains operations in the metadata where condition?
Example:
result = collection.query(
    query_texts=["This is sample query text"],
    where={"string_type_metadata_field": {"$contains": "substring"}}
)
Describe the proposed solution
- I would love for collection metadata to support fields of any type, for example list, map, set, JSON, etc. Today the only supported types seem to be: str, int, float, or bool.
- When querying with the where keyword, I would love to see more operations supported, such as string/list/map/set contains. Today the only supported operations seem to be: $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin.
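Until list-valued metadata exists, one possible workaround that stays within the scalar types and the $eq operator listed above is to flatten a list into one boolean key per item. The sketch below is only an illustration: the helper name flatten_tags and the "tag_" prefix are hypothetical, and the collection calls are left as comments because they assume an existing collection.

```python
def flatten_tags(base, tags, prefix="tag_"):
    """Return a copy of `base` with one boolean key per tag.

    `flatten_tags` and the "tag_" prefix are hypothetical names chosen
    for this sketch; they are not part of Chroma.
    """
    flat = dict(base)
    for tag in tags:
        flat[f"{prefix}{tag}"] = True
    return flat


# Every value stays a str, int, float, or bool.
metadata = flatten_tags({"source": "report.pdf", "page": 3}, ["finance", "q3"])

# An equality filter on the boolean key then stands in for "list contains".
where = {"tag_finance": True}

# Assuming an existing `collection` and a document `doc1`:
# collection.add(documents=[doc1], metadatas=[metadata], ids=["id1"])
# result = collection.query(query_texts=["This is sample query text"], where=where)
```

The trade-off is that the set of metadata keys grows with the number of distinct tags, so this probably suits small, closed vocabularies better than free-form keywords.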
Alternatives considered
No response
Importance
would make my life easier
Additional Information
No response
I fully support the proposed features. One addition:
Besides $contains, I would also appreciate $regex (as in MongoDB: @Link)
Thanks for the excellent work so far!
I also needed more filtering possibilities, so I went to investigate what can be done. The collections are implemented as SQL databases, so I don't think supporting more complex metadata would be possible (correct me if I'm wrong :)).
However, additional operators, such as the $regex @nielscs mentioned and something similar to the $contains @abrhaleitela mentioned, can be implemented, and they can also serve as a workaround for not having more complex metadata.
For example, it would be ideal for my use case to have a list in one metadata field and then filter the database based on what is in the list. I implemented the $like operator for the where operation and the $regex operator for both the where and where_document operations, and I was able to simulate the behavior I needed using these.
I created a pull request with these changes (https://github.com/chroma-core/chroma/pull/1393); hopefully that helps!
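To make the "list simulated with a string plus a LIKE-style match" idea concrete, here is a rough, self-contained sketch. It does not call Chroma and does not assume the $like operator from the pull request is available; the matching is emulated in plain Python and handles only the % wildcard. The names SEP, encode_list, like_pattern, and like_matches are hypothetical.

```python
import re

SEP = "|"  # hypothetical separator; items must not contain it


def encode_list(items):
    """Store a list as one string, wrapped in separators: ['a', 'b'] -> '|a|b|'."""
    for item in items:
        if SEP in item:
            raise ValueError(f"item {item!r} contains the separator {SEP!r}")
    return SEP + SEP.join(items) + SEP


def like_pattern(item):
    """Build a LIKE-style pattern that matches only a whole list item."""
    return f"%{SEP}{item}{SEP}%"


def like_matches(pattern, value):
    """Emulate SQL LIKE in Python, supporting only the % wildcard."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


encoded = encode_list(["red", "green"])                     # stored as a str metadata value
has_red = like_matches(like_pattern("red"), encoded)        # whole item: matches
has_partial = like_matches(like_pattern("re"), encoded)     # partial item: does not match
```

Wrapping every item in separators, including the first and last, is what prevents a search for one item from matching a longer item that merely contains it.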
I am confused by your examples. You are saying that you want to apply this filtering on list metadata, but looking at your examples I don't see lists as metadata but just strings. I have the same problem, so I guess I have to make my list metadata into a string and then apply the like operator to see if the string contains my substring?
Not having a simple $like operator, as in most SQL-based databases, is almost a deal breaker for me, and I realized the option was missing only after setting up a lot of code to use Chroma. Even if Chroma cannot offer something as powerful as $regex, at least $contains (LIKE '%string%') would be greatly appreciated.
@pevogam, we have a pending PR on this https://github.com/chroma-core/chroma/pull/1196. Adding these operators is not that difficult, but the team is mindful of adding operators that might be difficult to carry over to the distributed/hosted version of Chroma.
Thanks @tazarov for linking the PR here for those who end up searching the issues first. If the functionality has to be disabled in certain deployments but can easily be made available in others, perhaps we can simply detect the type of use and disable it there? But I will check the details in the PR now.
The author mentioned adding lists to metadata. Is that something that might happen eventually?
The way I store lists is by doing str(my_list).
I wholly support this. I find it a bit silly that metadata can only hold simple scalar values such as strings. In my use case, for example, I have a list of documents extracted from a PDF, where each document is a page. That PDF contains an outline and an index, and I would love to add a list of keywords (extracted from the outline and/or index) to the metadata of each page. But right now the best I can do is save the list as a string and then split the string again when I need to consume the metadata, which is a silly extra step that is also error-prone if you're not careful about your separators.
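One way to avoid the separator pitfalls when a list has to live in a string field is to serialize it as JSON instead of joining and splitting by hand. This is a minimal sketch using only Python's standard json module; the helper names pack_keywords and unpack_keywords and the "keywords" metadata key are hypothetical.

```python
import json


def pack_keywords(keywords):
    """Serialize a list of strings into one JSON string for a metadata field."""
    return json.dumps(keywords, ensure_ascii=False)


def unpack_keywords(value):
    """Restore the original list from the stored JSON string."""
    return json.loads(value)


# Items containing commas, quotes, or pipes would break a naive join/split.
keywords = ["vector, search", "it's", "a|b"]

# The stored value is a plain str, so it fits the scalar-only metadata model.
metadata = {"page": 12, "keywords": pack_keywords(keywords)}

# When consuming the metadata, decode it back into a list.
restored = unpack_keywords(metadata["keywords"])
```

This only fixes the storage round trip; filtering on individual keywords inside the stored string would still need a contains-style operator or client-side post-processing.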
Complex metadata support is much needed.
Yes please, at least a string-contains query in metadata would go a long way.
Complex metadata support, such as a list of strings, is needed ASAP, especially when other vector DBs (Qdrant) support this.