chroma
[Feature Request]: Expose sorting/ordering to query/get path
Describe the problem
Users want to have the ability to sort/order their queries/gets.
Describe the proposed solution
We should allow controlling the sort order of a query or a get.
Alternatives considered
The alternative is users sort the results manually.
Importance
would make my life easier
Additional Information
No response
Hi, any news regarding this?
@alexkapu not immediately - do you mind sharing your use case?
I can share mine. I'm ingesting Slack data and have a metadata field message_length with the character count. There are a lot of repetitive messages like "<service name> incident" sort of thing. I want to be able to favor messages with higher message_length for some searches because they're probably higher quality content.
haven't thought it through much so don't know if that makes a ton of sense, just ran into this thread while prototyping.
without being able to sort by a field, I'm not sure I can reliably paginate either (assume content is being added while I'm paginating) but add that to the list of things I haven't actually tried yet
@stevenaldinger this is a great use case - we would like to support this
I would also like to share my use case: I'm ingesting annual budget documents from 2010-2024 and would like to prioritize retrieving documents that were uploaded/updated more recently, which are more likely to be accurate/contemporarily relevant. For each vector I would imagine having a last_updated field in the metadata, then performing an ORDER BY last_updated DESC sort of the vectors and searching the top k results.
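Until something like this exists, one client-side approximation of the idea above is to over-fetch the nearest neighbours and then re-order them by the metadata field. This is only a sketch: it assumes the query result is a dict with nested "ids", "documents", "metadatas" and "distances" lists (one inner list per query), the helper name rerank_by_metadata is hypothetical, and the sample data is hand-written rather than real Chroma output. Note that it re-orders only the k hits already returned, which is not the same as an ORDER BY applied before the search.

```python
from typing import Any


def rerank_by_metadata(
    result: dict[str, Any], key: str, descending: bool = True
) -> list[dict[str, Any]]:
    """Flatten the hits of the first query and sort them by a metadata field.

    Hypothetical helper: assumes `result` has nested "ids", "documents",
    "metadatas" and "distances" lists, one inner list per query.
    """
    hits = [
        {"id": id_, "document": doc, "metadata": meta, "distance": dist}
        for id_, doc, meta, dist in zip(
            result["ids"][0],
            result["documents"][0],
            result["metadatas"][0],
            result["distances"][0],
        )
    ]
    # Sort by the chosen metadata value, newest (largest) first by default.
    return sorted(hits, key=lambda hit: hit["metadata"][key], reverse=descending)


# Hand-written stand-in for a top-3 query result (not real output).
sample = {
    "ids": [["a", "b", "c"]],
    "documents": [["budget 2012", "budget 2024", "budget 2019"]],
    "metadatas": [[{"last_updated": 2012}, {"last_updated": 2024}, {"last_updated": 2019}]],
    "distances": [[0.10, 0.15, 0.20]],
}

ranked = rerank_by_metadata(sample, "last_updated")
print([hit["id"] for hit in ranked])
```

Fetching more results than needed and truncating after the sort gives recency more influence, at the cost of pulling less similar documents into the candidate set.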
Same here, looking to sort according to a numerical metadata field in the absence of sufficiently rich prompt text, so that I don't need to maintain a vector DB and a relational DB for the same data.
I have a similar use case in which I am ingesting macroeconomic data on a weekly/monthly basis and I am getting the older results first. I would like to get the newly added results first.
Our use case might be a bit different, but I think it's a good one. We are ingesting Confluence pages and text-splitting/storing them. The problem is that, in order to get good similarity results back, we have to break a Confluence page down into smaller Chroma "documents". When we get a good similarity search result back, we want to query for the remaining documents with the same Confluence page id (in metadata), but in order to feed that context into our AI prompt, these documents should be in the order in which they appear on the web page (the same order in which we stored them).
Being able to sort by the ids in the query would be ideal, but we can also sort in Python after the query; it's just much more of a pain using langchain's abstraction over chroma, since the get method returns document data and metadata in separate lists.
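For anyone doing the Python-side sort described above, a minimal sketch follows. It assumes the get result is a dict with parallel flat "ids", "documents" and "metadatas" lists, and that each chunk was stored with a chunk_index metadata field; both the field name and the ordered_chunks helper are hypothetical, and the sample data is hand-written.

```python
from typing import Any


def ordered_chunks(
    get_result: dict[str, Any], sort_key: str = "chunk_index"
) -> list[tuple[str, str, dict[str, Any]]]:
    """Zip the parallel lists into rows and sort the rows by a metadata field.

    Hypothetical helper: assumes "ids", "documents" and "metadatas" are
    flat lists of equal length, aligned by position.
    """
    rows = zip(get_result["ids"], get_result["documents"], get_result["metadatas"])
    # Each row is (id, document, metadata); sort on the metadata value.
    return sorted(rows, key=lambda row: row[2][sort_key])


# Hand-written stand-in for a get() result filtered to one page id.
sample = {
    "ids": ["p1-c", "p1-a", "p1-b"],
    "documents": ["third part", "first part", "second part"],
    "metadatas": [
        {"page_id": "p1", "chunk_index": 2},
        {"page_id": "p1", "chunk_index": 0},
        {"page_id": "p1", "chunk_index": 1},
    ],
}

chunks = ordered_chunks(sample)
context = "\n".join(document for _, document, _ in chunks)
print(context)
```

Zipping first keeps each document attached to its own metadata, so the separate lists cannot drift out of alignment during the sort.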
ChromaDB allows limit and offset in a get(). How is it sorted before the limit and offset are applied? I'd assume based on the query terms (if any, and probably the ID otherwise). It would be nice to specify the sort field.
I also have a use case that requires sorting by a timestamp metadata field.
@baughmann, Chroma uses the document IDs for sorting the results:
https://github.com/chroma-core/chroma/blob/cbc499a732817d0732b919e2b8f7256e26588356/chromadb/segment/impl/metadata/sqlite.py#L149
This means, in a way, you can control how results are sorted (tip: Don't use UUIDv4).
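To illustrate the tip, here is a sketch of IDs that sort in insertion order when compared as plain strings. It assumes the ID ordering is an ordinary string comparison, and the make_chunk_id helper and the "page-42" prefix are hypothetical. Random UUIDv4 values carry no ordering information, whereas a zero-padded counter (or timestamp) does.

```python
def make_chunk_id(page_id: str, chunk_index: int, width: int = 6) -> str:
    """Build an ID whose string order matches the numeric chunk order.

    Hypothetical helper: zero-padding keeps "000002" ahead of "000010",
    which an unpadded "2" versus "10" would not.
    """
    return f"{page_id}:{chunk_index:0{width}d}"


indices = [0, 1, 2, 10, 11]

padded = [make_chunk_id("page-42", i) for i in indices]
unpadded = [f"page-42:{i}" for i in indices]

# Padded IDs are already in sorted order; unpadded ones are not.
print(sorted(padded) == padded)
print(sorted(unpadded) == unpadded)
print(sorted(unpadded))
```

The padding width caps how many items can be ordered correctly, so it should be chosen with the largest expected index in mind.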
From what I understand of the above, there are several distinct use cases:
- Sort order (desc/asc)
- Sort by date - each document as it is added has a timestamp, so that can be an easy win - https://github.com/chroma-core/chroma/blob/cbc499a732817d0732b919e2b8f7256e26588356/chromadb/migrations/metadb/00001-embedding-metadata.sqlite.sql#L6
- Sort by metadata - current query structure allows it, but performance should be measured.
@tazarov Thanks for getting back to me and confirming my guess about sorting by ID :)
The last use case--sort by arbitrary metadata--would probably be the most widely useful
@baughmann is this feature already implemented?
@driphireweb Not as far as I've been able to find, but I'd be happy to be incorrect