[Feature Request]: Expose sorting/ordering to query/get path

Open HammadB opened this issue 1 year ago • 15 comments

Describe the problem

Users want to have the ability to sort/order their queries/gets.

Describe the proposed solution

We should allow controlling the sort order of a query or a get.

Alternatives considered

The alternative is users sort the results manually.

Importance

would make my life easier

Additional Information

No response

HammadB avatar May 05 '23 17:05 HammadB

Hi, any news regarding this?

alexkapu avatar Dec 26 '23 12:12 alexkapu

@alexkapu not immediately - do you mind sharing your use case?

jeffchuber avatar Dec 29 '23 22:12 jeffchuber

I can share mine. I'm ingesting Slack data and have a metadata field message_length with the character count. There are a lot of repetitive messages, like "<service name> incident" sort of thing. I want to be able to favor messages with a higher message_length for some searches because they're probably higher-quality content. I haven't thought it through much, so I don't know if that makes a ton of sense; I just ran into this thread while prototyping.

stevenaldinger avatar Jan 03 '24 06:01 stevenaldinger

Without being able to sort by a field, I'm not sure I can reliably paginate either (assume content is being added while I'm paginating), but add that to the list of things I haven't actually tried yet.

stevenaldinger avatar Jan 03 '24 06:01 stevenaldinger

@stevenaldinger this is a great use case - we would like to support this

jeffchuber avatar Jan 05 '24 06:01 jeffchuber

I would also like to share my use case: I'm ingesting annual budget documents from 2010-2024 and would like to prioritize retrieving documents that were uploaded/updated more recently, which are more likely to be accurate/contemporarily relevant. For each vector I would imagine having a last_updated field in the metadata, then performing an ORDER BY last_updated DESC sort of the vectors and searching the top k results.
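Until ordering is exposed, one rough client-side approximation is to over-fetch by similarity and then re-rank the hits by the metadata field. This is only a sketch: it assumes the result has the nested-list shape that collection.query() returns, the sample data is made up, and rerank_by_metadata is a hypothetical helper, not part of Chroma. Note that it sorts after the similarity search, which is not the same as sorting before it.

```python
from typing import Any, Dict, List


def rerank_by_metadata(result: Dict[str, Any], key: str, top_k: int,
                       descending: bool = True) -> List[Dict[str, Any]]:
    """Re-rank the first query's hits by a metadata field and keep top_k."""
    # query() returns one inner list per query, so take index 0.
    ids = result["ids"][0]
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    hits = [{"id": i, "document": d, "metadata": m}
            for i, d, m in zip(ids, docs, metas)]
    hits.sort(key=lambda h: h["metadata"][key], reverse=descending)
    return hits[:top_k]


# Made-up result, shaped like an over-fetched query (e.g. n_results=3).
sample = {
    "ids": [["a", "b", "c"]],
    "documents": [["budget 2012", "budget 2024", "budget 2019"]],
    "metadatas": [[
        {"last_updated": "2012-06-01"},
        {"last_updated": "2024-02-15"},
        {"last_updated": "2019-09-30"},
    ]],
}

top = rerank_by_metadata(sample, "last_updated", top_k=2)
```

ISO-8601 date strings are used here because they compare correctly as plain strings; a numeric timestamp would work the same way.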

Brainana avatar Mar 20 '24 01:03 Brainana

Same here, looking to sort according to a numerical metadata field in the absence of sufficiently rich prompt text, so that I don't need to maintain a vectordb and a relational db for the same data.

clausagerskov avatar Apr 25 '24 12:04 clausagerskov

> I would also like to share my use case: I'm ingesting annual budget documents from 2010-2024 and would like to prioritize retrieving documents that were uploaded/updated more recently, which are more likely to be accurate/contemporarily relevant. For each vector I would imagine having a last_updated field in the metadata, then performing an ORDER BY last_updated DESC sort of the vectors and searching the top k results.

I have a similar use case, wherein I am ingesting macroeconomic data on a weekly/monthly basis and I am getting the older results first. I would like to get the newly added results first.

sreeram1658 avatar Jul 01 '24 08:07 sreeram1658

Our use-case might be a bit different, but I think it's a good one. We are ingesting Confluence pages and text-splitting/storing them. The problem is that, in order to get good similarity results back, we have to break a Confluence page down into smaller Chroma "documents". When we get a good similarity search result back, we want to query for the remaining documents with the same Confluence page id (in metadata), but in order to feed that context into our AI prompt, these documents should be in order as they appear on the web page (the same order that we stored them)

Being able to sort by the ids in the query would be ideal, but we can also sort in Python after the query; it's just much more of a pain using langchain's abstraction over chroma, since the get method returns document data and metadata in separate lists.
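The manual sort described above can be sketched roughly as follows. It assumes the result is a dict of parallel lists ("ids", "documents", "metadatas"), as a get() call returns; sort_get_result and the chunk_index metadata field are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def sort_get_result(result: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Reorder the parallel lists of a get() result by a metadata field."""
    # Compute the sorted index order once, then apply it to every list
    # so ids, documents, and metadatas stay aligned.
    order = sorted(range(len(result["ids"])),
                   key=lambda i: result["metadatas"][i][key])
    return {
        "ids": [result["ids"][i] for i in order],
        "documents": [result["documents"][i] for i in order],
        "metadatas": [result["metadatas"][i] for i in order],
    }


# Made-up chunks of one page, returned out of order.
chunks = {
    "ids": ["x3", "x1", "x2"],
    "documents": ["third part", "first part", "second part"],
    "metadatas": [{"chunk_index": 2}, {"chunk_index": 0}, {"chunk_index": 1}],
}

ordered = sort_get_result(chunks, "chunk_index")
page_text = "\n".join(ordered["documents"])
```

Sorting an index list instead of each list separately is what keeps a document paired with its own metadata.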

ewilliams-zoot avatar Jul 08 '24 20:07 ewilliams-zoot

> @alexkapu not immediately - do you mind sharing your use case?

ChromaDB allows limit and offset in a get(). How is it sorted before the limit and offset are applied? I'd assume based on the query terms (if any, and probably the ID otherwise). It would be nice to specify the sort field.
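For reference, paging with limit and offset might look like the sketch below. It assumes only that the collection object has a get(limit=..., offset=...) method returning a dict with an "ids" list. FakeCollection is a stand-in for demonstration, and its ID-sorted order is an assumption, not confirmed behavior. Pages are only consistent if the underlying order is stable and nothing is inserted mid-scan.

```python
from typing import Any, Dict, Iterator, List


def iter_pages(collection: Any, page_size: int) -> Iterator[Dict[str, Any]]:
    """Yield successive get() pages until an empty page comes back."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset)
        if not page["ids"]:
            return
        yield page
        offset += len(page["ids"])


class FakeCollection:
    """Stand-in for a real collection, for demonstration only."""

    def __init__(self, ids: List[str]) -> None:
        self._ids = sorted(ids)  # assumed order: by ID

    def get(self, limit: int, offset: int) -> Dict[str, Any]:
        return {"ids": self._ids[offset:offset + limit]}


pages = [p["ids"] for p in iter_pages(FakeCollection(["c", "a", "b"]), 2)]
```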

baughmann avatar Jul 13 '24 19:07 baughmann

I also have a use case that needs sorting based on a timestamp metadata field.

driphireweb avatar Jul 31 '24 07:07 driphireweb

@baughmann, Chroma uses the document IDs for sorting the results:

https://github.com/chroma-core/chroma/blob/cbc499a732817d0732b919e2b8f7256e26588356/chromadb/segment/impl/metadata/sqlite.py#L149

This means, in a way, you can control how results are sorted (tip: Don't use UUIDv4).
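As an illustration of that tip, here is one possible way to build IDs that sort chronologically. It is a sketch that assumes IDs are compared as plain strings; sortable_id is a hypothetical helper, and the random suffix is only there to reduce the chance of collisions.

```python
import time
import uuid
from typing import Optional


def sortable_id(timestamp_ns: Optional[int] = None) -> str:
    """Build an ID whose string order follows creation time."""
    ts = time.time_ns() if timestamp_ns is None else timestamp_ns
    # Zero-pad to a fixed width so string comparison matches numeric order.
    return f"{ts:020d}-{uuid.uuid4().hex[:8]}"


older = sortable_id(1_000)
newer = sortable_id(2_000)
```

A plain UUIDv4 is random, so its string order carries no information about when the document was added.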

From what I understand of the above, there are several distinct use cases:

  • Sort order (desc/asc)
  • Sort by date - each document as it is added has a timestamp, so that can be an easy win - https://github.com/chroma-core/chroma/blob/cbc499a732817d0732b919e2b8f7256e26588356/chromadb/migrations/metadb/00001-embedding-metadata.sqlite.sql#L6
  • Sort by metadata - current query structure allows it, but performance should be measured.

tazarov avatar Jul 31 '24 13:07 tazarov

@tazarov Thanks for getting back to me and confirming my guess about sorting by ID :)

The last use case--sort by arbitrary metadata--would probably be the most widely useful

baughmann avatar Jul 31 '24 17:07 baughmann

@baughmann is this feature already implemented?

driphireweb avatar Aug 04 '24 02:08 driphireweb

@driphireweb Not as far as I've been able to find, but I'd be happy to be incorrect

baughmann avatar Aug 04 '24 20:08 baughmann