
Enable users to pass a custom allowlist of ids for querying

Open jeffchuber opened this issue 2 years ago • 12 comments

This came up in Discord

Some users may want to do pre-filtering in their dbs and then pass the allowlist to vector similarity search.

It could hook in, similar to this, https://github.com/chroma-core/chroma/blob/main/chromadb/db/index/hnswlib.py#L187

The allowlist would probably be a List of the user-provided ids (strings)
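
A minimal sketch of the idea, assuming a brute-force cosine search over an in-memory dict rather than Chroma's actual HNSW index. The names `search_with_allowlist` and `store` are hypothetical and only illustrate how an allowlist of string ids could restrict the candidate set before ranking.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search_with_allowlist(
    store: Dict[str, Sequence[float]],
    query: Sequence[float],
    n_results: int,
    allowlist: Optional[List[str]] = None,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: rank only the ids present in the allowlist."""
    # None means "no restriction"; a list (even an empty one) restricts the search.
    allowed = set(allowlist) if allowlist is not None else None
    candidates = (
        (doc_id, cosine_distance(query, embedding))
        for doc_id, embedding in store.items()
        if allowed is None or doc_id in allowed
    )
    return sorted(candidates, key=lambda pair: pair[1])[:n_results]


# Toy data: "a" is the closest match, but it is not on the allowlist.
store = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
hits = search_with_allowlist(store, [1.0, 0.0], n_results=2, allowlist=["b", "c"])
unrestricted = search_with_allowlist(store, [1.0, 0.0], n_results=1)
```

With the allowlist in place, "a" is never considered even though it is the nearest vector. A real index would apply the same membership test inside its neighbor search instead of scanning every entry.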

jeffchuber avatar Mar 22 '23 20:03 jeffchuber

Currently we use a postgres database that holds time-series data so that the documents are queryable by a timestamp.

{ timestamp: ... , data: ..., id: ...}

I don't embed the timestamp so I would like to filter by timestamp before applying the search in chroma. With the allowlist, I can run my query in postgres first to get the allowed document IDs, and then apply the vector similarity search on the embedded data!

This would also allow us to pre-query by stuff like geographic coordinates, specific user, etc.
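
The pre-filtering step could look roughly like the sketch below. It uses the standard-library `sqlite3` module as a stand-in for Postgres so that it runs anywhere; the table layout, the sample rows, and the helper `ids_in_time_range` are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the time-series table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id TEXT PRIMARY KEY, timestamp INTEGER, data TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("doc-1", 100, "early"), ("doc-2", 200, "middle"), ("doc-3", 300, "late")],
)


def ids_in_time_range(connection: sqlite3.Connection, start: int, end: int) -> list:
    """Hypothetical helper: return the ids whose timestamp lies in [start, end]."""
    rows = connection.execute(
        "SELECT id FROM documents WHERE timestamp BETWEEN ? AND ? ORDER BY id",
        (start, end),
    )
    return [row[0] for row in rows]


# The resulting list is what would be handed to the vector search as the allowlist.
allowlist = ids_in_time_range(conn, 150, 300)
```

The same pattern extends to joins or geospatial predicates: whatever the relational query is, its output is just a list of ids.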

jakexia72 avatar Mar 22 '23 20:03 jakexia72

@jakexia72 one thing that would work today is to pass the timestamp as metadata and store it inside chroma. You can then filter by that. Did you happen to check that out?

jeffchuber avatar Mar 22 '23 20:03 jeffchuber

I gave timestamp as one example but typically the queries we run in postgres involve a couple joins across different tables. We also rely on PostGIS for efficient querying of geospatial data. I suppose adding more fields as metadata could work for something basic like the timestamp, but I'm not sure it would be performant with more complex filtering.

jakexia72 avatar Mar 22 '23 21:03 jakexia72

Also, how big can the allowlist be before we run into performance issues? Would 5K ids in the list be fine (that would be a realistic estimate for our use case)? 10K?

jakexia72 avatar Mar 22 '23 21:03 jakexia72

@jakexia72 makes sense with joins!

The cost is currently the serialization and deserialization in the REST API. We have plans to use a binary format in the future to avoid this. It should not be a problem on the backend.
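
For a rough sense of the serialization cost, the sketch below measures the JSON size of an allowlist. It assumes UUID-style string ids and a made-up `{"ids": [...]}` request body; neither is taken from Chroma's actual REST format, and `allowlist_payload_bytes` is a hypothetical helper.

```python
import json
import uuid


def allowlist_payload_bytes(n_ids: int) -> int:
    """Hypothetical helper: size in bytes of a JSON body holding n_ids UUID strings."""
    ids = [str(uuid.uuid4()) for _ in range(n_ids)]
    return len(json.dumps({"ids": ids}).encode("utf-8"))


# Payload sizes for the list lengths asked about above.
sizes = {n: allowlist_payload_bytes(n) for n in (5_000, 10_000)}
```

Under these assumptions each id costs about 40 bytes once quoted and separated, so 5K ids come to roughly 200 KB and 10K ids to roughly 400 KB per request.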

jeffchuber avatar Mar 22 '23 23:03 jeffchuber

Came to the Issues to request exactly this feature. Please let me know if you need additional use-cases (mine is pretty similar to @jakexia72's) or beta testers :pray:

steve-marmalade avatar Aug 14 '23 18:08 steve-marmalade

The next step here is to write up a CIP (Chroma Improvement Proposal). https://docs.trychroma.com/contributing#cips

Intuitively this would involve adding something like this to query

def query(
        query_embeddings: Optional[OneOrMany[Embedding]] = None,
        query_texts: Optional[OneOrMany[Document]] = None,
        n_results: int = 10,
        where: Optional[Where] = None,
        where_document: Optional[WhereDocument] = None,
        include: Include = ["metadatas", "documents", "distances"],
        # new
        where_ids=['1','2','3']
) -> QueryResult

It could also be added to where - but I think that has too much magic in it.
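
One way a separate `where_ids` argument could compose with `where` is sketched below. This is only an illustration under stated assumptions: `filter_candidates` is a hypothetical helper, `where` supports nothing but exact-match equality here, and combining the two filters with AND semantics is a guess that the CIP would need to settle.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def filter_candidates(
    records: List[Record],
    where: Optional[Dict[str, Any]] = None,
    where_ids: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: ids that satisfy both the metadata filter and the allowlist."""
    # None means "no id restriction"; an empty list means "nothing is allowed".
    allowed = set(where_ids) if where_ids is not None else None
    selected = []
    for record in records:
        if allowed is not None and record["id"] not in allowed:
            continue
        if where is not None and any(
            record["metadata"].get(key) != value for key, value in where.items()
        ):
            continue
        selected.append(record["id"])
    return selected


records = [
    {"id": "1", "metadata": {"user": "ann"}},
    {"id": "2", "metadata": {"user": "bob"}},
    {"id": "3", "metadata": {"user": "ann"}},
]
both = filter_candidates(records, where={"user": "ann"}, where_ids=["1", "2"])
ids_only = filter_candidates(records, where_ids=["2", "3"])
```

Keeping `where_ids` as its own parameter makes the `None` versus empty-list distinction explicit, which would be harder to express if ids were folded into `where`.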

If someone wants to take a pass at the CIP, that'd be great.

jeffchuber avatar Aug 29 '23 13:08 jeffchuber

worth doing, needs a CIP

jeffchuber avatar Sep 13 '23 21:09 jeffchuber

Hey there @jeffchuber, mind if I take a stab at this?

tyherox avatar Sep 14 '23 08:09 tyherox

@tyherox sure! first step is we need to write up a CIP. More about CIPs here, https://docs.trychroma.com/contributing#cips

You can see examples in https://github.com/chroma-core/chroma/tree/main/docs

Let me know if you are up to this and if you have questions!

jeffchuber avatar Sep 15 '23 16:09 jeffchuber

Cool, let me make a PR for the CIP. I'm assuming the actual work will begin after the CIP PR is accepted?

tyherox avatar Sep 16 '23 02:09 tyherox

Note: The CIP was started at #1146

reaganjlee avatar Dec 05 '23 08:12 reaganjlee