Support for metadata filter suggestions in the web UI
In #484 we hardcode the available metadata. We cannot release with that. Instead we need a way to communicate this information from the backend to the web UI. For this we need
1. An endpoint on the API, i.e. `GET /corpuses/{name}/metadata`.
2. A new abstract method on the `SourceStorage` class, e.g. `list_metadata` (name TBD). The return value should be `dict[str, tuple[type, list[Any]]]`, with the keys being the available metadata keys and the values being a two-tuple of the type and the available values.
   - The `type` above might also be a `str`, e.g. `"int"`, `"float"`, etc., if that makes it easier.
   - The source storage might opt to return an empty list for the available values to indicate that no hints for the values are available.

   This function is potentially pretty expensive, as one has to query the full database and potentially extract unique values from it. Thus, while we don't need it in the first version, we should have caching in mind.

   Lastly, I'm not sure yet whether we want to make it strictly abstract, i.e. decorate it with `@abstractmethod`, because that would require everyone to implement it even if they don't want to work with corpuses. Instead, we could leave it undecorated and `raise NotImplementedError`, thus pushing the check to runtime. Thoughts?

In addition, 2. also has to be implemented on builtin source storages.
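
To make the proposed return value and the caching concern more concrete, here is a minimal sketch. `InMemorySourceStorage`, its `store` method, and the `scans` counter are hypothetical stand-ins and not ragna's actual classes. It uses the `str` variant for the type and assumes all values of a key share one sortable type.

```python
from collections import defaultdict
from typing import Any, Optional


class InMemorySourceStorage:
    """Hypothetical stand-in for a source storage; not ragna's actual class."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = list(rows)
        self._metadata_cache: Optional[dict[str, tuple[str, list[Any]]]] = None
        self.scans = 0  # counts full scans, only here to make caching visible

    def store(self, row: dict[str, Any]) -> None:
        self._rows.append(row)
        # New data may add keys or values, so drop the cached result.
        self._metadata_cache = None

    def list_metadata(self) -> dict[str, tuple[str, list[Any]]]:
        if self._metadata_cache is None:
            self.scans += 1
            unique_values: dict[str, set] = defaultdict(set)
            # The expensive part: walk every row and collect unique values.
            for row in self._rows:
                for key, value in row.items():
                    unique_values[key].add(value)
            # Assumption: values of one key are homogeneous and sortable.
            self._metadata_cache = {
                key: (type(next(iter(values))).__name__, sorted(values))
                for key, values in unique_values.items()
            }
        return self._metadata_cache


storage = InMemorySourceStorage(
    [{"author": "a", "year": 2020}, {"author": "b", "year": 2020}]
)
print(storage.list_metadata())
```

A storage that cannot offer value hints could return an empty list as the second tuple element instead of the sorted values.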
Afterthought to 1.: in #487 we decided to use None as a sentinel for the default corpus as decided by the source storage. Not sure how this can work through the REST API though, as {name} probably has to be a str and cannot be omitted. @nenb what do you think of removing the None option and instead using a string sentinel, e.g. "default"?
> @nenb what do you think of removing the None option and instead use a string sentinel, e.g. "default"?

This seems fine to me.
The rest of the issue also seems fine to me - let's leave list_metadata undecorated for the reason that you outlined.
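
A rough sketch of what leaving the method undecorated could look like, together with the `"default"` string sentinel. The class below is a simplified stand-in and not ragna's real `SourceStorage`; the `corpus_name` parameter, the `retrieve` signature, `NoCorpusStorage`, and the `metadata_or_empty` helper are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class SourceStorage(ABC):
    """Simplified stand-in; signatures here are assumptions."""

    @abstractmethod
    def retrieve(self, prompt: str) -> list[str]:
        ...

    # Deliberately NOT decorated with @abstractmethod: subclasses that do not
    # work with corpuses can still be instantiated, and the check happens
    # only when the method is actually called.
    def list_metadata(
        self, corpus_name: str = "default"
    ) -> dict[str, tuple[str, list[Any]]]:
        raise NotImplementedError(
            f"{type(self).__name__} does not support listing metadata"
        )


class NoCorpusStorage(SourceStorage):
    """Hypothetical storage that implements only the abstract part."""

    def retrieve(self, prompt: str) -> list[str]:
        return []


def metadata_or_empty(
    storage: SourceStorage, corpus_name: str = "default"
) -> dict[str, tuple[str, list[Any]]]:
    # A caller such as the API layer can translate the runtime error into
    # "no suggestions available" instead of failing.
    try:
        return storage.list_metadata(corpus_name)
    except NotImplementedError:
        return {}


print(metadata_or_empty(NoCorpusStorage()))
```

The trade-off is that a missing implementation only surfaces at call time, so callers have to be prepared for the exception.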
GET /corpuses/{name}/metadata is not going to cut it as we potentially have multiple source storages with corpuses. Thus, {name} is not unique. I see two options
- Switch to `GET /corpuses/metadata` and return a nested dictionary with the outer layers being the source storages and the corpus names. This amplifies the cost issue highlighted above since we now query all source storages and corpuses at once.
- Switch to something along the lines of `GET /source-storages/{source_storage_name}/corpuses/{corpus_name}` to properly address the right corpus. I'm open to using a different scheme, e.g. putting the corpus first in the path or just passing query parameters.

Maybe a combination of both is a good solution?
- `GET /corpuses/metadata` returns everything
- `GET /corpuses/metadata?source_storage=Chroma` returns the same object, but with Chroma as the only item in the outer dictionary
- `GET /corpuses/metadata?source_storage=Chroma&corpus_name=default` same as above, but with only a single item in the second-level dictionary

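
The filtering behaviour of the three request variants could be sketched roughly like this, independent of any web framework. The function name `filter_metadata`, the type aliases, and the sample data (including the `LanceDB` and `papers` entries) are made up for illustration. A real implementation would probably apply the filters before querying the source storages to avoid the cost of fetching everything.

```python
from typing import Any, Optional

# storage name -> corpus name -> metadata key -> (type name, available values)
Metadata = dict[str, tuple[str, list[Any]]]
AllMetadata = dict[str, dict[str, Metadata]]


def filter_metadata(
    all_metadata: AllMetadata,
    source_storage: Optional[str] = None,
    corpus_name: Optional[str] = None,
) -> AllMetadata:
    """Keep the nested shape, but drop entries that do not match the filters."""
    result: AllMetadata = {}
    for storage_name, corpuses in all_metadata.items():
        if source_storage is not None and storage_name != source_storage:
            continue
        result[storage_name] = {
            name: metadata
            for name, metadata in corpuses.items()
            if corpus_name is None or name == corpus_name
        }
    return result


# Hypothetical sample data.
everything: AllMetadata = {
    "Chroma": {
        "default": {"year": ("int", [2020, 2021])},
        "papers": {"author": ("str", [])},
    },
    "LanceDB": {"default": {"topic": ("str", ["rag"])}},
}

# GET /corpuses/metadata
print(filter_metadata(everything))
# GET /corpuses/metadata?source_storage=Chroma
print(filter_metadata(everything, source_storage="Chroma"))
# GET /corpuses/metadata?source_storage=Chroma&corpus_name=default
print(filter_metadata(everything, source_storage="Chroma", corpus_name="default"))
```

Because all three variants return the same nested shape, the web UI can consume them with a single code path.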
@blakerosenthal this might also be a good solution for 1. https://github.com/Quansight/ragna/pull/495#issuecomment-2299736466 when using the JSON approach.
> Maybe a combination of both is a good solution?
>
> - `GET /corpuses/metadata` returns everything
> - `GET /corpuses/metadata?source_storage=Chroma` returns the same object, but with Chroma as the only item in the outer dictionary
> - `GET /corpuses/metadata?source_storage=Chroma&corpus_name=default` same as above, but with only a single item in the second-level dictionary

Makes sense, I'll implement this.