ragna Support for metadata filter suggestions in the web UI

In #484 we hardcode the available metadata. We cannot release with that. Instead we need a way to communicate this information from the backend to the web UI. For this we need:

1. An endpoint on the API, i.e. GET /corpuses/{name}/metadata
2. A new abstract method on the SourceStorage class, e.g. list_metadata (name TBD). The return value should be dict[str, tuple[type, list[Any]]], with the keys being the available metadata keys and the values being two-tuples of the type and the available values.
- The type above might also be a str, e.g. "int", "float", etc., if that makes it easier.
- The source storage might opt to return an empty list for the available values to indicate that no hints for the values are available.
This function is potentially pretty expensive, as one has to query the full database and potentially extract unique values from it. Thus, while we don't need caching in the first version, we should keep it in mind.
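To make the return shape concrete, here is a minimal sketch of how a source storage could derive it from metadata rows it has already fetched. The helper name collect_metadata and the sample rows are made up for illustration and are not part of ragna. The sketch uses the str variant of the type and assumes that all values for a key are hashable and share one type.

```python
from typing import Any


def collect_metadata(
    rows: list[dict[str, Any]],
) -> dict[str, tuple[str, list[Any]]]:
    """Map each metadata key to (type name, sorted unique values).

    Hypothetical helper: assumes hashable, homogeneously typed values per key.
    """
    unique_values: dict[str, set[Any]] = {}
    type_names: dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            unique_values.setdefault(key, set()).add(value)
            # Remember the type of the first value seen for this key.
            type_names.setdefault(key, type(value).__name__)
    return {
        key: (type_names[key], sorted(values))
        for key, values in unique_values.items()
    }


# Made-up sample rows, one dict of metadata per stored chunk.
rows = [
    {"document_name": "a.pdf", "page": 1},
    {"document_name": "a.pdf", "page": 2},
    {"document_name": "b.pdf", "page": 1},
]
hints = collect_metadata(rows)
```

A storage that cannot or does not want to provide value hints would return an empty list in place of the sorted values.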

Lastly, I'm not sure yet if we want to make it strictly abstract, i.e. decorate it with @abstractmethod, because that would require everyone to implement it even if they don't want to work with corpuses. Instead, we could leave it undecorated and raise NotImplementedError, thus pushing the check to runtime. Thoughts?
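For illustration, a minimal sketch of the undecorated variant. SourceStorageSketch is a simplified stand-in and not the actual ragna SourceStorage class; the store method and the corpus_name parameter are assumptions made only to keep the example self-contained.

```python
import abc
from typing import Any


class SourceStorageSketch(abc.ABC):
    """Simplified stand-in for a source storage base class (not ragna's)."""

    @abc.abstractmethod
    def store(self, documents: list[Any]) -> None:
        """Strictly abstract: every subclass has to implement this."""

    # Deliberately NOT decorated with @abc.abstractmethod, so subclasses
    # that don't care about corpuses can skip it. The check moves to runtime.
    def list_metadata(self, corpus_name: str) -> dict[str, tuple[str, list[Any]]]:
        raise NotImplementedError(
            f"{type(self).__name__} does not support list_metadata"
        )


class MinimalStorage(SourceStorageSketch):
    def store(self, documents: list[Any]) -> None:
        pass


# Instantiation works although list_metadata is not implemented ...
storage = MinimalStorage()

# ... and the error only surfaces when the method is actually called.
try:
    storage.list_metadata("some-corpus")
except NotImplementedError as exc:
    error_message = str(exc)
```

With @abc.abstractmethod on list_metadata, MinimalStorage() itself would already fail with a TypeError.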

In addition, 2. also has to be implemented for the builtin source storages.

Aug 20 '24 07:08 pmeier

Afterthought to 1.: in #487 we decided to use None as a sentinel for the default corpus as decided by the source storage. Not sure how this can work through the REST API though, as {name} probably has to be a str and cannot be omitted. @nenb what do you think of removing the None option and instead using a string sentinel, e.g. "default"?

Aug 20 '24 07:08 pmeier

@nenb what do you think of removing the None option and instead using a string sentinel, e.g. "default"?

This seems fine to me.

The rest of the issue also seems fine to me - let's leave list_metadata undecorated for the reason that you outlined.

Aug 20 '24 14:08 nenb

GET /corpuses/{name}/metadata is not going to cut it, as we potentially have multiple source storages with corpuses. Thus, {name} is not unique. I see two options:

1. Switch to GET /corpuses/metadata and return a nested dictionary with the outer layers being the source storages and the corpus names. This amplifies the cost issue highlighted above since we now query all source storages and corpuses at once.
2. Switch to something along the lines of GET source-storages/{source_storage_name}/corpuses/{corpus_name} to properly address the right corpus. I'm open to using a different scheme, e.g. putting the corpus first in the path or just passing both names as query parameters.

Maybe a combination of both is a good solution?

- GET /corpuses/metadata returns everything
- GET /corpuses/metadata?source_storage=Chroma returns the same object, but with Chroma as the only item in the outer dictionary
- GET /corpuses/metadata?source_storage=Chroma&corpus_name=default same as above, but with only a single item in the second-level dictionary
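A rough sketch of the response shape for the three requests above. filter_metadata and the sample data are hypothetical and only illustrate the nesting (source storage, then corpus name, then metadata key). A real implementation would rather avoid querying the source storages that were not requested than filter afterwards. The last lines show that multiple query parameters are joined with & in a standard query string.

```python
from typing import Any, Optional
from urllib.parse import urlencode

# source storage name -> corpus name -> metadata key -> (type name, values)
NestedMetadata = dict[str, dict[str, dict[str, tuple[str, list[Any]]]]]


def filter_metadata(
    all_metadata: NestedMetadata,
    source_storage: Optional[str] = None,
    corpus_name: Optional[str] = None,
) -> NestedMetadata:
    """Hypothetical helper: keep only the requested storage and corpus."""
    result: NestedMetadata = {}
    for storage, corpuses in all_metadata.items():
        if source_storage is not None and storage != source_storage:
            continue
        result[storage] = {
            name: metadata
            for name, metadata in corpuses.items()
            if corpus_name is None or name == corpus_name
        }
    return result


# Made-up sample data.
all_metadata: NestedMetadata = {
    "Chroma": {
        "default": {"page": ("int", [1, 2])},
        "other": {"author": ("str", [])},
    },
    "LanceDB": {"default": {"page": ("int", [])}},
}

everything = filter_metadata(all_metadata)
only_chroma = filter_metadata(all_metadata, source_storage="Chroma")
only_default = filter_metadata(
    all_metadata, source_storage="Chroma", corpus_name="default"
)

# Building the query string for the last request.
query = urlencode({"source_storage": "Chroma", "corpus_name": "default"})
url = f"/corpuses/metadata?{query}"
```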

@blakerosenthal this might also be a good solution for 1. https://github.com/Quansight/ragna/pull/495#issuecomment-2299736466 when using the JSON approach.

Aug 21 '24 11:08 pmeier

Maybe a combination of both is a good solution?

- GET /corpuses/metadata returns everything
- GET /corpuses/metadata?source_storage=Chroma returns the same object, but with Chroma as the only item in the outer dictionary
- GET /corpuses/metadata?source_storage=Chroma&corpus_name=default same as above, but with only a single item in the second-level dictionary

Makes sense, I'll implement this.

Aug 21 '24 13:08 nenb