vespa
Allow filtering grouped values by prefix or regex
Is your feature request related to a problem? Please describe.
Consider a classification scheme where each document can be classified in multiple categories, and the categories form a large hierarchy. As an example, let's say the classification scheme contains
genre
genre/poetry
genre/biographies
genre/fiction
styles
styles/symbolism
styles/realism
styles/realism/neorealism
styles/fantasy
styles/fantasy/high-fantasy
styles/fantasy/medieval
And we may have docs classified in multiple categories, possibly at the same level.
doc1: categories={genre/poetry styles/symbolism}
doc2: categories={genre/fiction styles/fantasy/high-fantasy styles/fantasy/medieval}
When browsing a certain category (styles/fantasy) we are interested in grouping ("faceting") search results, but only showing categories that are under the current path. It is also important to provide the complete result set.
Describe the solution you'd like
The preferred option would be to add a grouping function that filters out values that do not start with a certain prefix. So for example:
all( group(filter_prefix(category, "styles/fantasy/")) each(output(count())) )
This would discard all groups whose value does not start with "styles/fantasy/". As with other expressions, the computation would occur at each node, so network bandwidth would be greatly reduced.
filter_prefix might completely omit the group, or replace the value with an empty string (both would solve the problem), or replace it with a string selected by the user. For example:
all( group(if_starts(category, "styles/fantasy", category, "alternative")) each(output(count())) )
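To make the intended semantics concrete, here is a minimal Python sketch of what the proposed filter_prefix would compute. It is not Vespa code and does not reflect how Vespa implements grouping. The function names and the two-node layout are hypothetical and chosen only for illustration. It models the variant that omits non-matching groups entirely, with each node filtering before results are merged.

```python
from collections import Counter

def filter_prefix_counts(docs, prefix):
    """Count documents per category value, keeping only values under `prefix`.

    `docs` maps a document id to its list of category values
    (hypothetical in-memory stand-in for the documents held by one node).
    """
    counts = Counter()
    for categories in docs.values():
        for value in set(categories):  # count each value once per document
            if value.startswith(prefix):
                counts[value] += 1
    return counts

def merge_node_counts(per_node_counts):
    """Sum the already-filtered per-node counts, as a merging step would."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)
    return total

# Hypothetical layout: the two example documents live on different nodes.
node_a = {"doc1": ["genre/poetry", "styles/symbolism"]}
node_b = {"doc2": ["genre/fiction",
                   "styles/fantasy/high-fantasy",
                   "styles/fantasy/medieval"]}

merged = merge_node_counts(
    filter_prefix_counts(node, "styles/fantasy/") for node in (node_a, node_b)
)
print(dict(merged))
```

In this sketch node_a contributes nothing to the merge, which is the bandwidth saving the request is after: values outside the current path never leave the node.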
Describe alternatives you've considered
- A first approach is to group by all values:
  all( group(category) each(output(count())) )
  and then filter out the ones that don't belong to the current context. But this may require a very large maxHits to ensure that the values of interest are actually included in the results, and it will be inefficient. On large taxonomies it is hard to guarantee that the result set is complete.
- Creating one field for each level ("category1", "category2", "category3") mitigates but does not solve the problem, since documents can be in multiple categories at different points in the hierarchy, so we are still at risk of not providing the complete result set.
- A more general but maybe less efficient approach would be to allow regex filtering:
  all( group(if_regex_matches(category, "styles/fantasy", category, "alternative")) each(output(count())) )
- A new expression syntax rather than a function may be more natural, but probably requires more aggressive changes. For example:
  all( group(category) if_prefix("styles/fantasy") each(output(count())) )
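The completeness risk in the first alternative can be illustrated with a small Python sketch. It is not Vespa code. The counts are made up for illustration, and max_groups is a hypothetical stand-in for whatever limit caps the number of groups returned.

```python
from collections import Counter

def top_groups(counts, max_groups):
    """Keep only the `max_groups` largest groups, as a capped result would."""
    return dict(counts.most_common(max_groups))

def client_side_filter(groups, prefix):
    """Filter already-returned groups down to those under `prefix`."""
    return {value: n for value, n in groups.items() if value.startswith(prefix)}

# Made-up counts: the categories of interest are far less frequent
# than unrelated categories elsewhere in the taxonomy.
all_counts = Counter({
    "genre/fiction": 900,
    "genre/poetry": 700,
    "styles/realism": 500,
    "styles/fantasy/high-fantasy": 40,
    "styles/fantasy/medieval": 25,
})

# With a small cap, the groups under the current path are cut off
# before the client gets a chance to filter.
truncated = client_side_filter(top_groups(all_counts, 3), "styles/fantasy/")

# Only a cap as large as the whole taxonomy guarantees completeness.
complete = client_side_filter(
    top_groups(all_counts, len(all_counts)), "styles/fantasy/"
)
print(truncated, complete)
```

With the cap at 3 the filtered result is empty even though two matching categories exist, so client-side filtering is only reliable when the cap covers every distinct value.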
Additional context
See the originating discussion at: https://vespatalk.slack.com/archives/C01QNBPPNT1/p1654876998447789
Thanks for the nice writeup @angelf, this request also relates to https://github.com/vespa-engine/vespa/issues/15658.