
Allow filtering grouped values by prefix or regex

Open angelf opened this issue 2 years ago • 1 comments

Is your feature request related to a problem? Please describe.

Consider a classification scheme in which each document can belong to multiple categories, and the categories form a large hierarchy. As an example, let's say the classification scheme contains:

  genre
     genre/poetry
     genre/biographies
     genre/fiction
     
  styles
     styles/symbolism
     styles/realism
          styles/realism/neorealism
     styles/fantasy
          styles/fantasy/high-fantasy          
          styles/fantasy/medieval

We may also have docs classified in multiple categories, possibly at the same level:

  doc1: categories={genre/poetry styles/symbolism}
  doc2: categories={genre/fiction styles/fantasy/high-fantasy styles/fantasy/medieval}

When browsing a certain category (styles/fantasy) we are interested in grouping ("faceting") search results, but only showing categories that are under the current path. It is also important to provide the complete result set.

Describe the solution you'd like

The preferred option would be to add a grouping function that filters out values that do not start with a certain prefix. For example:

all( group(filter_prefix(category, "styles/fantasy/")) each(output(count())) )

This would discard all groups whose value does not start with "styles/fantasy/". As with other expressions, the computation would occur on each node, so network bandwidth would be greatly reduced.

filter_prefix might omit the group completely, replace the value with an empty string (either would solve the problem), or replace it with a string chosen by the user. For example:

all( group(if_starts(category, "styles/fantasy", category, "alternative")) each(output(count())) )
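
The intended semantics of the two proposed functions can be sketched outside Vespa. This is only an illustrative Python simulation, not Vespa code: `group_counts` and `DOCS` are hypothetical names, and it assumes each document counts once per distinct group it falls into. With `replacement=None` it mimics the "omit the group" behavior suggested for filter_prefix; with a replacement string it mimics if_starts.

```python
from collections import Counter

# Example documents, following the taxonomy described in the issue.
DOCS = {
    "doc1": ["genre/poetry", "styles/symbolism"],
    "doc2": ["genre/fiction", "styles/fantasy/high-fantasy", "styles/fantasy/medieval"],
}

def group_counts(docs, prefix, replacement=None):
    """Count documents per category value, keeping only values under `prefix`.

    Values outside the prefix are dropped when `replacement` is None,
    otherwise they are collapsed into the single `replacement` group.
    """
    counts = Counter()
    for categories in docs.values():
        keys = set()  # a document contributes at most once to each group
        for value in categories:
            if value.startswith(prefix):
                keys.add(value)
            elif replacement is not None:
                keys.add(replacement)
        counts.update(keys)
    return counts

if __name__ == "__main__":
    print(group_counts(DOCS, "styles/fantasy/"))
    print(group_counts(DOCS, "styles/fantasy/", replacement="alternative"))
```

Because the filter runs before counts are merged, each node would only ship the groups under the current path.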

Describe alternatives you've considered

  1. A first approach is to group by all values (all( group(category) each(output(count())) )) and then filter out the ones that don't belong to the current context. But this may require a very large maxHits to ensure that the values of interest are actually included in the results, and it will be inefficient. With large taxonomies, it will be hard to guarantee that the result set is complete.

  2. Creating one field for each level ("category1", "category2", "category3") mitigates but does not solve the problem, since documents can be in multiple categories at different points in the hierarchy; we would still risk not providing the complete result set.

  3. A more general but maybe less efficient approach would be to allow regex filtering:

    all( group(if_regex_matches(category, "styles/fantasy", category, "alternative")) each(output(count())) )

  4. A new expression syntax rather than a function might be more natural, but would probably require more invasive changes. For example:

    all( group(category) if_prefix("styles/fantasy") each(output(count())) )
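
The completeness risk described in alternative 1 can be illustrated with a small Python sketch. Everything here is hypothetical: the counts are made up, `top_groups` and `filter_by_prefix` are illustrative helpers, and the sketch assumes the backend returns only the largest groups by count up to some cap.

```python
from collections import Counter

# Made-up group counts for illustration only.
ALL_COUNTS = {
    "genre/fiction": 900,
    "genre/poetry": 500,
    "styles/realism": 300,
    "styles/fantasy/high-fantasy": 40,
    "styles/fantasy/medieval": 25,
}

def top_groups(counts, max_groups):
    """Keep only the max_groups largest groups (assumed truncation rule)."""
    return dict(Counter(counts).most_common(max_groups))

def filter_by_prefix(groups, prefix):
    """Client-side filtering: keep groups whose value starts with prefix."""
    return {k: v for k, v in groups.items() if k.startswith(prefix)}

# Filtering after truncation loses the groups of interest entirely.
truncated = filter_by_prefix(top_groups(ALL_COUNTS, 3), "styles/fantasy/")

# Filtering before truncation (what a server-side filter would do) keeps them.
complete = filter_by_prefix(ALL_COUNTS, "styles/fantasy/")

if __name__ == "__main__":
    print(truncated)
    print(complete)
```

Under these assumptions, the cap would have to be at least as large as the whole taxonomy before client-side filtering could be trusted to be complete.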

Additional context

See the originating discussion at: https://vespatalk.slack.com/archives/C01QNBPPNT1/p1654876998447789

angelf, Jun 15 '22 10:06

Thanks for the nice writeup, @angelf. This request also relates to https://github.com/vespa-engine/vespa/issues/15658.

jobergum, Jun 16 '22 09:06