vespa icon indicating copy to clipboard operation
vespa copied to clipboard

Add support for filtering in grouping over multi-valued fields

Open jobergum opened this issue 3 years ago • 9 comments

Currently if you group on a multi-valued field grouping uses all values.

There are use cases where one wants to filter the values and only group over the values which matches the filter. A start could be to have functionality similar to matched-elements-only which controls what part of the multi-valued field is returned in the summary, it could be extended to only expose matched elements to grouping. The best would end state would be able to add a filtering constraint language in the grouping language.

jobergum avatar Dec 04 '20 09:12 jobergum

I'll try to make up an example to add more context and so you can continue scoping the goal of this issue.

Imagine I have this schema:

schema my_doc {
    document my_doc {
        field id type string {
            indexing: attribute | summary
            stemming: none
            attribute {
                fast-search
            }
        }

        struct my_struct {
            field my_type type string {}
            field my_value type int {}
        }

        field my_array type array<my_struct> {
            indexing: summary
            stemming: none

            struct-field my_type {
                indexing: attribute | summary
            }

            struct-field my_value {
                indexing: attribute | summary
            }
        }
    }
}

I would like to find documents of type my_doc by filtering on their multivalue field my_array, and then I would like to group each my_struct struct and aggregate on them separately, instead of aggregating on the documents that matched.

For example, let's take following YQL query:

select * from my_doc
| all(group(my_array.my_type) max(10000) each(
    output(avg(my_array.my_value))
  ));

For this query, Vespa will match all my_doc documents and then group them by my_array.my_type. This means that each value found of my_array.my_type will have it's own group of my_doc documents, which can appear in other groups for other values. Then, when avg(my_array.my_value) is calculated, it's calculated on all of the my_struct structs in the my_array field of all the documents in the group.

This is all known, but I'm just giving details. Personally, I think this is very misleading, but this functionality can stay as is and we can add more functionality by expanding YQL. As such, I have the following ideas for which you can give feedback on.

* Grouping on multivalue fields.

An Elasticsearch style could be taken. You "nest" inside the multivalue field and once there you're grouping and aggregating at this level, instead of on the document level. Here's an example of how the YQL could look like for retrieving the data I want:

select * from my_doc
| all(nested(my_array) all(
    group(my_array.my_type) max(10000) each(
      output(avg(my_array.my_value))
    )
  ));

I can also imagine this new operator being used at any level. For example, group per document and then group and aggregate their multivalue fields:

select * from my_doc
| all(group(id) max(10000)
    all(nested(my_array) all(
      group(my_array.my_type) max(10000) each(
        output(avg(my_array.my_value))
      )
    ));
  );

* Filtering on multivalue fields.

As a next step, it would be great to allow to filter what multivalue fields you want to group. For example, I only want to group on my_struct with a specific my_struct.my_type.

select * from my_doc
where weightedSet(my_array.my_type, {"FOO": 1})
| all(nested(my_array) where(
    (my_array.my_type == "FOO") all(
      group(my_array.my_type) max(10000) each(
        output(avg(my_array.my_value))
      )
    ))
  );

I hope this is enough context for you to continue scoping this ticket. I've not worked enough with Vespa and YQL to be sure that my proposals for expanding YQL are the best options, so feel free to give your feedback and opinions on how the YQL should look.

Kind regards!

MauDagos avatar Feb 19 '21 12:02 MauDagos

+1 on this request. I have a document (describing an item with a SKU) which has a array of structs (attributes) that describes where in a menu-hierarchy an items fits (can be 1 to n places for 1 single document/SKU).

For example one article/document (a plug) needs to appear on these 3 levels in the menu: "menuetree": [ { "level1": "Service", "level2": "Service/maintenance", "level3": "Oil pan drain plug" }, { "level1": "Engine", "level2": "Basic engine", "level3": "Plug, crankcase" }, { "level1": "Engine", "level2": "Lubrication", "level3": "Oil pan drain plug" } ]

on YQL quey i filter on level1 Engine but since grouping (in this case on level2) is on document level (the item/SKU) "Service/maintenance" (even though it belongs to Service) will also be included in the result. Now i can solve that with post process the resulting answer and clean out non matching entries etc. Still would have been nice if array/map followed the filter criteria.

lundin avatar Feb 19 '21 15:02 lundin

Thanks, I like these suggestions!

We hope to be able to start working on this in our next sprint starting February 9.

bratseth avatar Feb 24 '21 15:02 bratseth

I agree that filtering support in multivalued fields would be nice. However I would like to point out that you can achieve what you want by using parent-child relation, see https://docs.vespa.ai/en/parent-child.html.

Then you would query over the child and you should be able to achieve the behaviour you want.

  • With array you are only able to achieve 1 extra level as we do not support deeper nesting in indexed mode.
  • With parent-child relations you are able to support arbitrary deep nesting, as parents can have their own parents too.
  • However parent-child solutions have the restriction that they must all fit on each node, but that might often be fine if they just contain som small meta data.
  • A parent-child relation will also be faster as you would then only consider matching documents. Especially if the filter ratio is large. If only 1/N values will actually be grouped the gain will be close to N.

If most of your data are in the array<my_struct> field I would probably choose a parent-child approach. But if it is a small part of your document, not naturally separated I would go for the array approach.

This being said I do think that adding a filter criteria to grouping will make sense. Just make sure you select the proper tool for the job. Going out in the garage for the sledgehammer when you need a small nail in the wall for hanging a small picture might be impressive, but probably not the better solution in the long run.

baldersheim avatar Mar 03 '21 11:03 baldersheim

Thanks for the reminder of parent-child. However, for our use-case, it was recommended by the Vespa team itself that we don't follow that structure because of two reasons (which can't be seen from my simple example in my first comment):

  1. We're dealing with millions and millions of documents. With parent-child relationship parent documents need to be on all nodes, so scaling becomes problematic.
  2. Imported fields from parent documents can only be attributes, which do not support free text matching. For our use case we need free text search on fields on both the parent and the child documents. There's a ticket open for this feature: https://github.com/vespa-engine/vespa/issues/12333

MauDagos avatar Mar 03 '21 12:03 MauDagos

I think this one is due now.

I suggest we split this in 2 as it is 2 orthogonal features. 1 - Nesting. This is the natural behavior you would expect when grouping and aggregating over structured data. If we allow the semantic change for structured fields this can be done without any yql changes. All the structured information is in the schema already and can be extracted directly runtime in the backend.

2 - Filtering. This is an orthogonal feature that might also make sense for multi value primitive fields.

baldersheim avatar May 31 '23 05:05 baldersheim

Let's use this issue for filtering (item 2).

I created https://github.com/vespa-engine/vespa/issues/27238 on nesting (item 1).

bratseth avatar May 31 '23 09:05 bratseth

+1 on this request.

ernestas-poskus avatar Sep 22 '23 13:09 ernestas-poskus

we will continue working on this in Q4

kkraune avatar Sep 27 '23 12:09 kkraune