soda-core icon indicating copy to clipboard operation
soda-core copied to clipboard

Group by use cases

Open tombaeyens opened this issue 2 years ago • 1 comments

Here I provide a list of use cases related to group by. Note that some of these are already available, while for other use cases we are still collecting requirements. I would like to hear which of these variants would cover your actual use cases and which are most important for you. Also I want a description in your own words of your group by use case: eg For every combination of shop & country, we want to verify that ...

Group-by query in failed rows

checks:
  - failed rows:
      name: Every shop must have at least 10 cases and 5 products
      fail query: |
        WITH groups AS (
          SELECT 
            shop, 
            country,
            sum(cases) as total_cases,
            sum(products) as total_products
          FROM CUSTOMERS
          GROUP BY shop, country
        )
        SELECT * 
        FROM groups 
        WHERE total_cases < 10 or total_products < 5

This will fail if there are rows present in the returned query and pass if there are no rows coming from the query.
In case there are rows returned, those can be captured as failed rows for diagnostic purposes. It's possible to configure where those failed rows are stored, but it may involve plugging in your own, custom sampler implementation.

This check will not take any historic group information into account. Each scan, the groups are computed and the check can only verify the groups present at that time.

(This is support in the current version of Soda Core)

Example diagnostic result for a single scan in this flavor:

shop country total_cases total_products
m&s UK 3 17
hema NL 5 3

Metrics based group by checks

(This is idea stage and not yet implemented)

checks for CUSTOMERS:
  - group by:
       query: |
               SELECT 
                 shop, 
                 country,
                 sum(cases) as total_cases,
                 sum(products) as total_products
               FROM CUSTOMERS
               GROUP BY shop, country
       group fields: [ 'shop', 'country' ]
       checks:
         - total_cases < 10
         - total_products < 5

This example would to the same check as above but with 2 differences:

Because of this different configuration, Soda Core is now able to capture the numeric metric values per group. Soda Core could send the individual group values for total_cases and total_products to the metric store in Soda Cloud so that all features of metric based checks become available on group values like:

  • checking against historic diffs
  • anomaly detection

A separate but (i think) less important difference is that all checks would be modeled as individual checks. This is in contrast to the failed rows approach above where all checks are baked into the query and hence in a single check.

The downside with this approach is that we would have to place limits on the size of the group-by query result as we have to store it in our metrics store on Soda Cloud.

In this case the diagnostics would be split up by check:

Diagnostics for total_cases < 10:

shop country total_cases
m&s UK 3
hema NL 5

Diagnostics for total_products < 5:

shop country total_products
m&s UK 17
hema NL 3

Group changes

A 3rd flavor of group by checks would be to alert on groups appearing or disappearing.

Example:

checks for CUSTOMERS:
  - group by:
       query: |
               SELECT 
                 shop, 
                 country,
                 sum(cases) as total_cases,
                 sum(products) as total_products
               FROM CUSTOMERS
               GROUP BY shop, country
       group fields: [ 'shop', 'country' ]
       checks:
         - when groups appear: 
             level: warn
         - when groups disappear: 
             level: fail

Questions

To what extend to the above flavors of group by checks cover your use cases?

To what extend do the diagnostics fit with your issue resolution workflow?

tombaeyens avatar Sep 02 '22 12:09 tombaeyens

I have one addition to what we already discussed about group by. Since BigQuery allows to use type STRUCT, we were thinking about outputing data in that way. Table created that way would have columns organized like in this example: Screenshot from 2022-09-02 17-35-52

machytkafitanalytics avatar Sep 02 '22 15:09 machytkafitanalytics