soda-core
soda-core copied to clipboard
Group by use cases
Here I provide a list of use cases related to group by. Note that some of these are already available, while for other use cases we are still collecting requirements. I would like to hear which of these variants would cover your actual use cases and which are most important for you. Also I want a description in your own words of your group by use case: eg For every combination of shop & country, we want to verify that ...
Group-by query in failed rows
checks:
- failed rows:
name: Every shop must have at least 10 cases and 5 products
fail query: |
WITH groups AS (
SELECT
shop,
country,
sum(cases) as total_cases,
sum(products) as total_products
FROM CUSTOMERS
GROUP BY shop, country
)
SELECT *
FROM groups
WHERE total_cases < 10 or total_products < 5
This will fail if there are rows present in the returned query and pass if there are no rows coming from the query.
In case there are rows returned, those can be captured as failed rows for diagnostic purposes.
It's possible to configure where those failed rows are stored, but it may involve plugging in your own, custom sampler implementation.
This check will not take any historic group information into account. Each scan, the groups are computed and the check can only verify the groups present at that time.
(This is support in the current version of Soda Core)
Example diagnostic result for a single scan in this flavor:
shop | country | total_cases | total_products |
---|---|---|---|
m&s | UK | 3 | 17 |
hema | NL | 5 | 3 |
Metrics based group by checks
(This is idea stage and not yet implemented)
checks for CUSTOMERS:
- group by:
query: |
SELECT
shop,
country,
sum(cases) as total_cases,
sum(products) as total_products
FROM CUSTOMERS
GROUP BY shop, country
group fields: [ 'shop', 'country' ]
checks:
- total_cases < 10
- total_products < 5
This example would to the same check as above but with 2 differences:
Because of this different configuration, Soda Core is now able to capture the numeric metric values per group. Soda Core could send the individual group values for total_cases
and total_products
to the metric store in Soda Cloud so that all features of metric based checks become available on group values like:
- checking against historic diffs
- anomaly detection
A separate but (i think) less important difference is that all checks would be modeled as individual checks. This is in contrast to the failed rows approach above where all checks are baked into the query and hence in a single check.
The downside with this approach is that we would have to place limits on the size of the group-by query result as we have to store it in our metrics store on Soda Cloud.
In this case the diagnostics would be split up by check:
Diagnostics for total_cases < 10
:
shop | country | total_cases |
---|---|---|
m&s | UK | 3 |
hema | NL | 5 |
Diagnostics for total_products < 5
:
shop | country | total_products |
---|---|---|
m&s | UK | 17 |
hema | NL | 3 |
Group changes
A 3rd flavor of group by checks would be to alert on groups appearing or disappearing.
Example:
checks for CUSTOMERS:
- group by:
query: |
SELECT
shop,
country,
sum(cases) as total_cases,
sum(products) as total_products
FROM CUSTOMERS
GROUP BY shop, country
group fields: [ 'shop', 'country' ]
checks:
- when groups appear:
level: warn
- when groups disappear:
level: fail
Questions
To what extend to the above flavors of group by checks cover your use cases?
To what extend do the diagnostics fit with your issue resolution workflow?
I have one addition to what we already discussed about group by. Since BigQuery allows to use type STRUCT, we were thinking about outputing data in that way. Table created that way would have columns organized like in this example: