soda-core
Allow Anomaly Detection to be performed on any metric
In soda-core, metrics are mostly implicit (i.e. they are derived by checks). Setting up AD on any of the metrics derived from a check is therefore tricky from a language (and potentially an execution) standpoint.
We have a few challenges:
- AD refers to a metric in SodaCL: `anomaly score for {metric} < {threshold}`. In its current form, AD expects to get this metric from Soda Cloud's metric store.
- If we want to perform AD on, say, "the average order price in the orders table", we cannot simply write `anomaly score for avg(order_price)`, as the AD check does not have a way to derive the metric.
How could we solve it?
- We could decide to add a metric derivation step in `AnomalyCheck` that derives a metric when the parser encounters a known metric derivation syntax such as `avg(order_price)`, `sum(order_price)`, `duplicate_count(order_price)`, and so on. This would mean that AD does not only get metrics from Cloud but must also be able to derive its own metric at evaluation time and use historical metrics from past evaluations.
- We could decide to allow AD to refer or "tag on" to a check so that it uses that check's metric instead of deriving its own. This method is actually closer to how AD works currently in Soda Cloud v2: users push metrics or set up monitors which derive metrics, and AD is set up on Cloud based on the available set of metrics.
- Could we have a situation where we have both? AD could be configured as a top-level key and therefore derive its own metric, or it could be added to any check at a nested level and use the metric from that check (a sketch of both options follows this list).
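To make the two options a bit more concrete, here is a minimal, hypothetical Python sketch. None of these names (`AnomalyCheck`, `Metric`, `derive_metric`, `anomaly_score`) are soda-core's actual API; this only illustrates "derive the metric at evaluation time" versus "reuse the metric owned by another check":

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Metric:
    identity: str                 # e.g. "avg(order_price)"
    value: Optional[float] = None


def anomaly_score(value: float, history: List[float]) -> float:
    # Placeholder scoring: a real implementation would feed the current
    # value plus the historical series into the AD model.
    mean = sum(history) / len(history)
    return abs(value - mean)


class AnomalyCheck:
    """Hypothetical sketch: an anomaly check that either derives its own
    metric (option 1) or reuses the metric of another check (option 2)."""

    def __init__(
        self,
        derive_metric: Optional[Callable[[], Metric]] = None,  # option 1
        source_check_metric: Optional[Metric] = None,          # option 2
    ):
        if (derive_metric is None) == (source_check_metric is None):
            raise ValueError("Provide exactly one metric source")
        self.derive_metric = derive_metric
        self.source_check_metric = source_check_metric

    def evaluate(self, historical_values: List[float]) -> float:
        # Option 1: compute the metric now, at evaluation time.
        # Option 2: read the value already computed by the tagged-on check.
        metric = (
            self.derive_metric()
            if self.derive_metric is not None
            else self.source_check_metric
        )
        return anomaly_score(metric.value, historical_values)


if __name__ == "__main__":
    # Option 1: the check derives avg(order_price) itself (stubbed here).
    check = AnomalyCheck(derive_metric=lambda: Metric("avg(order_price)", 42.0))
    print(check.evaluate([40.0, 41.0, 43.0]))
```

In both variants the historical values would still come from the metric store; the only difference is who owns the current measurement.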
I'm curious what your thoughts and opinions are @vijaykiran and @tombaeyens on this as this is going to be purely a soda-core problem to solve.
I don't think there is a problem in the language. The most basic form of an anomaly check should be that, for a given metric, there are no anomalies.
There are several ways we can express this. Previously we talked about this notation:
```yaml
checks for CUSTOMERS:
  - anomaly score for row count < default
  - anomaly score for missing(id) < default
  - anomaly score for missing(other_column) < default
```
In general, this notation is:
```yaml
checks for CUSTOMERS:
  - anomaly score for {metric} < default
```
I actually think we should reopen that discussion and consider the alternative notation:
```yaml
checks for CUSTOMERS:
  - row count anomaly detection
  - missing(id) anomaly detection
```
More generically:
```yaml
checks for CUSTOMERS:
  - {metric} anomaly detection
```
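Purely to illustrate that both notations can map onto the same internal representation, here is a small hypothetical Python sketch. This is not soda-core's actual parser; the patterns and function name are made up for the example:

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for the two candidate notations discussed above.
# Both resolve to the same thing: "run anomaly detection on this metric".
ANOMALY_SCORE_PATTERN = re.compile(
    r"^anomaly score for (?P<metric>.+?) < (?P<threshold>.+)$"
)
ANOMALY_DETECTION_PATTERN = re.compile(
    r"^(?P<metric>.+?) anomaly detection$"
)


def parse_anomaly_check(check_line: str) -> Optional[Tuple[str, str]]:
    """Return (metric_expression, threshold), or None if the line is not
    an anomaly check. 'default' is used when no threshold is given."""
    match = ANOMALY_SCORE_PATTERN.match(check_line)
    if match:
        return match.group("metric"), match.group("threshold")
    match = ANOMALY_DETECTION_PATTERN.match(check_line)
    if match:
        return match.group("metric"), "default"
    return None


if __name__ == "__main__":
    print(parse_anomaly_check("anomaly score for missing(id) < default"))
    # ('missing(id)', 'default')
    print(parse_anomaly_check("row count anomaly detection"))
    # ('row count', 'default')
```

Whichever surface syntax is chosen, the parser ends up with the same pair: a metric expression and a threshold.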
Later more... meeting now
Thanks for this @tombaeyens. So I think language-wise it's not really an issue. However, I think the tricky part is going to be the metric derivation.
Currently, if we do any of:
- `anomaly score for missing(id) < default`
- `avg(some column) anomaly detection`

or even other checks that have some kind of nested properties (like missing values), this notation becomes a little complicated, and most importantly it is unclear how the metric is derived.
For the examples above, how might we make sure that the metric is derived? We would probably have to figure out how to parse the nested metric, make sure our execution calls upon the right metric collection, and so on.
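As a rough illustration of that parsing step, here is a minimal, hypothetical Python sketch. None of these names (`DerivedMetric`, `parse_metric`, `to_sql`, `SQL_TEMPLATES`) are soda-core APIs; it only shows how a nested metric expression such as `avg(order_price)` or `missing(id)` could be turned into something the execution layer can collect:

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DerivedMetric:
    aggregation: str          # e.g. "avg", "sum", "missing"
    column: Optional[str]     # None for table-level metrics like row count


# Hypothetical mapping from SodaCL-style metric syntax to a SQL fragment
# the execution layer could add to the scan's aggregation query.
SQL_TEMPLATES = {
    "avg": "AVG({column})",
    "sum": "SUM({column})",
    "duplicate_count": "COUNT({column}) - COUNT(DISTINCT {column})",
    "missing": "SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)",
}

METRIC_PATTERN = re.compile(r"^(?P<agg>\w+)\((?P<column>[\w ]+)\)$")


def parse_metric(expression: str) -> DerivedMetric:
    """Parse 'avg(order_price)'-style syntax; 'row count' is table-level."""
    if expression == "row count":
        return DerivedMetric(aggregation="row_count", column=None)
    match = METRIC_PATTERN.match(expression)
    if not match:
        raise ValueError(f"Unknown metric syntax: {expression!r}")
    return DerivedMetric(match.group("agg"), match.group("column"))


def to_sql(metric: DerivedMetric) -> str:
    """Render the SQL fragment needed to collect this metric in a scan."""
    if metric.aggregation == "row_count":
        return "COUNT(*)"
    return SQL_TEMPLATES[metric.aggregation].format(column=metric.column)


if __name__ == "__main__":
    m = parse_metric("avg(order_price)")
    print(m, "->", to_sql(m))   # -> AVG(order_price)
```

The open question in this issue is whether the AD check itself owns this derivation, or whether it tags on to another check that already does it.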
I don't have a very strong opinion on how to do it; on my side, as long as AD gets the data handed over from core as it does now, nothing changes. So I would defer to you @tombaeyens and @vijaykiran to come up with the most soda-core-ish design pattern you want to follow.