
Allow Anomaly Detection to be performed on any metrics

Open bastienboutonnet opened this issue 3 years ago • 2 comments

In soda-core, metrics are mostly implicit (i.e. they are derived by checks). Setting up anomaly detection (AD) on any of the metrics derived from a check is therefore tricky from a language (and potentially execution) standpoint.

We have a few challenges:

  • AD refers to a metric in SodaCL: anomaly score for {metric} < {threshold}
  • In its current form, AD expects to get this metric from Soda Cloud's metric store.
  • If we want to perform AD on, say, "the average order price in the orders table", we cannot simply write anomaly score for avg(order_price), because the AD check has no way to derive the metric.

How could we solve it?

  • We could add a metric derivation step in AnomalyCheck that derives a metric when the parser encounters a known metric derivation syntax such as avg(order_price), sum(order_price), duplicate_count(order_price) and so on. This would mean that AD no longer gets its metrics only from Cloud: it must be able to derive its own metric at evaluation time and use historical metrics from past evaluations.
  • We could allow AD to refer or "tag on" to a check, so that it uses that check's metric instead of deriving its own. This method is actually closer to how AD currently works in Soda Cloud v2: users push metrics or set up monitors which derive metrics, and AD is set up on Cloud based on the available set of metrics.
  • Could we have a situation where we have both? AD could be configured as a top-level key and therefore derive its own metric, or it could be added to any check as a nested level and use the metric from that check.
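To make the "both" option concrete, here is a minimal sketch of how an anomaly check could pick its metric. All class and attribute names are hypothetical and are not soda-core's actual API; the only assumption is that a check exposes the metric expression it was declared with.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    # Hypothetical stand-in for a regular check and the metric it derives.
    metric: str  # e.g. "avg(order_price)"


@dataclass
class AnomalyCheck:
    # Set when AD is declared as a top-level check with its own metric.
    metric: Optional[str] = None
    # Set when AD is nested under ("tagged on" to) an existing check.
    parent: Optional[Check] = None

    def resolve_metric(self) -> str:
        """Return the metric expression AD should be evaluated on."""
        if self.metric is not None:
            return self.metric  # option 1: AD derives its own metric
        if self.parent is not None:
            return self.parent.metric  # option 2: reuse the parent check's metric
        raise ValueError("anomaly check has neither its own metric nor a parent check")


top_level = AnomalyCheck(metric="avg(order_price)")
nested = AnomalyCheck(parent=Check(metric="missing(id)"))
```

In this sketch the top-level form takes precedence when both are set; whether that is the right rule is a design decision left open here.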

I'm curious what your thoughts and opinions are @vijaykiran and @tombaeyens on this as this is going to be purely a soda-core problem to solve.

bastienboutonnet, Apr 20 '22 09:04

I don't think there is a problem in the language. The most basic form of an anomaly check should be that, for a given metric, there should be no anomalies.

There are several ways we can express this. Previously we talked about this notation:

checks for CUSTOMERS:
  - anomaly score for row count < default
  - anomaly score for missing(id) < default
  - anomaly score for missing(other_column) < default

In general, this notation is:

checks for CUSTOMERS:
  - anomaly score for {metric} < default

I actually think we should reopen that discussion and consider the alternative notation:

checks for CUSTOMERS:
  - row count anomaly detection
  - missing(id) anomaly detection

Or, more generically:

checks for CUSTOMERS:
  - {metric} anomaly detection
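Both notations carry the same information, so a parser could normalize them into one structure. The following is only a sketch under that assumption: the regular expressions and the `AnomalyCheck` tuple are hypothetical and do not reflect soda-core's real parser, and the suffix form is assumed to imply the `default` threshold.

```python
import re
from typing import NamedTuple


class AnomalyCheck(NamedTuple):
    metric: str     # e.g. "row count" or "missing(id)"
    threshold: str  # e.g. "default"


# "anomaly score for {metric} < {threshold}"
SCORE_FORM = re.compile(r"^anomaly score for (?P<metric>.+?) < (?P<threshold>\S+)$")
# "{metric} anomaly detection" (threshold assumed to be "default")
SUFFIX_FORM = re.compile(r"^(?P<metric>.+?) anomaly detection$")


def parse_anomaly_check(line: str) -> AnomalyCheck:
    """Normalize either notation into the same (metric, threshold) pair."""
    text = line.strip()
    match = SCORE_FORM.match(text)
    if match:
        return AnomalyCheck(match["metric"], match["threshold"])
    match = SUFFIX_FORM.match(text)
    if match:
        return AnomalyCheck(match["metric"], "default")
    raise ValueError(f"not an anomaly check: {line!r}")
```

With this, `anomaly score for missing(id) < default` and `missing(id) anomaly detection` parse to the same value, which suggests the choice between the two is about readability rather than expressiveness.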

Later more... meeting now

tombaeyens, Apr 25 '22 12:04

Thanks for this @tombaeyens. So I think language-wise it's not really an issue. However, I think the tricky part is going to be the metric derivation.

Currently, if we do any of:

  - anomaly score for missing(id) < default
  - avg(some column) anomaly detection

or even other checks that have some kind of nested properties, like missing values, this notation becomes a little complicated, and most importantly we need to figure out how the metric is derived.

In the examples above, how might we make sure that the metric is derived? We probably have to figure out how to parse the nested metric, make sure our execution calls upon the right metric collection, and so on.

I don't have a very strong opinion on how to do it. From my side, as long as AD gets the data handed over from core as it does now, nothing changes, so I would defer to you @tombaeyens and @vijaykiran to come up with the most soda-core-ish design pattern you want to follow.
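As a rough illustration of what "parse the nested metric and call the right metric collection" could look like, here is a sketch that maps a metric expression to a SQL aggregate. The template table, regular expression, and function name are all hypothetical (not soda-core internals), and identifiers are interpolated without validation, which a real implementation would need to handle.

```python
import re

# Hypothetical mapping from metric name to a SQL aggregate template.
SQL_TEMPLATES = {
    "row count": "COUNT(*)",
    "avg": "AVG({column})",
    "sum": "SUM({column})",
    # COUNT(column) skips NULLs, so the difference is the number of missing values.
    "missing": "COUNT(*) - COUNT({column})",
}

# Matches "name" or "name(column)", e.g. "row count" or "avg(order_price)".
METRIC_FORM = re.compile(r"^(?P<name>[a-z_ ]+?)\s*(?:\((?P<column>[^()]+)\))?$")


def derive_metric_sql(table: str, expression: str) -> str:
    """Build the query that would compute the metric named in `expression`."""
    match = METRIC_FORM.match(expression.strip())
    if match is None:
        raise ValueError(f"cannot parse metric expression: {expression!r}")
    name, column = match["name"], match["column"]
    template = SQL_TEMPLATES.get(name)
    if template is None:
        raise ValueError(f"unknown metric: {name!r}")
    if "{column}" in template and column is None:
        raise ValueError(f"metric {name!r} needs a column argument")
    # NOTE: no identifier quoting or validation; this is only a sketch.
    return f"SELECT {template.format(column=column)} FROM {table}"
```

The value returned by such a query would then be stored alongside earlier measurements, so that AD receives the same kind of history it gets from Soda Cloud today.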

bastienboutonnet, Apr 25 '22 15:04