
If a date filter is implied by the `days_back` parameter in a data volume alert, filter by the `timestamp_column` in the `monitored_table` CTE

garfieldthesam opened this issue 2 years ago

Is your feature request related to a problem? Please describe.

Unless I'm misunderstanding how the package works, adding a `days_back` config to a data volume anomaly detection test should mean that no more than `days_back` days' worth of data is used to evaluate the test.
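For concreteness, the kind of test configuration I'm describing would look something like the following in a `schema.yml` (the model and column names here are hypothetical; I'm assuming elementary's `volume_anomalies` test accepts the `timestamp_column` and `days_back` parameters discussed in this issue):

```yaml
models:
  - name: very_large_table
    tests:
      - elementary.volume_anomalies:
          timestamp_column: partition_ts  # hypothetical partition column
          days_back: 7
```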

I created a data volume test on a very large table that is well-optimized when queried with a filter on the table's partition timestamp column. I was surprised that my tests against this table, with `days_back` parameters configured, were timing out. Digging into the code executed on the Databricks cluster, I found that the first CTE was the following:

```sql
with monitored_table as (
  select
    *
  from
    {{very_large_table}}
),
```

If I hadn't set an appropriate timeout on the cluster where these tests ran, this could have run up very large infra costs!

It may be the case that other anomaly tests follow a similar pattern, but I have not tested them.

This makes adoption of elementary more painful than it needs to be because:

  • It creates a predictable (and potentially costly) failure state for anomaly detection tests
  • The developer must dig into the compiled code and troubleshoot why a test that should run easily timed out
  • The documentation doesn't make it clear that the `days_back` parameter doesn't filter the table in question using the `timestamp_column`, as one might expect

Describe the solution you'd like

Given that the test already knows the `timestamp_column` and the number of `days_back`, I would expect the compiled code to look something like the following (in Spark SQL syntax), filtering the table down to the date range needed to evaluate the test:

```sql
monitored_table as (
  select
    *
  from
    {{very_large_table}}
  where
    -- keep only the complete time buckets inside the lookback window
    date_trunc('{{time_bucket}}', {{timestamp_column}}) between
      date_trunc('{{time_bucket}}', date_sub(current_date(), {{days_back}} + 1)) and
      date_trunc('{{time_bucket}}', date_sub(date_trunc('{{time_bucket}}', current_date()), 1))
),
```
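To sanity-check the boundary logic of a filter like this, here is a small Python sketch (this is not elementary code; `bucket_start` and `monitored_range` are names I made up) that computes the range of bucket starts the proposed `where` clause would keep:

```python
from datetime import date, timedelta

def bucket_start(d: date, time_bucket: str) -> date:
    """Truncate a date to the start of its bucket (day, or Monday-based week),
    mirroring Spark's date_trunc for these two bucket sizes."""
    if time_bucket == "day":
        return d
    if time_bucket == "week":
        return d - timedelta(days=d.weekday())
    raise ValueError(f"unsupported bucket: {time_bucket}")

def monitored_range(today: date, days_back: int, time_bucket: str) -> tuple[date, date]:
    """Return the (lower, upper) bucket starts the filter would keep:
    from days_back + 1 days ago up to the last complete bucket before today."""
    lower = bucket_start(today - timedelta(days=days_back + 1), time_bucket)
    # Start of the current bucket, step back one day, truncate again:
    # this lands on the start of the last fully elapsed bucket.
    upper = bucket_start(bucket_start(today, time_bucket) - timedelta(days=1), time_bucket)
    return lower, upper
```

For daily buckets this keeps yesterday and the `days_back` days before it; for weekly buckets it keeps only fully elapsed weeks, which is the point of truncating both bounds.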

Describe alternatives you've considered

  1. Update the documentation to make it clearer that the days_back parameter does not filter the table in question, and those filters need to be applied using the where_expression parameter instead
  2. Do nothing; it may be complicated to implement this filtering given all of the parameters that can be applied to a test
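For alternative 1, the documented workaround might look something like the following (assuming `where_expression` is passed through verbatim as a SQL predicate; the column name is hypothetical):

```yaml
models:
  - name: very_large_table
    tests:
      - elementary.volume_anomalies:
          timestamp_column: partition_ts  # hypothetical partition column
          days_back: 7
          # Manually apply the partition filter that days_back does not apply today:
          where_expression: "partition_ts >= date_sub(current_date(), 8)"
```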

Additional context

Somewhat related to, but distinct from, issue #1329.

Would you be willing to contribute this feature?

May be willing to contribute at some point Oct 2023 or later.

garfieldthesam · Sep 13 '23