If a date filter is implied by the `days_back` parameter in a data volume alert, filter by the `timestamp_column` in the `monitored_table` CTE
Is your feature request related to a problem? Please describe.
Unless I'm misunderstanding how the package works, adding a days_back config to a data volume anomaly detection test means no more than days_back days worth of data will be used in evaluating the test.
I created a data volume test on a very large table that's well-optimized when queried with a filter on the table's partition timestamp column. I was surprised that my tests against this table with days_back parameters configured were timing out. I was then surprised to see that, in the code executed on the Databricks cluster, the first CTE was the following:
with monitored_table as (
select
*
from
{{very_large_table}}
),
If I hadn't set an appropriate timeout on the cluster where these tests ran it could have run up very large infra costs!
It may be the case that other anomaly tests follow a similar pattern but I have not tested.
This makes adoption of elementary more painful than it needs to be because:
- It creates a predictable (and potentially costly) failure state for anomaly detection tests
- The developer must dig into the compiled code and troubleshoot why a test that should run easily timed out
- The documentation doesn't make it clear that the
days_backparameter doesn't filter the table in question using thetimestamp_columnas one might expect
Describe the solution you'd like
Given that the test already knows the timestamp_column and the number of days_back I would expect the compiled code to look something like the following (in Spark SQL syntax), designed to filter the table down to the date range needed to evaluate the test:
monitored_table as (
select
*
from
{{very_large_table}}
where
date_trunc('{{time_bucket}}', {{timestamp_column}}) between
date_trunc('{{time_bucket}}', datesub(current_date(), {{days_back}} + 1) and
last_day('{{time_bucket}}', datesub(date_trunc('{{time_bucket}}', current_date()), 1))
),
Describe alternatives you've considered
- Update the documentation to make it clearer that the
days_backparameter does not filter the table in question, and those filters need to be applied using thewhere_expressionparameter instead - Do nothing--it's may be complicated to implement this filtering given all of the parameters that can be applied to a test
Additional context Somewhat related to but distinct from issue #1329
Would you be willing to contribute this feature? May be willing to contribute at some point Oct 2023 or later