Applying `hour_of_week` seasonality causes `days_back` and `training_period` configurations to be ignored
**Describe the bug**
When using `hour_of_week` seasonality, the training set goes back to 2016, in violation of the `days_back` and/or `training_period` configurations.
**To Reproduce**
Steps to reproduce the behavior:

1. Run an anomaly test without seasonality. Example:

```yaml
- elementary.dimension_anomalies:
    name: insert_name
    dimensions: [dimension_name]
    time_bucket:
      period: hour
      count: 1
    timestamp_column: created_ts
    days_back: 366
    backfill_days: 1
    anomaly_direction: both
    anomaly_sensitivity: 2
    severity: warn
    error_if: '>4'
    warn_if: '>2'
```
Look at the compiled code and see that in the create table statement's first CTE, this is the code used to generate the date buckets:
```sql
(
  select
    explode(
      sequence(
        cast('2023-01-02T00:00:00+00:00' as timestamp),
        cast('2024-01-03T22:00:00+00:00' as timestamp),
        interval 1 hour
      )
    ) as edr_bucket_start
)
```
2. Run an anomaly test with `hour_of_week` seasonality. Example:

```yaml
- elementary.dimension_anomalies:
    name: insert_name
    dimensions: [dimension_name]
    time_bucket:
      period: hour
      count: 1
    timestamp_column: created_ts
    days_back: 366
    backfill_days: 1
    anomaly_direction: both
    anomaly_sensitivity: 2
    severity: warn
    error_if: '>4'
    warn_if: '>2'
    seasonality: hour_of_week
```
Look at the compiled code and see that in the create table statement's first CTE, this is the code used to generate the date buckets:
```sql
(
  select
    explode(
      sequence(
        cast(
          date_trunc('hour', cast('2016-12-28 00:00:00' as timestamp)) as timestamp
        ),
        cast('2024-01-03 22:09:07.509656+00:00' as timestamp),
        interval 1 hour
      )
    ) as edr_bucket_start
)
```
**Expected behavior**
I would not expect the window for the query to extend back six years. One year contains roughly 52 occurrences of each hour-of-week combination and should therefore be more than enough to generate a reasonably reliable seasonality adjustment.

Six years would make more sense if the test were computing week-of-year-and-hour seasonality rather than hour-of-week seasonality, but that doesn't make sense to me as a way to compute this seasonality, and it isn't my understanding of how it works after reading the Elementary docs.
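To sanity-check the arithmetic: assuming `hour_of_week` seasonality groups hourly buckets by (weekday, hour) — my reading of the docs, so an assumption — one year gives about 52 training points for each of the 7 × 24 = 168 buckets. A quick sketch:

```python
# Count hourly buckets per (weekday, hour) combination over one 366-day year,
# starting from the window start the non-seasonal run used (2023-01-02).
from datetime import datetime, timedelta

start = datetime(2023, 1, 2)
counts = {}
t = start
while t < start + timedelta(days=366):
    key = (t.weekday(), t.hour)  # the hour-of-week bucket
    counts[key] = counts.get(key, 0) + 1
    t += timedelta(hours=1)

print(len(counts))           # 168 distinct hour-of-week buckets
print(min(counts.values()))  # 52 training points per bucket at minimum
```

So even a single year of hourly data already provides ~52 samples per hour-of-week bucket.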
**Environment (please complete the following information):**

dbt packages:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: LewisDavies/upstream_prod
    version: 0.7.2
  - package: calogica/dbt_expectations
    version: 0.10.0
  - package: elementary-data/elementary
    version: 0.13.0
```

dbt version: 1.6.7
dbt-databricks version: 1.6.6
---

Hey @garfieldthesam,
I understand why this is unexpected, but this is actually intentional.
When we first introduced seasonality, users didn't configure `days_back` accordingly and kept seeing "not enough data to calculate an anomaly" results.
We went with a naive solution - multiplying `days_back` when a seasonality is configured.
I believe the more mature approach would be to increase the training set size only if needed, and only as far as the minimal training set size. For example -
For this configuration:

```yaml
days_back: 60
seasonality: day_of_week
min_training_set_size: 14
```
We should not naively do `days_back * 7`, but rather:

```
if min_training_set_size * 7 > days_back:
    days_back = min_training_set_size * 7
# i.e. days_back = max(days_back, min_training_set_size * 7)
```
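In runnable form, the proposal might look like this (the function name and signature are hypothetical, not the actual package code):

```python
from typing import Optional

def effective_days_back(days_back: int, min_training_set_size: int,
                        seasonality: Optional[str]) -> int:
    # day_of_week and hour_of_week both repeat on a 7-day cycle, so
    # min_training_set_size points per bucket require that many weeks of data.
    if seasonality in ("day_of_week", "hour_of_week"):
        # Expand the window only if needed to reach the minimal training set size.
        return max(days_back, min_training_set_size * 7)
    return days_back

# For the configuration above: the window grows to 14 * 7 = 98 days,
# not 60 * 7 = 420.
print(effective_days_back(60, 14, "day_of_week"))  # 98
```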
@ellakz @dapollak what do you think?
@Maayan-s ah, I see. But one easy thing to fix: the docs you pictured above say this applies only to `day_of_week`, not `hour_of_week`.
> I believe the more mature approach would be to increase the training set size only if needed, and only to reach the minimal training set size.

I agree with this.
But regarding this proposal:

> ```
> if min_training_set_size * 7 > days_back:
>     days_back = min_training_set_size * 7
> ```
I disagree. I'm pretty sure most users think of `min_training_set_size` as a scalar that the elementary package will never change. For example, if I set `min_training_set_size: 70`, `days_back: 70`, and `seasonality: day_of_week`, I expect elementary to run the test using 70 days (i.e. 10 weeks) of data.

It could be made clear in the docs that this isn't how `min_training_set_size` works, but not everyone reads the docs for every parameter they use, particularly when the name feels self-explanatory.
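To make the two readings concrete with that configuration (plain arithmetic, not elementary code):

```python
# days_back=70, seasonality=day_of_week, min_training_set_size=70
days_back = 70
min_training_set_size = 70

# Reading 1 (my expectation): the scalar is never changed; the test runs
# over 70 days, so each day-of-week bucket gets 70 // 7 training points.
points_per_bucket = days_back // 7
print(points_per_bucket)  # 10

# Reading 2 (the proposal): the window is expanded so that each bucket
# gets min_training_set_size points, i.e. 70 * 7 = 490 days of data.
expanded_days_back = max(days_back, min_training_set_size * 7)
print(expanded_days_back)  # 490
```

Under reading 2, the same configuration scans seven times more history than I would expect.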