dbt-core
[CT-228] [Feature] Allow tests to warn/fail based on percentage
Is there an existing feature request for this?
- [X] I have searched the existing issues
Describe the Feature
Add the ability for the warn/fail threshold to be a percentage instead of just a fixed number. We will likely need to add a definition of the total used to calculate the percentage. Original request: #4334
- name: my_view
  columns:
    - name: id
      tests:
        - unique:
            config:
              warn_if: "> 10%"
              fail_if: "> 20%"
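For context, dbt's test severity configs already support count-based thresholds (warn_if / error_if, both defaulting to != 0), so this proposal extends the existing comparison syntax rather than inventing a new one:

- unique:
    config:
      warn_if: ">10"
      error_if: ">20"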
Describe alternatives you've considered
No response
Who will this benefit?
No response
Are you interested in contributing this feature?
No response
Anything else?
No response
If this is not already earmarked for development, I'd be interested (with some guidance) in contributing this.
@jtcohen6 is this something I could take on, do you think?
Hey @jaypeedevlin sorry for getting back to you late! We haven't scheduled when to develop this yet. Thanks for the interest!! Your contribution is greatly welcomed! I can work with you if you have any questions.
I am fairly new to this area. @jtcohen6, I took a look at the implementation; maybe we could modify it here to convert warn_if and fail_if into something that compares the fail_calc with total_number * percentage when a percentage is provided?
@jaypeedevlin I'd be very excited to have your help here!
The tricky piece is finding a way to get the total number of records (with the where filter included, I think). For example, in the case where warn_if: "> 10%" and error_if: "!= 0", we'd want something like (half-rendered pseudo-code):
select
    {{ fail_calc }} as failures,
    ({{ fail_calc }} / (select count(*) from {{ model }}) * 100) > 10 as should_warn,
    {{ fail_calc }} != 0 as should_error
from (
    {{ main_sql }}
    {{ "limit " ~ limit if limit != none }}
) dbt_internal_test
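For concreteness, a fully rendered version for a not_null test on a hypothetical my_view.id (assuming the default fail_calc of count(*); on warehouses with integer division, the ratio would also need a 1.0 * multiplier) might look like:

select
    count(*) as failures,
    (count(*) / (select count(*) from my_view) * 100) > 10 as should_warn,
    count(*) != 0 as should_error
from (
    -- main_sql for not_null: the failing rows
    select * from my_view where id is null
) dbt_internal_test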
The challenges I ran into when trying to do this in the past:
- We don't have {{ model }} available in the materialization context, in the same way that we have it as an argument to the generic test macro. In the materialization, model refers to the test node, not the model it's selecting from.
- We could hack this by duplicating logic with {{ get_where_subquery() }} and model.refs[0][0] ... which is gross :(
It's A Known Thing that we need a better way to store, for each test node, the model the test is defined on, so that it can be accessible in lots of other places, including in the test materialization. Right now it's just inferred from the test's dependencies. I'd be open to the Language team's input on the best way to store that.
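For illustration only, the hack described above might look something like this inside the test materialization (a rough sketch; the model.refs indexing is an assumption, and its shape has changed across dbt versions):

{# sketch of the hack: resolve the tested model from the test node's first
   ref, then reuse get_where_subquery() to apply any configured where filter #}
{%- set tested_model = ref(model.refs[0][0]) -%}
{%- set total_count_expr -%}
    (select count(*) from {{ get_where_subquery(tested_model) }})
{%- endset -%}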
Hey all - I was looking to do this, and saw that there's a WIP at #5172 based on statically checking for a literal percent sign and requerying the raw model to compute the percentage when that's the case.
However, my first instinct when I envisioned this feature was that fail_calc should simply be able to be a computation over all rows -- those that match the test condition and those that don't -- and that rows that don't match the condition should be signified by a specially-named boolean column. If that were possible, then a percentage-based error case could be defined as something like fail_calc: 1.0 * count(dbt_test_failed) / count(*) without a special one-off implementation in dbt-core. On top of being simpler, it would also open up a variety of other use cases that involve comparing failure cases to the broader dataset -- e.g. outlier detection, comparison to rolling averages via window functions, comparison within a windowed partition group, etc.
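In yml, that might look like the following (hypothetical, since it assumes the dbt_test_failed column discussed below already exists):

- not_null:
    config:
      fail_calc: "1.0 * count(dbt_test_failed) / count(*)"
      warn_if: "> 0.1"
      error_if: "> 0.2"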
Would there be a straightforward way to refactor default dbt test queries so that:
- rather than (current behavior): failing cases being represented by "all rows that make it to the final result",
- instead (proposed behavior): failing cases are represented by "all rows in the final result that have [a statically-defined magic column named e.g.] dbt_test_failed not equal to null [or maybe equal to true? I'm not sure which makes more sense]"?
Hmm, though that breaks backcompat -- maybe add a test config called +fail_calc_includes_passing_rows that defaults to false? Then by default, count(*) and the like will continue to work, and in order to do calculations that include passing rows (such as the 1.0 * count(dbt_test_failed) / count(*) example above), you'd need to opt in via +fail_calc_includes_passing_rows: true on the test.
@vergenzt If I understand you right, you're thinking that the not_null test should evolve from:
select * from {{ model }}
where {{ column_name }} is null
To:
select
    case when {{ column_name }} is null then true else false end as dbt_test_failed
from {{ model }}
And the fail_calc, instead of being count(*), would be count(dbt_test_failed).
In order to calculate percentages, the fail_calc would simply evolve to 1.0 * count(dbt_test_failed) / count(*), or even (in the case of more-custom tests) to 1.0 * sum(case when dbt_test_failed then some_value else 0 end) / sum(some_value).
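One subtlety with count(dbt_test_failed): count() skips nulls but does count false values, so if the column is false (rather than null) for passing rows, those rows would be counted too. Either passing rows should yield null, or the fail_calc should count failures explicitly, e.g.:

-- explicit variant that works whether dbt_test_failed is false or null
-- for passing rows
1.0 * sum(case when dbt_test_failed then 1 else 0 end) / count(*)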
That's a neat idea! I think the SQL may be a bit trickier to write for unique, but for not_null and accepted_values it's intuitive enough. We'd need to rethink --store-failures very slightly; rather than saving the whole test query result to the database (i.e. the whole table), it should probably just save:
select * from (
    {{ test_query }}
)
where dbt_test_failed = true
@jaypeedevlin @ehmartens Curious to hear your thoughts here as well!
@jtcohen6 yep, that's what I'm thinking! Though the example new SQL could be even simpler: select ({{ column_name }} is null) as dbt_test_failed from {{ model }}. 🙂
If there were a way to isolate tests' condition expressions then this could be even more generic:
{%- set test_condition_expr = ... %} {#- assume condition sql is in scope; maybe via macro arg somehow? 🤷 #}
...
{%- if config.get('fail_calc_includes_passing_rows') %}
select *, ({{ test_condition_expr }}) as dbt_test_failed from {{ model }}
{%- else %}
select * from {{ model }} where ({{ test_condition_expr }})
{%- endif %}
Not sure how we'd get access to that expression in a backwards-compatible way though, given it looks like tests are currently written as full SQL select statements. 🤷
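One hypothetical shape for that (every name here is invented for illustration, not an existing dbt API): generic tests could register only their condition expression via config and delegate the select to shared scaffolding:

{# hypothetical: the test stores its condition and delegates the query;
   render_test_query is an invented helper, not an existing dbt macro #}
{% test not_null(model, column_name) %}
    {{ config(test_condition_expr = column_name ~ ' is null') }}
    {{ render_test_query(model) }}
{% endtest %}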
Also, test_unique could simply be:
 select
     {{ column_name }} as unique_field,
-    count(*) as n_records
+    count(*) as n_records,
+    count(*) > 1 as dbt_test_failed
 from {{ model }}
 where {{ column_name }} is not null
 group by {{ column_name }}
-having count(*) > 1
> Though the example new SQL could be even simpler

Heh, I opted for verbose-and-clear over nice-and-concise, but I agree that I like yours better :)
Another way toward backwards compatibility: as each generic test converts its SQL, it reconfigures its fail_calc accordingly:
{% macro default__test_unique(model, column_name) %}

    {{ config(fail_calc = 'count(dbt_test_failed)') }}

    select
        {{ column_name }} as unique_field,
        count(*) as n_records,
        count(*) > 1 as dbt_test_failed
    from {{ model }}
    where {{ column_name }} is not null
    group by {{ column_name }}

{% endmacro %}
At some point (dbt v2), we reset the default fail_calc from count(*) to count(dbt_test_failed). In the meantime, though, it would be up to the user to know which generic tests could now handle a percentage-based fail_calc.
I have a conjecture that it would be possible to do this in a backwards-compatible way, without requiring changes to the many, many custom generic tests already out in the world, which this issue is too narrow to contain. I'll leave it as an exercise to someone else, for now, to find a way to write some even-cleverer SQL, perhaps using (gulp) a correlated subquery. Doesn't mean that's the way we ought to go.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Still interested in this! (Just commenting to dismiss Stalebot)
I'm curious if there's been any updates on this functionality?
This would be very useful. Is it still being worked on?
Agreed that this would be very useful!
Is this still being worked on? Would be a great feature to have.
I abandoned work on this, unfortunately
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Still think this would be a very convenient feature
Bumping the thread, because this would be extremely useful.
Bumping again, would be a great feature to have!
Hi all, I found a workaround for this by writing a generic test in dbt and then referencing it in the yml files like any other ordinary test:
This one implements a relationships test with a 1% error threshold, and works on BigQuery. Feel free to change it as you see fit.
{% test relationships_with_1_pct_error_threshold(model, column_name, field, to) %}

with parent as (
    select
        {{ field }} as id
    from {{ to }}
),

child as (
    select
        {{ column_name }} as id
    from {{ model }}
),

error_threshold as (
    select
        cast(count(distinct id) * .01 as int) as num
    from parent
)

select distinct id
from child
where id is not null
    and id not in (select id from parent)
    and (
        select count(distinct id)
        from child
        where id is not null
            and id not in (select id from parent)
    ) > (select num from error_threshold)

{% endtest %}
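It would then be attached in schema yml like the built-in relationships test (model and column names here are made up):

models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships_with_1_pct_error_threshold:
              to: ref('customers')
              field: id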
Adding to the bump - we've implemented a custom generic "percent-based" test macro in our shop, but having this supported by the language (or giving us additional hooks / a more standard generic interface!) would also do wonders.
Bumping again, hopefully someone is working on it
Also bumping. It's very frustrating to have to write a custom test when we need the error threshold to be based on a percentage of the total number of rows in the model. This seems like something that should be baked in.