dbt-core
[CT-86] [Feature] Write All Store Test Failures Into One Table with an Unstructured JSON/VARIANT/SUPER Column
Is there an existing feature request for this?
- [X] I have searched the existing issues
Describe the Feature
Right now, store-failures in dbt writes out the rows returned for each failed test into a separate table if `--store-failures` is specified. This feature is definitely helpful from a developer perspective: you can immediately see which rows failed a test by querying the returned table.
It is somewhat less useful in its current incarnation in a production environment where jobs and tests may run many times per day and the operations team would like to have a unified, easily queryable database-persisted record of all tests. Even better perhaps would be a more automated way of pushing test failures back to users.
One way of doing this would be to create one table where all test failures are written to. The big obstacle to this on some platforms, such as BigQuery, has been the lack of a semi-structured native table/column format. Snowflake already has VARIANT and Redshift now has SUPER. This is important because different failed tests on different tables are naturally going to have different schemas that can't easily be reconciled in one table where columns have to be declared ahead of time.
BigQuery has now just added a JSON datatype: https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data
What if we now had a table with a set of columns where all test failures would be stored:
- Failed Test Name (this is where being able to explicitly name tests rather than use the default generated name gets really interesting)
- Failed Test Model Name
- Failed Test Timestamp
- Failed Test Owner
- Failed Test Description (I'm less sold on this column, maybe it's overkill)
- Failed Test Data - Where the entire row which failed gets inserted as JSON for future querying as needed
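As a rough sketch only, the proposed centralized table might be declared like this on BigQuery (the table and column names here are illustrative, not part of any existing dbt feature):

```sql
-- Hypothetical centralized test-failure table (BigQuery syntax);
-- all names are illustrative only.
CREATE TABLE IF NOT EXISTS audit.test_failure_central (
  failed_test_name      STRING,     -- explicit test name, if one was given
  failed_test_model     STRING,     -- model the test ran against
  failed_test_timestamp TIMESTAMP,  -- when the test run finished
  failed_test_owner     STRING,     -- e.g. from the test's meta config
  failed_test_data      JSON        -- the entire failed row, stored as JSON
);
```

The JSON column is what sidesteps the schema-reconciliation problem: every test can insert rows of a different shape into the same table.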
Maybe even more interesting would be the ability to group different failed tests into different long-term tables for storage. Something like this in dbt_project.yml:
```yml
tests:
  +store_failures: true
  my_project:
    folder_1:
      +table_for_failed_tests_for_this_folder: really_sensitive_test_failures_table
    folder_2:
      +table_for_failed_tests_for_this_folder: not_so_sensitive_test_failures_table
```
Then I could go back and write a report which issues a query to the database like this:
```sql
SELECT *
FROM not_so_sensitive_test_failures_table
WHERE owner = '[email protected]'
  AND failed_test_timestamp > (CURRENT_TIMESTAMP - INTERVAL 24 HOUR)
```
What say ye?
Describe alternatives you've considered
Writing our own macros to parse INFORMATION_SCHEMA to find failed tests in the schema that dbt creates for storing failures and then unifying everything that dbt had put in there into one JSON-style table. But doing it right in dbt would be a lot more elegant.
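For reference, that alternative amounts to something like the following BigQuery query (a sketch; `my_project` and `analytics` are placeholders, and the `_dbt_test__audit` suffix is dbt's default naming for the stored-failures schema):

```sql
-- Sketch of the INFORMATION_SCHEMA approach (BigQuery):
-- enumerate the per-test failure tables dbt created, then union them by hand.
SELECT table_name, creation_time
FROM `my_project.analytics_dbt_test__audit.INFORMATION_SCHEMA.TABLES`
ORDER BY creation_time DESC;
```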
Who will this benefit?
Anybody trying to store test failures over time and then push out responsibility for correction outside of the data engineering team.
Are you interested in contributing this feature?
Yes - I poked around in the dbt-core code for 10 minutes and didn't see an obvious place to start editing, but I'm definitely interested.
Anything else?
No response
Another possible related thought on this. What if the table into which test results are stored actually became a node in the DAG? Given that tests are already nodes in the DAG, these tables/models would become nodes downstream of each test. It would then be possible to have models built on top of these failure storage tables, which would be very helpful for the purposes of data quality management.
That idea could be combined with my original proposal in this ticket to allow larger combined tables, with the semi-structured format suggested, to then be used by downstream analytics. If the failure storage tables were treated similarly to snapshots, they also wouldn't be truncated/dropped-and-recreated on the next dbt run. That would make it easier to see, when dbt is re-run, whether the same error is repeated or no longer exists.
Hi @codigo-ergo-sum! Thanks for such a well thought out feature request!
A few ideas:
- A lot of this can be achieved via the cloud metadata API. The main thing missing in that approach is the actual content of the stored failure rows, so the feature here would be to provide a linking id from the metadata API to the stored test failures. It's also only valid if you use cloud :)
- Another approach would be to use `on-run-end` hooks + the results object. This would require some macro writing, but it's fairly straightforward to implement.
- All of that said, some version of this idea has been coming up quite a bit recently, so I'm going to open a new ticket (scoped slightly smaller) to give some careful consideration to how we handle `store-failures` -- we can certainly make it more useful for this kind of thing.

Would either of those first two ideas be suitable (or close to suitable) for your use case?
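For the second idea, the wiring in `dbt_project.yml` is a one-liner (sketch only; `centralize_test_failures` is a user-written macro name, not something dbt ships):

```yml
# dbt_project.yml -- pass the run's results object to a user-defined macro
on-run-end:
  - "{{ centralize_test_failures(results) }}"
```

The macro itself then iterates over `results` to find failed tests and writes them wherever you like.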
Oh, and regarding adding the stored failures as a node in the DAG-- I love the idea! We'll have to see where #4624 takes us
FYI we are getting around this by setting up the returned failed-test record columns to be nearly identical, with only minor variations between them (also using BQ and batched runs). That way we can simply run a `SELECT * FROM project.dataset.*`
I agree would be awesome to have this in the DAG and to be able to combine and store failed records from batch runs incrementally.
I'm going to try using on-run-end post-hook combined with a macro, and then we also just got JSON support enabled for our BQ environment today (it's a special thing that has to be turned on by request.) So I'll report back :).
@codigo-ergo-sum how'd you go on your test for this? We're looking to start cleaning up our schemas and having something like this would be incredibly valuable :)
I started storing all our testing failures, but we use it sparsely, so it's getting hard to traverse the schema it stores them to :(
So I did this initial take for BigQuery, based on code in this issue (which was for Snowflake): https://github.com/dbt-labs/dbt-core/issues/4099:
```sql
{% macro centralize_test_failures(results) %}

  {%- set test_results = [] -%}

  {%- for result in results -%}
    {%- if result.node.resource_type == 'test' and result.status == 'fail' and (
          result.node.config.get('store_failures') or flags.STORE_FAILURES
        ) -%}
      {%- do test_results.append(result) -%}
    {%- endif -%}
  {%- endfor -%}

  {%- set central_tbl -%}{{ target.schema }}.test_failure_central{%- endset -%}

  {{ log("Centralizing test failures in " + central_tbl, info = true) if execute }}

  create or replace table {{ central_tbl }} (test_name STRING, test_failures_json JSON, test_ts TIMESTAMP) as (
    {% for result in test_results %}
      select
        '{{ result.node.name }}' as test_name,
        to_json(t) as test_failures_json,
        current_timestamp() as test_ts
      from {{ result.node.relation_name }} as t
      {{ "union all" if not loop.last }}
    {% endfor %}
  )

{% endmacro %}
```
A few thoughts:
- We have about 3700 tests in our project. The original version of this macro didn't select only the tests that failed, which means a very, very long SQL query unioning all 3700 tables even though we only want the ones with rows. BigQuery actually gives an error and won't run the query if I run all the tests in the project; this is the error message:

  > The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.

  So I changed `result.status != 'skipped'` to `result.status == 'fail'`, which trimmed things down. In projects that have a lot of failing tests, though, the problem will still arise, so longer term we'll likely need some method of batching a certain number of the failed-test SQL statements together and then appending to the table.
- Some tests don't seem to actually record the whole contents of the row, which is what we want. In particular, I noticed that `accepted_values` (part of dbt core) and also `dbt_utils.not_accepted_values` don't record the whole row. I'll have to follow up on this.
- I'm interested in capturing the "owner" field from the meta object so we could start filtering different tests and exposing them to different groups that way. But I think there were some issues in getting the meta object data from the `Results` object. I'll have to go back and look more at that.
- I'm not crazy about having this be part of an `on-run-end` hook. We want our full DAG to run and not stop when tests fail. So we do this:

  ```shell
  dbt --warn-error --no-partial-parse build --full-refresh --exclude test_type:schema test_type:data && \
  dbt test
  ```

  So we'd have the `on-run-end` hook run as part of the first step and then again as part of the second. We don't really want that; we just want the tests to be collected and stored as part of the second step. I'd almost rather have this code run as a model rather than an `on-run-end` hook, but I'm not sure I'd be able to capture the results correctly then. Running it as a model would also let us have other models downstream which look at this output data and perhaps do further transformations on the aggregated test data. This would fit nicely with having the `--store-failures` tests themselves be nodes in the DAG; it could all be more cleanly orchestrated.
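On the query-length problem, one possible direction (an untested sketch, using Jinja's built-in `batch` filter and dbt's `run_query`; variable names follow the macro above) would be to issue one insert per group of failed tests instead of a single giant union:

```sql
{# Sketch: assuming the table already exists, append failures in
   batches of 50 tests so no single statement exceeds BigQuery's
   query-length limit. #}
{% for group in test_results | batch(50) %}
  {% set insert_sql %}
    insert into {{ central_tbl }} (test_name, test_failures_json, test_ts)
    {% for result in group %}
      select '{{ result.node.name }}', to_json(t), current_timestamp()
      from {{ result.node.relation_name }} as t
      {{ "union all" if not loop.last }}
    {% endfor %}
  {% endset %}
  {% do run_query(insert_sql) %}
{% endfor %}
```

The batch size of 50 is arbitrary; it would need tuning against the 1024K-character limit mentioned above.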
Time permitting, I'll try to poke at this more next week.
+1, this is the correct way of storing test results, instead of the current one.
- an option to limit the number of rows saved could be helpful, like saving only 10 rows out of the failed test records!
- also an option to print the compiled test query could be helpful too, but only when the test fails!
> an option to limit the number of rows saved could be helpful. like save only 10 rows out of failed test records!
Have a look at https://docs.getdbt.com/reference/resource-configs/limit - I think you'll find it does what you need
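Per the `limit` docs linked above, that looks something like this in a schema file (a sketch; `my_model` and `id` are placeholders):

```yml
# schema.yml -- keep at most 10 failing rows for this test
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              config:
                limit: 10
                store_failures: true
```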
> also and option to print the compiled test query couldl be helpful too. only when the test fails!
Do you mean printing to the console, or just accessing the compiled code in general? All compiled tests are stored in your target directory where you can look at them - https://docs.getdbt.com/docs/faqs/checking-logs. If you wanted the whole compiled code to be printed to the console, I'd suggest opening a separate issue for that.
Also see my comment on this closed ticket that is definitely related: https://github.com/dbt-labs/dbt-core/issues/2593#issuecomment-651722106
@codigo-ergo-sum Have you made any more progress with this using on-run-end hook?
+1 is there any progress on this?
Given the planned introduction of unit testing in 1.8, it'd be great to see this put on the roadmap for 1.9!
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Still interested in this.
Still interested in this.
same