[CT-711] [Draft] Add Structured Logs to Materializations
Motivation
Today, run_results.json is a static artifact, written at the end of a run, that contains a summary of the objects dbt created and updated along with an incomplete view of the SQL from each materialization. This information should be more granular and visible during a run. Structured logging is a formal interface for programmatically consuming metadata about a dbt run, and is where this more granular information will be streamed to outside consumers.
Context
The materialization macro and the adapter functions it calls are natural places to put these log lines. However, many adapters override materializations directly (i.e. snowflake, bigquery, spark), and others override key adapter functions such as adapter.execute (i.e. bigquery), which would require these and future adapter maintainers to remember to copy our log lines exactly. Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid, which is why this ticket puts these log lines before and after the call to execute in statement.sql (source). Since statement.sql is not a Python file, structured logging must be accessible from Jinja, and today it is not.
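As a point of reference, here is a minimal sketch, using plain jinja2, of what exposing a Python callable to templates looks like. dbt-core builds its macro context through its own machinery, so the function name and template body below are illustrative assumptions, not the actual wiring:

```python
# Minimal sketch with plain jinja2: expose a Python callable to templates.
# dbt-core wires its macro context differently; this only shows the principle.
from jinja2 import Environment

def log_event(event_name, values):
    # Stand-in for real structured-event dispatch.
    print(f"[event] {event_name}: {values}")

env = Environment()
env.globals["log_event"] = log_event  # now callable from any template/macro

template = env.from_string("{{ log_event('SQLQueryStart', [node, sql]) }}")
template.render(node="my_model", sql="select 1")
```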
Implementation
- [ ] Create a new function in the Jinja context, `log_event`, like `log` (source), so that instead of taking a string `msg`, it takes a string `event_name` and a list of values of type `List[Any]`. This will fire an event with the same name as the string passed, constructed with the parameters in the list. (Using `Event.__subclasses__()` and the splat operator for lists (i.e. `*[foo, bar]`) may help here; see the sketches after this list.) Mypy will not be able to catch any type mismatches here, since the event constructors aren't passed concrete values until runtime. Since these log lines will run with every materialization, tests should catch any issues with them.
- [ ] Update `SQLQueryStatus` to take in the full `adapter_response` object and the node name. Today we stringify the adapter response first, making it much harder to consume. Special attention will have to be paid to serialization, since each warehouse puts different values into its own responses. Using examples from today's warehouses as inputs to serialization tests is a good idea. The name of the currently running node is accessible via the "jinja god context" (TODO: link an example of how to access the jinja god context). (TODO: `adapter.execute` is called in other places too. We should add log lines there too. Link exact lines here.)
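A minimal sketch of the dispatch-by-name idea from the first checkbox. The base class, event type, and `fire_event` stand-in below are illustrative, not dbt-core's actual event module:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Event:
    """Stand-in for dbt's structured-logging event base class."""

@dataclass
class SQLQueryStart(Event):  # illustrative event type
    node: str
    sql: str

def fire_event(event: Event) -> None:
    # Stand-in for dbt's real event dispatch.
    print(f"fired {type(event).__name__}: {event}")

def log_event(event_name: str, values: List[Any]) -> None:
    # Look up the event class by name and splat the values into its
    # constructor. Mypy can't type-check this call, since concrete values
    # only arrive at runtime; materialization tests have to cover it.
    lookup = {cls.__name__: cls for cls in Event.__subclasses__()}
    fire_event(lookup[event_name](*values))

log_event("SQLQueryStart", ["my_model", "select 1"])
```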
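And a sketch of the second checkbox: carrying the full adapter response (rather than its stringified form) plus the node name on the event. The field names here are assumptions; real adapter responses vary by warehouse, which is exactly why serialization tests seeded with real warehouse examples matter:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional

@dataclass
class AdapterResponse:
    # Stand-in for dbt's adapter response; warehouses add their own fields.
    _message: str
    code: Optional[str] = None
    rows_affected: Optional[int] = None
    extras: Dict[str, Any] = field(default_factory=dict)

@dataclass
class SQLQueryStatus:
    # Full response object instead of str(response), plus the node name.
    node: str
    response: AdapterResponse

    def to_dict(self) -> Dict[str, Any]:
        # Serialization hook; per-warehouse extras are where this gets tricky.
        return {"node": self.node, "response": asdict(self.response)}

status = SQLQueryStatus(
    node="my_model",
    response=AdapterResponse(_message="SUCCESS 1", rows_affected=1),
)
print(status.to_dict())
```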
> Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid
Agree in principle. I'm on the fence about sticking this in the statement macro, vs. just expanding the SQLQuery event fired in add_query. This would indeed require some work in dbt-bigquery's custom execute, which doesn't feel great, but... almost all adapters out in the world inherit from the "SQL" adapter. [Update: I remembered that dbt-snowflake also overrides add_query, though it does ultimately call super().add_query as well.]
Definitely curious to hear Gerda's thoughts on how to factor this code. Also tagging @dataders for visibility / food for thought: To what extent must we shy away from high-impact logging, or other important metadata, living in the adapter?
> TODO: `adapter.execute` is called in other places too. We should add log lines there too. Link exact lines here.
The important ones I can find:
- `run_hooks` in the `run` task (as in, `on-run-start` / `on-run-end` hooks)
- `SqlExecuteRunner`, used by `dbt.lib.execute_sql`
We use it in our functional test utilities as well, here and here
Logging relation info as soon as it's built
This was another thing we discussed earlier today, related to this effort but possibly its own ticket
Let's add structured logging about the relations produced by the materialization, as soon as it finishes building them (here). Materializations already return the set of relations they create/update, for the purposes of updating dbt's cache. Why not share the wealth with programmatic consumers of dbt metadata?
(Will serializing Relation objects be an absolute nightmare? Relation objects can be reimplemented by adapter, of course, though they all inherit from BaseRelation, which should be serializable. Even so, we may not want all the object attributes included in the logging event—probably just a subset.)
For now, the only really valuable information included on the relation object is database location (database.schema.identifier) and relation type (view, table, etc). However, I could see doing two things to make this very valuable, for which this logging lays necessary groundwork:
- Additional fields like `columns` (with data types) and table statistics (maybe even column-level stats, too) — a.k.a. the same basic contract as `CatalogInfo`
- Steps at the end of each materialization that `describe` the just-built relation, to populate those fields, which will then be logged out once the materialization completes
Put it all together, and we'll be able to provide realer-time access to catalog info, rather than trying to grab it all in one big memory-intensive batch during docs generate.
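On the serialization question above, a minimal sketch of the whitelisted-subset approach. The attribute names mirror BaseRelation's path and type fields, but treat the event shape as an assumption rather than dbt-core's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelationLogInfo:
    # The whitelisted subset we'd put on the logging event, rather than
    # attempting to serialize the whole (adapter-specific) Relation object.
    database: Optional[str]
    schema: Optional[str]
    identifier: Optional[str]
    relation_type: Optional[str]  # "view", "table", ...

def relation_log_info(relation) -> RelationLogInfo:
    # `relation` is assumed to expose BaseRelation-style path/type attributes.
    return RelationLogInfo(
        database=relation.database,
        schema=relation.schema,
        identifier=relation.identifier,
        relation_type=str(relation.type) if relation.type else None,
    )
```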
I've come around on this, given the way that we've implemented Python models on some adapters. The statement macro has become the "gateway" for both SQL + Python code, and it will call either adapter.execute or adapter.submit_python_job (on some adapters). Rather than require logging in both of those methods, trusting adapter maintainers to implement it for us, we should have that logging in one place — statement (sketched after the list below).
We'd still want to handle cases where:
- We hook into `adapter.execute` directly today (linked above).
- It's theoretically possible to call `adapter.execute` directly from the Jinja context, bypassing the `statement` / `run_query` situation — though it's not something we document.
- It's theoretically possible for an end user to override the `statement` / `run_query` macro... though I think a LOT of things would break :)
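A rough sketch of the single-gateway idea, in Python pseudocode. The real choke point is the Jinja statement macro, the signatures here are simplified (submit_python_job takes more arguments on real adapters), and the event names are placeholders:

```python
def run_statement(adapter, node_name, compiled_code, language, log_event):
    # One choke point for both execution paths, so structured logs fire
    # regardless of which adapter method ends up doing the work.
    log_event("SQLQueryStart", [node_name, compiled_code])
    if language == "python":
        # Only some adapters implement this; the statement macro dispatches.
        response = adapter.submit_python_job(compiled_code)
    else:
        response, _table = adapter.execute(compiled_code, fetch=False)
    log_event("SQLQueryStatus", [node_name, response])
    return response
```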
For Python models, we currently add a decorator to log things.
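For context, a hedged sketch of what such a decorator could look like; the names and events are illustrative, not the actual dbt implementation:

```python
import functools

def log_python_job(fn):
    # Wrap the Python-model entry point so events fire around execution.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"[event] PythonJobStart: {fn.__name__}")    # stand-in dispatch
        try:
            return fn(*args, **kwargs)
        finally:
            print(f"[event] PythonJobFinished: {fn.__name__}")
    return wrapper

@log_python_job
def model(dbt, session):
    # Typical dbt Python-model signature; body elided.
    ...
```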
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Assigning myself to reopen as 2-3 separate issues, and then close this one
Closing in favor of #6731, #6732, #6733