[CT-711] [Draft] Add Structured Logs to Materializations

nathaniel-may opened this issue 3 years ago · 3 comments

Motivation

Today, run_results.json is a static artifact written at the end of a run. It contains a summary of the objects that dbt run created and updated, along with an incomplete view of the SQL from each materialization. This information should be more granular, and it should be visible during the run rather than only at the end. Structured logging is a formal interface for programmatically consuming metadata about a dbt run, and it is where this more granular information will be streamed to outside consumers.

Context

The materialization macro and the adapter functions it calls are natural places to put these log lines. However, many adapters override materializations directly (e.g. Snowflake, BigQuery, Spark), and others override key adapter functions such as adapter.execute (e.g. BigQuery), which would require current and future adapter maintainers to remember to copy our log lines exactly. Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid, which is why this ticket puts these log lines before and after the call to execute in statement.sql (source). Since statement.sql is not a Python file, structured logging must be accessible from Jinja, and today it is not.

Implementation

  • [ ] Create a new function in the Jinja context, log_event, like log (source), except that instead of taking a string msg it takes a string event_name and a list of values of type List[Any]. It will fire the event whose class name matches the string passed, constructed from the parameters in the list. (Using Event.__subclasses__() and the splat operator for lists (i.e. *[foo, bar]) may help here; see the sketch after this list.) Mypy will not be able to catch any type mismatches here, since the event constructors aren't passed concrete values until runtime. Because these log lines will run with every materialization, tests should catch any issues with them.
  • [ ] Update SQLQueryStatus to take in the full adapter_response object and the node name (also sketched after this list). Today we stringify the adapter response first, making it much harder to consume. Special attention will have to be paid to serialization, since each warehouse puts different values into its own responses. Using examples from today's warehouses as inputs to serialization tests is a good idea. The name of the currently running node is accessible via the "jinja god context" (TODO: link example of how to access the jinja god context). (TODO: adapter.execute is called in other places too. We should add log lines there too. Link exact lines here.)
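A minimal sketch of what log_event could look like, assuming the Event base class and fire_event helper live at the import paths dbt-core used around this time; the lookup-and-splat mechanics follow the description above, not a committed design:

```python
from typing import Any, List

from dbt.events.base_types import Event        # assumed import path
from dbt.events.functions import fire_event    # assumed import path


def log_event(event_name: str, values: List[Any]) -> str:
    """Fire the structured event whose class name matches event_name.

    Mypy cannot check the constructor arguments here, because the concrete
    values only arrive at runtime; materialization tests have to catch
    mismatches instead.
    """
    # Build a name -> class lookup from the subclasses of Event.
    # (A real implementation may need to walk subclasses recursively.)
    event_types = {cls.__name__: cls for cls in Event.__subclasses__()}
    if event_name not in event_types:
        raise ValueError(f"Unknown structured event: {event_name}")
    # Splat the list of values into the event constructor, then fire it.
    fire_event(event_types[event_name](*values))
    return ""  # return a string so Jinja renders nothing
```

From statement.sql this would be invoked like any other context function, e.g. {% do log_event('SQLQueryStatus', [...]) %} (illustrative only). And a sketch of the reshaped SQLQueryStatus; any field name beyond what the bullet above describes is an assumption, and adapter_response is shown as a plain dict to sidestep the per-warehouse serialization question for now:

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SQLQueryStatus:
    node_name: str   # currently running node, from the Jinja context
    elapsed: float   # seconds
    # Full response rather than its str(); e.g. AdapterResponse.to_dict().
    adapter_response: Dict[str, Any] = field(default_factory=dict)
```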

nathaniel-may commented on Jun 01 '22

Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid

Agree in principle. I'm on the fence about sticking this in the statement macro, vs. just expanding the SQLQuery event fired in add_query. This would indeed require some work in dbt-bigquery's custom execute, which doesn't feel great, but... almost all adapters out in the world inherit from the "SQL" adapter. [Update: I remembered that dbt-snowflake also overrides add_query, though it does ultimately call super().add_query as well.]

Definitely curious to hear Gerda's thoughts on how to factor this code. Also tagging @dataders for visibility / food for thought: To what extent must we shy away from high-impact logging, or other important metadata, living in the adapter?

TODO: adapter.execute is called in other places too. We should add log lines there too. Link exact lines here.

The important ones I can find:

We use it in our functional test utilities as well, here and here

Logging relation info as soon as it's built

This was another thing we discussed earlier today; it's related to this effort but possibly its own ticket.

Let's add structured logging about the relations produced by the materialization, as soon as it finishes building them (here). Materializations already return the set of relations they create/update, for the purposes of updating dbt's cache. Why not share the wealth with programmatic consumers of dbt metadata?

(Will serializing Relation objects be an absolute nightmare? Relation objects can be reimplemented per adapter, of course, though they all inherit from BaseRelation, which should be serializable. Even so, we may not want all of the object's attributes included in the logging event, probably just a subset.)
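As a rough illustration (the event name and field layout here are hypothetical, not dbt's actual types), logging just that subset might look like:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelationBuilt:  # hypothetical event name
    database: Optional[str]
    schema: Optional[str]
    identifier: Optional[str]
    relation_type: Optional[str]  # "view", "table", etc.


def relation_built_event(relation) -> RelationBuilt:
    # BaseRelation exposes database/schema/identifier and a type; pulling
    # only those avoids serializing every adapter-specific attribute.
    return RelationBuilt(
        database=relation.database,
        schema=relation.schema,
        identifier=relation.identifier,
        relation_type=str(relation.type) if relation.type else None,
    )
```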

For now, the only really valuable information included on the relation object is its database location (database.schema.identifier) and relation type (view, table, etc.). However, I could see doing two things to make this very valuable, for which this logging lays the necessary groundwork:

  • Additional fields like columns (with data types) and table statistics (maybe even column-level stats, too), i.e. the same basic contract as CatalogInfo (see the sketch after this list)
  • Steps at the end of each materialization that describe the just-built relation, to populate those fields, which will then be logged out once the materialization completes
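If those fields were added, the richer payload might loosely mirror the CatalogInfo contract, along these (assumed) lines:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnInfo:  # hypothetical, loosely mirroring catalog column metadata
    name: str
    data_type: str
    stats: Dict[str, str] = field(default_factory=dict)  # optional column-level stats


@dataclass
class RelationBuiltDetailed:  # hypothetical extension of the event sketched above
    database: Optional[str]
    schema: Optional[str]
    identifier: Optional[str]
    relation_type: Optional[str]
    columns: List[ColumnInfo] = field(default_factory=list)
    table_stats: Dict[str, str] = field(default_factory=dict)
```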

Put it all together, and we'll be able to provide realer-time access to catalog info, rather than trying to grab it all in one big memory-intensive batch during docs generate.

jtcohen6 commented on Jun 01 '22

I've come around on this, given the way we've implemented Python models on some adapters. The statement macro has become the "gateway" for both SQL and Python code: it will call either adapter.execute or adapter.submit_python_job (on some adapters). Rather than requiring logging in both of those methods and trusting adapter maintainers to implement it for us, we should have that logging in one place: statement.

We'd still want to handle cases where:

  • We hook into adapter.execute directly today (linked above). It's theoretically possible to call adapter.execute directly from the Jinja context, bypassing the statement / run_query path entirely, though it's not something we document.
  • It's theoretically possible for an end user to override the statement / run_query macro... though I think a LOT of things would break :)

jtcohen6 commented on Jul 21 '22

For Python models, we've added a decorator to log things (sketched below).
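Something like this pattern; the event type and the fire_event stand-in below are placeholders, not dbt's actual implementation:

```python
import functools
import time
from dataclasses import dataclass


@dataclass
class PythonJobFinished:  # placeholder for a real structured-event type
    node_name: str
    elapsed: float


def fire_event(event) -> None:
    print(event)  # stand-in for dbt's structured-log pipeline


def log_python_job(fn):
    """Wrap a submit method so a structured event always fires, without
    trusting each adapter's override to add the log lines itself."""
    @functools.wraps(fn)
    def wrapper(self, parsed_model, compiled_code):
        start = time.perf_counter()
        try:
            return fn(self, parsed_model, compiled_code)
        finally:
            fire_event(PythonJobFinished(
                node_name=parsed_model.get("alias", "<unknown>"),
                elapsed=time.perf_counter() - start,
            ))
    return wrapper
```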

ChenyuLInx commented on Jul 27 '22

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented on Jan 25 '23

Assigning myself to reopen as 2-3 separate issues, and then close this one

jtcohen6 commented on Jan 25 '23

Closing in favor of #6731, #6732, #6733

jtcohen6 commented on Jan 25 '23