
[Feature] Include column-level information in dbt's adapter cache, for faster get_columns_in_relation


Is this your first time submitting a feature request?

  • [X] I have read the expectations for open source contributors
  • [X] I have searched the existing issues, and I could not find an existing issue for this feature
  • [X] I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I have some macros that rely heavily on get_columns_in_relation. They operate at the column level and make many calls against the same relation to get information about its columns. In the end there are thousands of calls to get_columns_in_relation, even though there are only 10-100 distinct relations. Since these calls are quite expensive, it would be great to have a shared in-memory cache that survives individual macro calls.

It could be something like a key/value store: store the return value of get_columns_in_relation for a specific relation, and on the next macro call check whether the value is already there, only calling get_columns_in_relation if it isn't.
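
To illustrate the idea (in Python rather than Jinja, and with a hypothetical `get_columns_cached` helper that is not an existing dbt API), it would behave roughly like this:

```python
# Hypothetical sketch of the requested behaviour: memoize the result of
# get_columns_in_relation per relation for the duration of one invocation.

from typing import Dict, List

column_cache: Dict[str, List] = {}  # key: relation name, value: column list


def get_columns_cached(adapter, relation) -> List:
    """Only hit the warehouse the first time a relation's columns are requested."""
    key = str(relation)  # assumes str(relation) is a stable, fully qualified key
    if key not in column_cache:
        column_cache[key] = adapter.get_columns_in_relation(relation)
    return column_cache[key]
```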

Describe alternatives you've considered

We could call get_columns_in_relation in fewer places and pass the result through the codebase, but that would add a lot of complexity and isn't a good tradeoff for the performance gain.

Who will this benefit?

everyone who loves a performant dbt

Are you interested in contributing this feature?

No response

Anything else?

No response

OneCyrus avatar Feb 17 '24 11:02 OneCyrus

> everyone who loves a performant dbt

:heart:

I would love to support caching at the column level. It's not just custom user macros that make heavy use of get_columns_in_relation; our own materializations do as well (especially snapshots).

We already have a "relational" cache, stored on the adapter for the duration of an invocation, and it would be possible to extend this cache to also include information about column names & data types.

However: That alone is not enough. We also need to find & update all macros that change the columns in a relation (adding, removing, resizing, ...) — those macros either need to update the cache (better), or clear it (acceptable) — same as we do for materializations that create relations, or methods that drop them.
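
For illustration, a minimal sketch of what that write-through / invalidation discipline could look like (none of these names are real dbt-core APIs; they only show that every column-mutating path has to touch the cache):

```python
# Hypothetical sketch only: every operation that changes a relation's columns
# must either update the cached entry (better) or drop it (acceptable).

class ColumnCache:
    def __init__(self):
        self._columns = {}  # relation key -> list of column objects

    def get(self, relation):
        return self._columns.get(str(relation))

    def set(self, relation, columns):
        self._columns[str(relation)] = columns

    def invalidate(self, relation):
        # "clear" option: cheap to implement, forces re-introspection next time
        self._columns.pop(str(relation), None)


def add_column(adapter, cache, relation, name, data_type):
    adapter.execute(f"alter table {relation} add column {name} {data_type}")
    cached = cache.get(relation)
    if cached is not None:
        # "update" option: keep the cache consistent without another query
        cache.set(relation, cached + [(name, data_type)])
    # alternatively: cache.invalidate(relation)
```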

If we start caching column metadata before we've done that, we will end up with incorrect results, which is what happened when we experimented with it in dbt-spark a few years ago:

  • https://github.com/dbt-labs/dbt-spark/issues/431
  • https://github.com/dbt-labs/dbt-spark/issues/447

So - I think it's doable, I think it's interesting, I think it will first require a full accounting of all the places where we (potentially) mutate the columns of a model. Who's interested in a little bit of spelunking? :)

jtcohen6 avatar Feb 20 '24 13:02 jtcohen6

@jtcohen6 great to hear :)

Just to understand the problem space better: in which use cases do schemas change retroactively? I thought dbt materialization strictly follows the directed acyclic graph, and once a model is materialized its schema is immutable for all downstream dependencies, so we shouldn't face a changing schema within a dbt run.

Or does this happen when we request the schema of a model that isn't materialized yet? (Does that even work?)

OneCyrus avatar Feb 20 '24 13:02 OneCyrus

@jtcohen6, I'm sort of interested in diving in, but I'm not clear on the scope. Does "all the places" include:

  1. adapters?
  2. first party dbt packages?
  3. third party dbt packages on package hub?

mike-weinberg avatar May 18 '24 07:05 mike-weinberg

@OneCyrus This came up in the dbt-spark example because we cached the columns of an incremental model at the start of the run / at the start of its materialization, then modified the model's columns (alter statements for schema evolution as part of the on_schema_change config), then introspected the columns again (a cache lookup) to template out the subsequent insert statement, and got the wrong values. The problem was that the alter statement didn't update or invalidate the cache.
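
Roughly, the failure sequence looked like this (pseudocode only; `get_columns_cached` is the hypothetical helper sketched earlier in this thread, not dbt-spark's actual code):

```python
# Hypothetical replay of the dbt-spark failure mode described above.

columns = get_columns_cached(adapter, relation)   # warehouse hit; cache now holds [a, b]

# on_schema_change='append_new_columns' adds a column, but nothing touches the cache
adapter.execute(f"alter table {relation} add column c int")

columns = get_columns_cached(adapter, relation)   # cache hit; still returns [a, b]
# the insert statement templated from `columns` silently omits column c
```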

@mike-weinberg In theory, all three; in practice, primarily adapters or packages that roll custom materializations, since that tends to be where dbt is modifying the structure of DWH objects (more than just introspecting the structure/content).

To be clear, this would be spelunking without guarantee of striking gold. I'm not convinced that our current caching patterns — populating a bunch of batch information upfront that we can get easily/cheaply (show-type statements), then using the cache to definitively answer simple positive/negative questions about which objects exist — will scale well to additional information that's less cheap/fast to batch up.

To simplify scope, I might start with:

  • just Snowflake
  • just with the goal of making get_columns_in_relation reliably return a cached value if called multiple times in sequence
  • with cache updates/invalidation when the adapter knows it's evolving the schema

jtcohen6 avatar May 20 '24 08:05 jtcohen6