Clone sources

Open dbrtly opened this issue 5 months ago • 3 comments

Is this your first time submitting a feature request?

  • [X] I have read the expectations for open source contributors
  • [X] I have searched the existing issues, and I could not find an existing issue for this feature
  • [X] I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

dbt clone --target ci_env --select +modified --resource-type source --state target=prod

I want to clone sources when they exist upstream of the modified models.
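
To make the request concrete, here is a minimal Python sketch of the selection being asked for: start from the modified models, walk upstream, and keep only the sources. It is an illustration of the idea, not how dbt resolves selectors. The function `upstream_sources`, the toy DAG, and the node names are all hypothetical.

```python
from collections import deque


def upstream_sources(parents, modified):
    """Return the source nodes that are ancestors of any modified node.

    `parents` maps each node to its direct parents (hypothetical toy format).
    """
    seen = set()
    queue = deque(modified)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    # Assumption for this sketch: source node names start with "source."
    return sorted(n for n in seen if n.startswith("source."))


# Hypothetical toy DAG: child -> direct parents
toy_parents = {
    "model.orders": ["model.stg_orders"],
    "model.stg_orders": ["source.raw.orders"],
    "model.customers": ["source.raw.customers"],
}

print(upstream_sources(toy_parents, ["model.orders"]))
```

With only `model.orders` modified, the sketch picks up `source.raw.orders` and leaves `source.raw.customers` alone, which is the behaviour the proposed command is meant to express.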

Describe alternatives you've considered

  • Clone less specifically (for example, all of the raw layer)
  • Hardcode the project in the source configuration (not officially supported, IAM ramifications)

Who will this benefit?

For model testing, we want to validate against data in the prod GCP project from the test GCP project.

Are you interested in contributing this feature?

Yep

Anything else?

I might need help with the testing.

dbrtly avatar Feb 08 '24 04:02 dbrtly

Thanks for opening this @dbrtly !

Can you share more details about the use-case(s) you are trying to solve for?

Maybe you have a PR that made code changes to a model, and you're trying to check if it produces the same data output or not?

dbeatty10 avatar Feb 13 '24 20:02 dbeatty10

Yes exactly.

Currently, we purge BigQuery, then arrange the environment with:

  • clone state:modified
  • clone + state:modified --resource-type table
  • clone + state:modified --resource-type incremental
  • run clone +state:modified --resource-type view
  • seed +state:modified

But that still misses sources. A dbt command for cloning sources would work like the others.

A command that simplified all that would be even better:

dbt clone --target test --ci-arrange --state target=prod

Thanks, Daniel


dbrtly avatar Feb 14 '24 00:02 dbrtly

What is the end goal of the cloning step for sources? Is it to guarantee both environments are using the same exact input copy of the data? Is it to "freeze" the source data so that it can't change while running CI?

For continuous integration (CI) use cases, we recommend cloning incremental models as the first step of your CI job (only for warehouses that support zero-copy cloning). After that, we recommend deferring to the production environment (rather than cloning).

Is there some reason that using --defer doesn't work for you?

Because of where sources sit in the DAG, they are "off limits" for creating database objects -- they are read-only references to data rather than being editable.

dbeatty10 avatar Feb 14 '24 04:02 dbeatty10

We have found --defer to be buggy. It mostly works, but when it stops working, most of the team is unsure how to debug it.

I end up dropping everything else to do an emergency debug and fix. It hurts the credibility of the automated tests. Our GitHub notifications scream about the mysterious failed tests. It's tiring for me.

dbrtly avatar Feb 16 '24 07:02 dbrtly

Summary

dbt clone is restricted only to nodes within the DAG that dbt actually builds.

Since dbt only references sources and doesn't build them, it would be inconsistent (and potentially problematic) for us to clone them. So I'm going to close this issue as "not planned".

Follow-up questions about --defer

@dbrtly based on your experience, do you think there are bugs with --defer that we can reproduce and fix within dbt-core?

Or is its behavior unintuitive because it relies heavily on which objects do (or don't) exist within your current environment? (See below for explanations from our documentation about --defer.)

If it's truly a bug, would you be willing to open up bug reports as those occur? I'm not seeing anything outstanding here that looks like what you are describing.

Behavior of --defer

Here's the section of the documentation that explains some of the tricky bits:

When the --defer flag is provided, dbt will resolve ref calls differently depending on two criteria:

  1. Is the referenced node included in the model selection criteria of the current run?
  2. Does the referenced node exist as a database object in the current environment?

If the answer to both is no—a node is not included and it does not exist as a database object in the current environment—references to it will use the other namespace instead, provided by the state manifest.

Ephemeral models are never deferred, since they serve as "passthroughs" for other ref calls.
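
As a rough illustration of the two criteria quoted above, the rule can be modelled as a small Python function. This is a sketch of the documented behaviour only, not dbt's actual implementation. The name `resolve_ref_namespace` and its parameters are hypothetical.

```python
def resolve_ref_namespace(
    selected: bool,
    exists_in_current_env: bool,
    ephemeral: bool = False,
) -> str:
    """Model which namespace a ref resolves to when --defer is provided.

    Returns "state" for the namespace from the state manifest,
    or "current" for the current environment (hypothetical labels).
    """
    # Ephemeral models are never deferred.
    if ephemeral:
        return "current"
    # Defer only when the answer to BOTH questions is "no".
    if not selected and not exists_in_current_env:
        return "state"
    return "current"


# Walk through the four combinations of the two criteria.
for selected in (True, False):
    for exists in (True, False):
        print(selected, exists, resolve_ref_namespace(selected, exists))
```

The case that tends to surprise people is an unselected node that already exists in the current environment: it is not deferred, so a stale object left over from an earlier run is used in place of the production one.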

dbeatty10 avatar Feb 16 '24 17:02 dbeatty10

The developers and management have typically classed the test automation as broken when "it worked in dev".

The rough edges with the defer arguments have also been related to sources. We test our models in a different database than prod, but sometimes the sources are in the same database and sometimes not. Getting precise config for all the sources has been a journey.

There have also been permission issues with the service account in the test database having access to sources (particularly external sources). Cloning everything is relatively fast, cheap and easier as a brute force high-level validation that the tests are ready to delegate to the automation.

dbrtly avatar Feb 18 '24 05:02 dbrtly

Thanks for sharing more information about the situations you've run into @dbrtly 🧠

Even if it were possible to clone sources, you'd still need to sort out any permissions issues.

Neither of your situations sounds like a bug with --defer, but please do raise them if you run into any in the future.

dbeatty10 avatar Feb 19 '24 18:02 dbeatty10