dbt-core
Clone sources
Is this your first time submitting a feature request?
- [X] I have read the expectations for open source contributors
- [X] I have searched the existing issues, and I could not find an existing issue for this feature
- [X] I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion
Describe the feature
dbt clone --target ci_env --select +modified --resource-type source --state target=prod
I want to clone sources when they exist upstream of the modified models.
Describe alternatives you've considered
- Clone less specifically (for example, the whole raw layer)
- Hardcode the project in the source configuration (not officially supported, and it has IAM ramifications)
Who will this benefit?
For model testing, we want to validate against data in the prod GCP project from the test GCP project.
Are you interested in contributing this feature?
Yep
Anything else?
I might need help with the testing.
Thanks for opening this @dbrtly !
Can you share more details about the use-case(s) you are trying to solve for?
Maybe you have a PR that made code changes to a model, and you're trying to check if it produces the same data output or not?
Yes exactly.
Currently, we purge BigQuery, then arrange the environment with:
- clone state:modified
- clone +state:modified --resource-type table
- clone +state:modified --resource-type incremental
- run clone +state:modified --resource-type view
- seed +state:modified
But that still misses sources. A dbt command for cloning sources would be consistent with the others.
A command that simplified all that would be even better:
dbt clone --target test --ci-arrange --state target=prod
Thanks, Daniel
What is the end goal of the cloning step for sources? Is it to guarantee both environments are using the same exact input copy of the data? Is it to "freeze" the source data so that it can't change while running CI?
For continuous integration (CI) use cases, we recommend cloning incremental models as the first step of your CI job (only for warehouses that support zero-copy cloning). After that, we recommend deferring to the production environment (rather than cloning).
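As a rough sketch of that two-step CI job, the snippet below only assembles the two command lines and prints them; it does not run dbt. The selector strings follow my reading of the dbt documentation on cloning in CI and should be treated as assumptions to check against your dbt version. `prod-artifacts` is a hypothetical directory holding the production `manifest.json`, and `ci_commands` is an illustrative helper name.

```python
import shlex


def ci_commands(state_dir: str) -> list[list[str]]:
    """Build (but do not run) the argv lists for a clone-then-defer CI job."""
    # Step 1: clone only pre-existing incremental models that are modified
    # or downstream of a modification (selector is an assumption).
    clone_incrementals = [
        "dbt", "clone",
        "--select", "state:modified+,config.materialized:incremental,state:old",
        "--state", state_dir,
    ]
    # Step 2: build the modified models and defer everything else to the
    # environment described by the state manifest.
    build_with_defer = [
        "dbt", "build",
        "--select", "state:modified+",
        "--defer",
        "--state", state_dir,
    ]
    return [clone_incrementals, build_with_defer]


if __name__ == "__main__":
    # "prod-artifacts" is a hypothetical path to the production artifacts.
    for argv in ci_commands("prod-artifacts"):
        print(shlex.join(argv))
```

Keeping the commands as argv lists makes it easy to hand them to `subprocess.run` later without worrying about shell quoting of the comma-separated selector.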
Is there some reason that using --defer doesn't work for you?
Because of where sources sit in the DAG, they are "off limits" for creating database objects -- they are read-only references to data rather than being editable.
We have found --defer to be buggy. It mostly works, but when it stops working, most of the team is unsure how to debug it.
I end up dropping everything else to do an emergency debug and fix. It undermines the credibility of the automated tests. Our GitHub notifications scream about the mysterious failed tests. It's tiring for me.
Summary
dbt clone is restricted only to nodes within the DAG that dbt actually builds.
Since dbt only references sources and doesn't build them, it would be inconsistent (and potentially problematic) for us to clone them. So I'm going to close this issue as "not planned".
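To illustrate the distinction, here is a minimal sketch that filters a list of nodes down to the ones dbt builds itself. The `BUILT_BY_DBT` set and the node dictionaries are assumptions made for illustration; they are not dbt-core's internal data structures.

```python
# Assumption: resource types that dbt materializes itself.
# This list is illustrative, not copied from dbt-core.
BUILT_BY_DBT = {"model", "seed", "snapshot"}


def clone_candidates(nodes: list[dict]) -> list[str]:
    """Return the names of nodes that dbt builds, skipping read-only sources."""
    return [n["name"] for n in nodes if n["resource_type"] in BUILT_BY_DBT]


if __name__ == "__main__":
    # Hypothetical project contents.
    nodes = [
        {"name": "raw_orders", "resource_type": "source"},
        {"name": "stg_orders", "resource_type": "model"},
        {"name": "country_codes", "resource_type": "seed"},
    ]
    print(clone_candidates(nodes))  # ['stg_orders', 'country_codes']
```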
Follow-up questions about --defer
@dbrtly based on your experience, do you think there are bugs with --defer that we can reproduce and fix within dbt-core?
Or is its behavior unintuitive because it relies heavily on which objects do (or don't) exist within your current environment? (See below for explanations from our documentation about --defer.)
If it's truly a bug, would you be willing to open up bug reports as those occur? I'm not seeing anything outstanding here that looks like what you are describing.
Behavior of --defer
Here's the section of the documentation that explains some of the tricky bits:
When the --defer flag is provided, dbt will resolve ref calls differently depending on two criteria:
- Is the referenced node included in the model selection criteria of the current run?
- Does the referenced node exist as a database object in the current environment?
If the answer to both is no—a node is not included and it does not exist as a database object in the current environment—references to it will use the other namespace instead, provided by the state manifest.
Ephemeral models are never deferred, since they serve as "passthroughs" for other ref calls.
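The rule quoted above can be sketched as a small decision function. This is a simplified model of the documented behavior, not dbt-core's implementation; the `Node` class and `resolve_namespace` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    selected: bool                # included in the current run's selection?
    exists_in_current_env: bool   # exists as a database object in this environment?
    ephemeral: bool = False


def resolve_namespace(node: Node) -> str:
    """Return "state" if a ref to this node is deferred, else "current"."""
    # Ephemeral models are never deferred.
    if node.ephemeral:
        return "current"
    # Defer only when the answer to BOTH criteria is "no".
    if not node.selected and not node.exists_in_current_env:
        return "state"
    return "current"


if __name__ == "__main__":
    print(resolve_namespace(Node("dim_customers", False, False)))  # state
    print(resolve_namespace(Node("dim_customers", False, True)))   # current
    print(resolve_namespace(Node("dim_customers", True, False)))   # current
```

The second case is the one that tends to surprise people: a stale object left behind in the CI or dev schema is enough to stop a ref from being deferred, which is why the outcome depends so heavily on what already exists in the current environment.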
The developers and management have typically classed the test automation as broken when "it worked in dev".
The rough edges with the defer arguments have also been related to sources. We test our models in a different database than prod, but sometimes the sources are in the same database and sometimes not. Getting precise config for all the sources has been a journey.
There have also been permission issues with the service account in the test database having access to sources (particularly external sources). Cloning everything is relatively fast and cheap, and it is easier as a brute-force, high-level validation that the tests are ready to delegate to the automation.
Thanks for sharing more information about the situations you've run into @dbrtly 🧠
Even if it were possible to clone sources, you'd still need to sort out any permissions issues.
Neither of your situations sounds like a bug with --defer, but please do raise them if you run into any in the future.