dbt-core
Defining vars, folder-level configs outside dbt_project.yml
Describe the feature
From @benjaminsingleton:
I'd like to use project-level variables more, but I'm concerned about bloat to my already large dbt_project.yml file. I think it would be helpful if I could create a variables.yml file that could be imported in dbt_project.yml. And for that matter, the same could be done for other configurations in the dbt_project.yml file. I think having the ability to separate configurations into different files might make for improved modularity / separation of concerns (particularly for large projects), not to mention fewer merge conflicts. CC @jrandrews
Describe alternatives you've considered
- We're already thinking of enabling some configs in resource-YAML files (#2401), but these would be at the level of the individual resource (model/seed/snapshot/etc) only
- The dbt_project.yml gets really really big??
Additional context
- I don't think this has any correspondence to v1.0. It's a nice thing to have, and we could do it before, after, or any time without it being a breaking change in any way.
Who will this benefit?
- Developers and maintainers of increasingly big dbt projects
Some additional thoughts/failure modes/concerns to think about with this:
- In large dbt projects, one problem with project-wide variables is the potential for developers to "step" on each other by editing, overwriting, or conflicting with each other's var declarations. If devs were meticulous about checking the project-wide relevance of a given var, then this might happen less or not at all, but that is unfortunately often not the case. If we allow variable declaration (and other things) outside of just one file (say dbt_project.yml), and it is an arbitrary number of files, then I can see Dev1 defining my_var_p in variable_file_1.yml and then Dev2 defining my_var_p in variable_file_2.yml. I suppose/hope that dbt would detect that and throw an error, but there are still some clunky workflow issues in allowing variable declarations in multiple different .yml files.
- Vars need to be parsed before other .yml files for declarations around models, tests, etc. are parsed. Right now this problem is handled by having one hard-coded .yml file (that is, dbt_project.yml) to be parsed before the other .yml files, but if we loosen this then dbt still needs a way to be able to determine how/what to parse in "pass 1" of parsing for vars (and I am sure a lot of other things that other, smarter people than I know already happen first :) ) versus "pass 2" of parsing for other things like tests, models, etc. And not just dbt -- this understanding of what .yml file gets parsed when needs to be not-too-hard to quickly understand for average devs. Otherwise people will be just littering random var declarations mixed in with tests and model config and then getting confused why things don't work.
One thought that comes to me - what if we had another separate set of subdirectories that were specifically dedicated to .yml files for config, like we already have a subdirectory declaration/space for snapshots, models, etc. Only things related to/extracted from dbt_project.yml could be put in there, and any other things related to models, snapshots, etc., would trigger a parsing error. This doesn't fix problem 1 above but it at least helps with problem 2. What say ye?
P.S. Also, I know var namespacing was removed for dbt_project.yml v2 config in .17 for some good reasons but it's also pretty hard not to have any way to do variable scoping in larger projects.
@codigo-ergo-sum I don't think I've ever seen you post from this alt account before. It goes without saying that I like the handle.
> what if we had another separate set of subdirectories that were specifically dedicated to .yml files for config, like we already have a subdirectory declaration/space for snapshots, models, etc.
This is along the lines of what I was thinking: either an explicit set of subdirectories, or an explicit set of named files. I've been around just long enough to remember when packages was a special dict in dbt_project.yml rather than its own file; we split it out because we expected it to grow in size, and because it served a distinct purpose. We made the same choice for selectors.yml.
I'd be especially keen on a vars.yml: variables have a slightly different parsing context, we can be strict about accepting only literal values, and we could even do a better job of parsing vars.yml before parsing dbt_project.yml. That would make default values of vars called in dbt_project.yml work the way folks expect, rather than how it is today. I like that correspondence between vars.yml and CLI --vars, similar to how env vars can be sourced from an *.env file or prepended to a CLI command.
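To make the correspondence between a vars file and CLI --vars concrete, here is a minimal Python sketch of the lookup order, assuming a hypothetical vars.yml has already been parsed into a dict. Values passed on the command line win over file defaults, the same way --vars already overrides vars declared in dbt_project.yml. The names resolve_var, cli_vars and file_vars are made up for this illustration; they are not dbt internals.

```python
from collections import ChainMap

_MISSING = object()  # sentinel so that None can still be a legitimate default


def resolve_var(name, cli_vars, file_vars, default=_MISSING):
    """Look up a var: CLI values first, then file defaults, then the call-site default."""
    layered = ChainMap(cli_vars, file_vars)  # earlier mappings take precedence
    if name in layered:
        return layered[name]
    if default is not _MISSING:
        return default
    raise KeyError(f"Required var '{name}' not found")


# Hypothetical contents of vars.yml, already parsed into a dict.
file_vars = {"stage": "dev", "start_date": "2020-01-01"}
# Rough equivalent of: dbt run --vars '{"stage": "prod"}'
cli_vars = {"stage": "prod"}

print(resolve_var("stage", cli_vars, file_vars))       # the CLI value is used
print(resolve_var("start_date", cli_vars, file_vars))  # falls back to the file
```

The sentinel keeps "no default given" distinct from "default is None", which matters if a var is allowed to be null.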
Configurations feel a bit trickier, because these can be especially verbose. How to coordinate hierarchies across multiple files without someone tripping over someone else? I'm honestly not sure. The cleanest separation I can envision would be allowing a project to have one each of models.yml, seeds.yml, etc.
> P.S. Also, I know var namespacing was removed for dbt_project.yml v2 config in .17 for some good reasons but it's also pretty hard not to have any way to do variable scoping in larger projects.
This is fair. I wonder if the ability to scope vars differently for different model subsets may ultimately serve as a valid reason to split very big projects up into multiple sub-projects, installed as packages. That's regardless of whether they live in the same or separate repositories.
Thanks for the compliment on the username @jtcohen6 :).
I think a vars.yml file would be a definite improvement over the current situation.
Would it be required or could vars also still be defined in dbt_project.yml? If required then that probably requires a new version 3 of the schema version for dbt_project.yml which is a significant change for existing users, right?
If not required, then what would the behavior look like if vars are defined in both places, and if they conflict? And are you suggesting that vars.yml would be parsed before dbt_project.yml is parsed? Allowing full, "no-gotcha" usage of vars in dbt_project.yml?
I absolutely support the idea of parsing the vars before dbt_project.yml. This would enable leveraging vars in many additional ways, e.g. to enable or disable subfolders or to define schemas from vars, without losing the ability to simply run a model using the vars' default values.
Having outside vars available in dbt_project.yml would be a huge improvement.
We have a complex dbt_project.yml, with lots of repetition and heavy use of Jinja.
For example:
source-paths:
- modules/shared/models
- modules/module1/models
- modules/module2/models
- stages/{{ env_var('DBT_STAGE', '@fake@') }}/models
We then enable a particular stage via an env var during deployment.
It would be great to set the stage once in the vars file, and then just use the var itself in the config. It would also help to be able to define module names / prefixes, or even an entire array of modules to loop over.
Hello
In our company, we use dbt a lot in a multi-tenant context. For that purpose, we rely heavily on dbt variables, which we use to propagate the client configuration. Those configurations can differ a lot from one client to another.
Sometimes we hit the error "argument list too long: dbt", which is due to the large config payload (some payloads can reach more than 600Kb).
We have not found a proper workaround so far. Passing a file path instead of a payload for our variables would probably solve our issue. This is why we are keen to know whether there is any chance you will consider such a feature for dbt. (cc. @jtcohen6)
Thank you in advance
++ this feature. The solution implemented by Jekyll (with _data directory) comes to mind as suitable.
Hi all,
I agree that a vars.yml would be a good boost, but it would only give you global variables. Based on my experience, I think you should also consider a solution for local variables: some way for a model's configuration to point to a model_vars.yml and use the variables for that specific model from there.
Thank you.
I agree. Hope this will get implemented soon, as it is always good practice to modularize the configurations rather than having everything in the same single file.
Does dbt still only allow global variables defined in dbt_project.yml?
Just looking through issue backlogs and wanted to bump this... Would be great as we are working with projects that have tens or even hundreds of variables now. Also the lack of ability to namespace them is still challenging.
I'm also facing this problem and would very much love some ideas of how to tackle it!
Currently we're trying to work around this issue by using environment variables (and tooling via direnv and a .envrc file).
Sneaky workaround whilst we wait for this to be built into core. Basically move var declarations into macro files:
https://gist.github.com/jeremyyeo/06d552ee8facc8100416655ebc25d9b9
> Sneaky workaround whilst we wait for this to be built into core. Basically move var declarations into macro files:
> https://gist.github.com/jeremyyeo/06d552ee8facc8100416655ebc25d9b9
This is exactly what we started to POC in our dbt stack: using a dedicated macro file to load a bigger JSON payload. The idea is to generate a macro file containing all dbt variables. In the end it should look something like this:
{% macro get_config() %}
  {{ return(fromjson("<JSON_CONTENT_HERE>")) }}
{% endmacro %}
And then you can use it in your model:
{% set some_var = get_config().get(...) %}
That's a workaround that should do the job.
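The generation step described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: render_config_macro is a hypothetical helper, and the escaping relies on Jinja accepting backslash-escaped double quotes inside a string literal, which is worth verifying against your own dbt version before relying on it. The JSON is serialized twice: once to get the document, and once more to turn that document into a quoted, escaped string literal.

```python
import json


def render_config_macro(config, macro_name="get_config"):
    """Render the text of a dbt macro that returns `config` through fromjson()."""
    payload = json.dumps(config)   # the JSON document itself
    literal = json.dumps(payload)  # same text as a double-quoted, escaped string literal
    return (
        f"{{% macro {macro_name}() %}}\n"
        f"  {{{{ return(fromjson({literal})) }}}}\n"
        f"{{% endmacro %}}\n"
    )


# Hypothetical per-client configuration.
client_config = {"client": "acme", "limit": 10}
macro_text = render_config_macro(client_config)
print(macro_text)
```

The resulting text would then be written to a file under the project's macros directory before invoking dbt, so the payload never has to travel through the command line.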
Folder-level configs would make a huge difference on my project. With 50+ developers and growing we don't want anyone to modify project-level files day-to-day, but we do want them to manage many files and folders in their subject area.
Folder-level configs would do this. Clearly the need is there, which is why they are featured in dbt_project.yml, but this causes governance and git conflict problems when many teams try to make changes to project-level files at the same time.
Basically, I need to treat our subject areas as mini-projects, each mini-project having its own configuration.
Within https://github.com/dbt-labs/dbt-core/issues/8869, @slotrans described var() not being able to see vars defined in dbt_project.yml for the purposes of configuring query-comment.
If this feature request were implemented, it would solve that use case.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
@jtcohen6 & @dbeatty10 - are y'all open to contributions on this one?
@ciprian-mandras: Regarding this:
> I agree that a vars.yml would be a good boost, but it would only give you global variables. Based on my experience, I think you should also consider a solution for local variables: some way for a model's configuration to point to a model_vars.yml and use the variables for that specific model from there.
Couldn't you just use model configs for that? In our project, we put all sorts of stuff in the meta key all the time.
Like this:
some_model.yml:
models:
  - name: some_model
    description: Something something
    config:
      tags: [tag1, tag2]
      meta:
        key1:
          key_x: value
          key_y: value
          key_z: value
        key2:
          key_a: value
          key_b:
            key: value
        key3: value
        key4: value
        key5: value
For everyone talking about namespacing variables, couldn't you just do it within a dict variable?
Like this:
vars:
  namespace1:
    key1: value1
    key2: value2
  namespace2:
    key1: value_x
    key2: value_y
And then you could retrieve those with an alternative macro (instead of using var('name')). Something like this:
{%- macro ns_var(namespace, key, default = None) -%}
  {%- if default == None and key not in var(namespace) -%}
    {{ exceptions.raise_compiler_error("Missing variable '" ~ key ~ "' in namespace " ~ namespace) }}
  {%- endif -%}
  {{- return(var(namespace).get(key, default)) -}}
{%- endmacro -%}
And call it like:
{{ ns_var('namespace1', 'key1') }}
{{ ns_var('namespace1', 'key3', 'default') }}
{{ ns_var('namespace1', 'key3') }} -- Error
Or you could be fancy and do stuff like:
{{ ns_var('namespace1.key1') }} -- Although `default` remains a separate param, which is weird.
{{ ns1_var('key1') }} -- Hardcoded namespace inside of this macro.
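The lookup logic behind the dotted form can be sketched in Python as below. This only illustrates the traversal; in a real project it would live in a Jinja macro like the one above. The function takes the vars as a plain nested dict, and the sentinel _MISSING is an assumption of this sketch, added so that None can be passed as a real default.

```python
_MISSING = object()  # distinguishes "no default given" from default=None


def ns_var(project_vars, path, default=_MISSING):
    """Resolve a dotted path such as 'namespace1.key1' in a nested dict of vars."""
    current = project_vars
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif default is not _MISSING:
            return default
        else:
            raise KeyError(f"Missing variable '{path}' (failed at '{part}')")
    return current


# Same shape as the namespaced vars shown earlier.
project_vars = {
    "namespace1": {"key1": "value1", "key2": "value2"},
    "namespace2": {"key1": "value_x", "key2": "value_y"},
}

print(ns_var(project_vars, "namespace1.key1"))
print(ns_var(project_vars, "namespace1.key3", "default"))
```

Calling it with a missing key and no default raises an error, mirroring the third call in the Jinja example.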
@codigo-ergo-sum: Regarding this:
> I think a vars.yml file would be a definite improvement over the current situation.
> Would it be required or could vars also still be defined in dbt_project.yml? If required then that probably requires a new version 3 of the schema version for dbt_project.yml which is a significant change for existing users, right?
> If not required, then what would the behavior look like if vars are defined in both places, and if they conflict? And are you suggesting that vars.yml would be parsed before dbt_project.yml is parsed? Allowing full, "no-gotcha" usage of vars in dbt_project.yml?
I think multiple var files could even easily be supported. The only validation dbt would have to do is to make sure the same variable (i.e. top-level key/namespace) does not exist in more than one file (including dbt_project.yml, if any variables are still defined there). Otherwise, dbt should produce an error. That's it. The end.
Currently that is supported (although it's probably just how PyYAML loads the file), but it probably shouldn't be (and it's a reasonable breaking change as it's a very easy fix: just remove the duplicate):
vars:
  key: 123 # Just delete this one, it does nothing.
  key: 456 # Right now, this one "wins".
So we could end up with something like vars_abc.yml:
vars:
  abc:
    key: value
    ...
And vars_xyz.yml:
vars:
  xyz:
    key: value
    ...
But not vars_qrs.yml:
vars:
  abc: # Can't use `abc` again!
    key: value
    ...
And since those are specifically var files, we wouldn't even need that top-level vars: key.
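The duplicate check described above is small enough to sketch in Python. This is a rough illustration under the assumption that each var file has already been parsed into a dict; merge_var_files is a hypothetical helper, not something dbt provides, and the file names are the ones from the example.

```python
def merge_var_files(files):
    """Merge {filename: parsed_vars} into one dict, rejecting duplicate top-level keys."""
    merged = {}
    owner = {}  # top-level key -> file that first defined it
    for filename, contents in files.items():
        for key, value in contents.items():
            if key in owner:
                raise ValueError(
                    f"Var '{key}' is defined in both {owner[key]} and {filename}"
                )
            owner[key] = filename
            merged[key] = value
    return merged


files = {
    "vars_abc.yml": {"abc": {"key": "value"}},
    "vars_xyz.yml": {"xyz": {"key": "value"}},
}
print(merge_var_files(files))

# Adding a file that reuses the `abc` namespace is rejected.
files["vars_qrs.yml"] = {"abc": {"key": "value"}}
try:
    merge_var_files(files)
except ValueError as err:
    print(err)
```

Tracking which file first defined each key lets the error name both offending files, which is the part that matters when two teams collide.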
> Folder-level configs would make a huge difference on my project. With 50+ developers and growing we don't want anyone to modify project-level files day-to-day, but we do want them to manage many files and folders in their subject area.
> Folder-level configs would do this. Clearly the need is there, which is why they are featured in dbt_project.yml, but this causes governance and git conflict problems when many teams try to make changes to project-level files at the same time.
> Basically, I need to treat our subject areas as mini-projects, each mini-project having its own configuration.
@markproctor1: I think scattering variables/files everywhere is a terrible idea/bad practice. What do you think of the namespace idea suggested above instead?
And if dbt added support for multiple var files (all in one specific place), each of your subject areas/mini-projects could have its own file & namespace. No more merge conflicts!
EDIT: Although, thinking about it again now... scattering var files all over the place could be acceptable if dbt_project.yml had a config like this:
var-paths:
- variables.yml
- team1/variables.yml
- team2/some_folder/variables.yml
- ...
But then the variable files couldn't be loaded before dbt_project.yml, which sounded like a nice advantage to have. So probably still not a good idea to scatter var files around.
Instead, we could just have a simple vars folder, which should be enough for 99.9% of dbt users:
vars/globals.yml
vars/team1.yml
vars/team2.yml
EDIT 2: UNLESS... dbt could also add a new paths.yml file, which would be loaded first. Then the var-paths would be loaded. And then dbt_project.yml and the rest of the stuff would be loaded.
But all that would be completely optional (like a power/advanced feature), and only enabled if paths.yml exists.
And for those who want to use variables inside of their paths for certain things (models paths, etc.), then just don't put those paths inside of paths.yml! Keep them inside of dbt_project.yml, which should now have access to all the project variables.