fix: override materialization python models
resolves #8520 , resolves #9703
Problem
Current Behavior
When working within my company's workflow, we outline all configurations, metadata, etc. in the schema.yml file, with broader configurations being described in the dbt_projects.yml. However, the materialization of Python models cannot be controlled through these files. This requires us to set the configuration within the model itself, as demonstrated below:
def model(dbt, session):
dbt.config(materialization='incremental')
If this isn't done, the materialization always defaults to 'table' in the generated manifest.json.
commit that introduces this behavior: https://github.com/dbt-labs/dbt-core/blame/48d04e81417195393af5af1f78ef695f3398f193/core/dbt/parser/base.py#L217
Expected Behavior
I would expect that the materialization of Python models could be controlled through these configuration files, just as it is observed with SQL models. Ideally, without having to specify the materialization within the model itself. When defining configurations in these files, they should be respected, and the materialization shouldn't default to 'table' in the generated manifest.json unless explicitly set to do so. The current behavior introduces an inconsistency in the way dbt models are configured and managed.
Solution
I've considered three approaches to address this bug. Firstly, I'll outline the two solutions I discarded, followed by an explanation of why I chose the third.
First Solution: (discarded)
Remove the behavior introduced by commit 48d04e8141719539.
While this is a straightforward, it could break many existing pipelines. This would be unfavorable for many dbt users because it's a behavior available since v1.4.
Second Solution: (discarded)
The proposal here is to add a condition where the default materialization value would only used if it hasn't been predefined in a .yml file or within the model's .py file itself.
# core/dbt/parser/base.py:L192
if block.path.relative_path.endswith(".py"):
language = ModelLanguage.python
config.add_config_call({"_dbt_default_arg_materialized": "table"})
# core/dbt/context/context_config.py:L140
if patch_config_dict:
result = self._update_from_config(result, patch_config_dict)
default_args = [k for k in config_call_dict if k.startswith("_dbt_default_arg_")]
result_keys = dir(result)
for key in default_args:
config_key = key.replace("_dbt_default_arg_", "")
if config_key in result_keys:
del config_call_dict[key]
else:
config_call_dict[config_key] = config_call_dict[key]
However, this approach has two major issues:
1- The solution is messy and has a high cognitive load. 2- If the configuration for this materialization isn't defined in the model's .py or any .yml file, it defaults to creating a "view" materialization. This is because the NodeConfig class defaults to "view" as its materialization. https://github.com/dbt-labs/dbt-core/blob/e5e1a272ffbb1072332b7f2d71e31b805296ddcb/core/dbt/contracts/graph/model_config.py#L438-L441
It's just a hypothesis, but I believe this might be the very reason for commit 48d04e8141719539 since "view" is not supported for Python models.
Third Solution (preferred):
In base.py, the proposal is to add a key-value pair to the config_call_dict.
https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/parser/base.py#L215-L224,
Parameters described in _dbt_node_type_configs are used to overwrite the initial node configuration:
https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L136
Since we don't want the "_dbt_node_type_configs" to persist, I opted to remove it in the latest update.
https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L152-L156
Checklist
- [x] I have read the contributing guide and understand what's expected of me
- [X] I have run this code in development and it appears to resolve the stated issue
- [x] This PR includes tests, or tests are not required/relevant for this PR
- [x] This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 62.79%. Comparing base (
7940ad5) to head (7d743da). Report is 53 commits behind head on main.
:exclamation: There is a different number of reports uploaded between BASE (7940ad5) and HEAD (7d743da). Click for more details.
HEAD has 23 uploads less than BASE
Flag BASE (7940ad5) HEAD (7d743da) unit 4 1 integration 20 0
Additional details and impacted files
@@ Coverage Diff @@
## main #8538 +/- ##
===========================================
- Coverage 89.22% 62.79% -26.43%
===========================================
Files 183 183
Lines 23457 23630 +173
===========================================
- Hits 20930 14839 -6091
- Misses 2527 8791 +6264
| Flag | Coverage Δ | |
|---|---|---|
| integration | ? |
|
| unit | 62.79% <100.00%> (+0.70%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Components | Coverage Δ | |
|---|---|---|
| Unit Tests | 62.79% <100.00%> (+0.70%) |
:arrow_up: |
| Integration Tests | 62.79% <100.00%> (-26.43%) |
:arrow_down: |
Hi @gshank,
Could I please receive feedback on this PR? One aspect I'm unable to understand is why the codecov is reducing the percentage of coverage. This PR did not introduce any branch conditions inside the code. I've noticed a similar decrease in some other PRs as well; should I be concerned about this at the moment?
I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this.
I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this.
Thank you for your feedback, @gshank. I agree that this change may not be the ideal one.
It's a great idea to bring @aranke into this discussion, since they were responsible for the original commit that introduced this behavior.
Note: As I mentioned in the PR, it seems to me that the solution from @aranke is easier to reconcile the differences between Python and SQL models. I've tried to rethink the behavior that @aranke introduced and try to place it elsewhere but couldn't, maybe there really isn't a cleaner and fast solution yet, especially since it's a behavior that breaks the convention of all versions post 1.4.
Hey @aranke, could you share your thoughts to enrich our conversation? What do you think?
Hey @devmessias,
My original PR was just to default to table materialization for Python models, but I see that it's had this unintended side effect.
I agree with @gshank that we should probably take a step back and check what the expected behavior should be.
A big issue is that removing the default table behavior could break numerous pipelines, as views are not supported in Python models.
One option I am considering, given that @aranke only wanted to change the default materialization for Python (so I assume there is no point in considering future default changes), and as @gshank suggested it to be something in update_parsed_node_config, we could simply check if the language in parsed_node within the same function is a Python model and at the same time verify if the materialization is set as view, then change it to table. Since this is what causes the issue. What do you think? It would also have the advantage of being much simpler.
@devmessias I agree with your proposed solution. In update_parsed_node_config check that the language is python and that materialization is not set in the config dict, and if both of those are true, set it in the config dict.
I've updated the PR with the proposed solution @gshank. Could you review the changes to ensure we're on the same page?
Ah, I forgot to mention one thing. The default is set before the update_node_config. As a result, the following test created by @aranke , which was passing in my previous suggestion, won't pass now. Would it be better to remove this test and create a new one that checks only after the update_node_config?
Yes, that would be better. If you can find a way to do it in unit tests, that would be fine, but it might be easier to just look at the manifest for the python node in a functional test.
Hi @gshank. I've managed to determine how to retain the unit test.
My previous approach was somewhat messy as it also had to manage config.build_config_dict(patch_config_dict=patch_config_dict) to address all issues within update_parsed_node_config. However, I've modified update_parsed_node_config to pass the model language as an argument to ContextConfigGenerator, thereby creating an appropriate NodeConfig for the Python model. I believe this method is superior. Could you please review my new commit and share your thoughts?
Hi @aranke . What do you think about this new approach?
Hi @gshank, just circling back on this pull request. Would appreciate your feedback when it's convenient for you. Thanks!
Hi @dbeatty10, @gshank and @aranke, any updates on this bug/MR? I'm not sure about the dbt-core roadmap, or if bug #8520 was resolved in another MR, because if so, I can close this PR.
@graciegoheen and @peterallenwebb when you take a look at this, could you also check if it resolves https://github.com/dbt-labs/dbt-core/issues/9703?
The reprex is here: https://github.com/dbt-labs/dbt-core/issues/9703#issuecomment-1979983639
ping everyone!