dbt-core fix: override materialization python models

resolves #8520 , resolves #9703

Problem

Current Behavior

When working within my company's workflow, we outline all configurations, metadata, etc. in the schema.yml file, with broader configurations being described in the dbt_projects.yml. However, the materialization of Python models cannot be controlled through these files. This requires us to set the configuration within the model itself, as demonstrated below:

def model(dbt, session):
    dbt.config(materialization='incremental')

If this isn't done, the materialization always defaults to 'table' in the generated manifest.json.

commit that introduces this behavior: https://github.com/dbt-labs/dbt-core/blame/48d04e81417195393af5af1f78ef695f3398f193/core/dbt/parser/base.py#L217

Expected Behavior

I would expect that the materialization of Python models could be controlled through these configuration files, just as it is observed with SQL models. Ideally, without having to specify the materialization within the model itself. When defining configurations in these files, they should be respected, and the materialization shouldn't default to 'table' in the generated manifest.json unless explicitly set to do so. The current behavior introduces an inconsistency in the way dbt models are configured and managed.

Solution

I've considered three approaches to address this bug. Firstly, I'll outline the two solutions I discarded, followed by an explanation of why I chose the third.

First Solution: (discarded)

Remove the behavior introduced by commit 48d04e8141719539.

While this is a straightforward, it could break many existing pipelines. This would be unfavorable for many dbt users because it's a behavior available since v1.4.

Second Solution: (discarded)

The proposal here is to add a condition where the default materialization value would only used if it hasn't been predefined in a .yml file or within the model's .py file itself.

# core/dbt/parser/base.py:L192
if block.path.relative_path.endswith(".py"):
    language = ModelLanguage.python
    config.add_config_call({"_dbt_default_arg_materialized": "table"})

# core/dbt/context/context_config.py:L140
if patch_config_dict:
    result = self._update_from_config(result, patch_config_dict)

default_args = [k for k in config_call_dict if k.startswith("_dbt_default_arg_")]
result_keys = dir(result)
for key in default_args:
    config_key = key.replace("_dbt_default_arg_", "")
    if config_key in result_keys:
        del config_call_dict[key]
    else:
        config_call_dict[config_key] = config_call_dict[key]

However, this approach has two major issues:

1- The solution is messy and has a high cognitive load. 2- If the configuration for this materialization isn't defined in the model's .py or any .yml file, it defaults to creating a "view" materialization. This is because the NodeConfig class defaults to "view" as its materialization. https://github.com/dbt-labs/dbt-core/blob/e5e1a272ffbb1072332b7f2d71e31b805296ddcb/core/dbt/contracts/graph/model_config.py#L438-L441

It's just a hypothesis, but I believe this might be the very reason for commit 48d04e8141719539 since "view" is not supported for Python models.

Third Solution (preferred):

In base.py, the proposal is to add a key-value pair to the config_call_dict.

https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/parser/base.py#L215-L224,

Parameters described in _dbt_node_type_configs are used to overwrite the initial node configuration:

https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L136

Since we don't want the "_dbt_node_type_configs" to persist, I opted to remove it in the latest update.

https://github.com/devmessias/dbt-core/blob/21696c70c5c3c22003f4282405df9fad65a96ce5/core/dbt/context/context_config.py#L152-L156

Checklist

[x] I have read the contributing guide and understand what's expected of me
[X] I have run this code in development and it appears to resolve the stated issue
[x] This PR includes tests, or tests are not required/relevant for this PR
[x] This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

Aug 31 '23 21:08 devmessias

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Aug 31 '23 21:08 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 62.79%. Comparing base (7940ad5) to head (7d743da). Report is 53 commits behind head on main.

:exclamation: There is a different number of reports uploaded between BASE (7940ad5) and HEAD (7d743da). Click for more details.

HEAD has 23 uploads less than BASE

Flag BASE (7940ad5) HEAD (7d743da)

unit 4 1

integration 20 0

Flag	BASE (7940ad5)	HEAD (7d743da)
unit	4	1
integration	20	0

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #8538       +/-   ##
===========================================
- Coverage   89.22%   62.79%   -26.43%     
===========================================
  Files         183      183               
  Lines       23457    23630      +173     
===========================================
- Hits        20930    14839     -6091     
- Misses       2527     8791     +6264

Flag	Coverage Δ
integration	`?`
unit	`62.79% <100.00%> (+0.70%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Unit Tests	`62.79% <100.00%> (+0.70%)`	:arrow_up:
Integration Tests	`62.79% <100.00%> (-26.43%)`	:arrow_down:

Aug 31 '23 21:08 codecov[bot]

Hi @gshank,

Could I please receive feedback on this PR? One aspect I'm unable to understand is why the codecov is reducing the percentage of coverage. This PR did not introduce any branch conditions inside the code. I've noticed a similar decrease in some other PRs as well; should I be concerned about this at the moment?

Sep 08 '23 22:09 devmessias

I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this.

Sep 14 '23 14:09 gshank

I think the original "fix" (setting the materialization in the config_call_dict) was wrong, but I think this approach just doubles down on that wrong approach. This is a tricky area. I suspect that what's needed is some special setting in "update_parsed_node_config" in core/dbt/parser/base.py. The problem is detecting the difference between somebody actually setting the materialization vs it just defaulting. I'll have to poke around a bit more to see if I can find a good way to do this.

Thank you for your feedback, @gshank. I agree that this change may not be the ideal one.

It's a great idea to bring @aranke into this discussion, since they were responsible for the original commit that introduced this behavior.

Note: As I mentioned in the PR, it seems to me that the solution from @aranke is easier to reconcile the differences between Python and SQL models. I've tried to rethink the behavior that @aranke introduced and try to place it elsewhere but couldn't, maybe there really isn't a cleaner and fast solution yet, especially since it's a behavior that breaks the convention of all versions post 1.4.

Hey @aranke, could you share your thoughts to enrich our conversation? What do you think?

Sep 18 '23 14:09 devmessias

Hey @devmessias,

My original PR was just to default to table materialization for Python models, but I see that it's had this unintended side effect. I agree with @gshank that we should probably take a step back and check what the expected behavior should be.

Sep 25 '23 14:09 aranke

A big issue is that removing the default table behavior could break numerous pipelines, as views are not supported in Python models.

Sep 25 '23 16:09 devmessias

One option I am considering, given that @aranke only wanted to change the default materialization for Python (so I assume there is no point in considering future default changes), and as @gshank suggested it to be something in update_parsed_node_config, we could simply check if the language in parsed_node within the same function is a Python model and at the same time verify if the materialization is set as view, then change it to table. Since this is what causes the issue. What do you think? It would also have the advantage of being much simpler.

Sep 26 '23 01:09 devmessias

@devmessias I agree with your proposed solution. In update_parsed_node_config check that the language is python and that materialization is not set in the config dict, and if both of those are true, set it in the config dict.

Oct 02 '23 13:10 gshank

I've updated the PR with the proposed solution @gshank. Could you review the changes to ensure we're on the same page?

Oct 02 '23 17:10 devmessias

Ah, I forgot to mention one thing. The default is set before the update_node_config. As a result, the following test created by @aranke , which was passing in my previous suggestion, won't pass now. Would it be better to remove this test and create a new one that checks only after the update_node_config?

Oct 02 '23 18:10 devmessias

Yes, that would be better. If you can find a way to do it in unit tests, that would be fine, but it might be easier to just look at the manifest for the python node in a functional test.

Oct 02 '23 18:10 gshank

Hi @gshank. I've managed to determine how to retain the unit test. My previous approach was somewhat messy as it also had to manage config.build_config_dict(patch_config_dict=patch_config_dict) to address all issues within update_parsed_node_config. However, I've modified update_parsed_node_config to pass the model language as an argument to ContextConfigGenerator, thereby creating an appropriate NodeConfig for the Python model. I believe this method is superior. Could you please review my new commit and share your thoughts?

Oct 03 '23 12:10 devmessias

Hi @aranke . What do you think about this new approach?

Oct 13 '23 13:10 devmessias

Hi @gshank, just circling back on this pull request. Would appreciate your feedback when it's convenient for you. Thanks!

Oct 24 '23 15:10 devmessias

Hi @dbeatty10, @gshank and @aranke, any updates on this bug/MR? I'm not sure about the dbt-core roadmap, or if bug #8520 was resolved in another MR, because if so, I can close this PR.

Dec 27 '23 00:12 devmessias

@graciegoheen and @peterallenwebb when you take a look at this, could you also check if it resolves https://github.com/dbt-labs/dbt-core/issues/9703?

The reprex is here: https://github.com/dbt-labs/dbt-core/issues/9703#issuecomment-1979983639

Apr 12 '24 01:04 dbeatty10

ping everyone!

Oct 07 '24 13:10 devmessias