dbt-databricks
Expecting the all-purpose cluster not to be utilised when job cluster details are configured
Describe the bug
When I try the feature of running Python models on a job cluster instead of the all-purpose cluster, both clusters are triggered: the all-purpose cluster starts first, and then the job cluster follows.
Steps To Reproduce
- Configure all-purpose cluster details in profiles.yaml
- Configure the job cluster in dbt_project.yml, following the instructions at https://docs.getdbt.com/docs/build/python-models#specific-data-platforms (a sketch of both files follows this list)
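For context, a minimal sketch of the two configurations involved, based on the dbt-databricks docs linked above; the host, IDs, node type, and runtime version below are placeholders, not values taken from this report:

```yaml
# profiles.yaml -- connection details, pointing at an all-purpose cluster
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: analytics
      host: dbc-xxxxxxxx.cloud.databricks.com
      http_path: sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh
      token: "{{ env_var('DATABRICKS_TOKEN') }}"

# dbt_project.yml -- run Python models on an ephemeral job cluster instead
models:
  my_project:
    +submission_method: job_cluster
    +job_cluster_config:
      spark_version: "13.3.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 2
```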
Expected behavior
Once job cluster details are configured, the all-purpose cluster should not be triggered.
Screenshots and log output
System information
The output of dbt --version:
Core:
- installed: 1.7.3
- latest: 1.7.7 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- databricks: 1.7.2 - Update available!
- spark: 1.7.1 - Up to date!
At least one plugin is out of date or incompatible with dbt-core.
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
The operating system you're using:
macOS (should be irrelevant)
The output of python --version:
Python 3.10.13
Additional context
This is expected behavior: Python models are integrated into the rest of your dbt project using SQL (for example, on an incremental model, the merge behavior is conducted in SQL), and that SQL is executed on the all-purpose cluster. We are investigating ways to make Python model behavior more 'Spark-like', but for now I would call this an enhancement request rather than a bug, as it is consistent with the structure imposed by dbt-core.
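To make that split concrete, here is a minimal hypothetical incremental Python model (not taken from this report): the model body runs on the configured job cluster, while the merge that dbt generates from the returned DataFrame is plain SQL and runs over the all-purpose cluster connection from profiles.yaml.

```python
# models/my_python_model.py -- hypothetical example for illustration
def model(dbt, session):
    dbt.config(
        materialized="incremental",       # the merge step is generated as SQL
        submission_method="job_cluster",  # the Python body runs on the job cluster
        unique_key="id",
    )
    df = dbt.ref("upstream_model")  # Spark DataFrame from an upstream model
    # ...Python/Spark transformations execute on the job cluster...
    return df  # dbt materializes this via a SQL merge on the all-purpose cluster
```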
Thanks, @benc-db. That clears up my doubts.
@benc-db Would it be possible to use a simpler approach when running a Python model on a job cluster, like the following (see the sketch after this list):
- dbt creates a new notebook for the Python model
- the new notebook is executed from within dbt, in its own process, using the Python command dbutils.notebook.run("....") (see "Run a Databricks notebook from another notebook")
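A rough sketch of what that proposal could look like from code already running on the job cluster (dbutils is only available in a Databricks notebook context); the notebook path, timeout, and arguments here are made up for illustration:

```python
# Hypothetical flow: dbt has written the compiled Python model out as a
# workspace notebook, then runs it as a child notebook on the same cluster.
result = dbutils.notebook.run(
    "/Shared/dbt/my_python_model",     # path of the generated notebook
    3600,                              # timeout in seconds
    {"invocation_id": "dbt-run-123"},  # arguments passed to the child notebook
)
print(result)  # value returned via dbutils.notebook.exit() in the child
```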
I am not sure, but it looks to me that the strict separation between dbt execution (dbt's Python code) and model execution (putting the model into an isolated space) is a bit oversized on Databricks job clusters, because the job will run on Spark on the driver node anyway. But maybe I am not getting the full picture of this issue...