dbt-databricks
Expecting the all-purpose cluster not to be utilised when job cluster details are configured
Describe the bug
When I try the feature of running Python models on a job cluster instead of the all-purpose cluster, both clusters are triggered: the all-purpose cluster starts first, and then the job cluster follows.
Steps To Reproduce
- Configure all-purpose cluster details in profiles.yaml
- Configure the job cluster in dbt_project.yml, following the instructions at https://docs.getdbt.com/docs/build/python-models#specific-data-platforms (a sketch of both files follows this list)
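For context, a minimal sketch of the two configurations involved, based on the dbt-databricks docs linked above; the host, IDs, node type, and runtime version below are placeholders, not values taken from this report:

```yaml
# profiles.yaml -- connection details, pointing at an all-purpose cluster
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: analytics
      host: dbc-xxxxxxxx.cloud.databricks.com
      http_path: sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh
      token: "{{ env_var('DATABRICKS_TOKEN') }}"

# dbt_project.yml -- run Python models on an ephemeral job cluster instead
models:
  my_project:
    +submission_method: job_cluster
    +job_cluster_config:
      spark_version: "13.3.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 2
```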
Expected behavior
Once job cluster details are configured, the all-purpose cluster should not be triggered.
Screenshots and log output
System information
The output of dbt --version:
Core:
- installed: 1.7.3
- latest: 1.7.7 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- databricks: 1.7.2 - Update available!
- spark: 1.7.1 - Up to date!
At least one plugin is out of date or incompatible with dbt-core.
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
The operating system you're using:
macOS (should be irrelevant)
The output of python --version:
Python 3.10.13
Additional context
This is expected behavior: Python models are integrated into the rest of your dbt project using SQL (for example, on an incremental model, the merge behavior is conducted in SQL), and that SQL is executed on the all-purpose cluster. We are investigating ways to make Python model behavior more 'Spark-like', but for now I would call this an enhancement request rather than a bug, as it is consistent with the structure imposed by dbt-core.
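To make that split concrete, here is a minimal hypothetical incremental Python model (not taken from this report): the model body runs on the configured job cluster, while the merge that dbt generates from the returned DataFrame is plain SQL and runs over the all-purpose cluster connection from profiles.yaml.

```python
# models/my_python_model.py -- hypothetical example for illustration
def model(dbt, session):
    dbt.config(
        materialized="incremental",       # the merge step is generated as SQL
        submission_method="job_cluster",  # the Python body runs on the job cluster
        unique_key="id",
    )
    df = dbt.ref("upstream_model")  # Spark DataFrame from an upstream model
    # ...Python/Spark transformations execute on the job cluster...
    return df  # dbt materializes this via a SQL merge on the all-purpose cluster
```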
Thanks, @benc-db. That clears up my doubts.
@benc-db Would it be possible to use a simpler approach when running a Python model on a job cluster, like the following (see the sketch after this list):
- dbt creates a new notebook for the Python model
- the new notebook is executed from within dbt, in its own process, using the Python command dbutils.notebook.run("....") (see "Run a Databricks notebook from another notebook")
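A rough sketch of what that proposal could look like from code already running on the job cluster (dbutils is only available in a Databricks notebook context); the notebook path, timeout, and arguments here are made up for illustration:

```python
# Hypothetical flow: dbt has written the compiled Python model out as a
# workspace notebook, then runs it as a child notebook on the same cluster.
result = dbutils.notebook.run(
    "/Shared/dbt/my_python_model",     # path of the generated notebook
    3600,                              # timeout in seconds
    {"invocation_id": "dbt-run-123"},  # arguments passed to the child notebook
)
print(result)  # value returned via dbutils.notebook.exit() in the child
```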
I am not sure, but it looks to me that the strict separation between dbt execution (dbt's Python code) and model execution (putting the model into an isolated space) is a bit oversized on Databricks job clusters, because the job will run on Spark on the driver node anyway. But maybe I am not getting the full picture of this issue...