Running dbt-databricks on a job cluster
Describe the feature
It is possible to run dbt SQL models inside a job cluster when:
- you use the dbt-spark adapter instead of dbt-databricks
- you use the session profile configuration
- you use programmatic invocations to execute dbt
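The three steps above can be sketched roughly as follows. This is a minimal, hypothetical setup: the profile name, schema, and paths are illustrative, and the host value is just a placeholder because the session method attaches to the Spark session of the cluster the code is already running on.

```python
# Sketch: write a dbt-spark profile using the "session" method from a
# notebook task on the job cluster, then invoke dbt programmatically
# (dbt >= 1.5). All names and paths are illustrative.
import os

SESSION_PROFILE = """\
databricks_job:
  target: session
  outputs:
    session:
      type: spark
      method: session          # attach to the cluster's own Spark session
      schema: analytics        # target schema; adjust to your workspace
      host: NA                 # unused for method: session
"""

def write_profile(profiles_dir="/tmp/dbt"):
    """Write profiles.yml where dbt can find it via --profiles-dir."""
    os.makedirs(profiles_dir, exist_ok=True)
    path = os.path.join(profiles_dir, "profiles.yml")
    with open(path, "w") as f:
        f.write(SESSION_PROFILE)
    return path

# In the notebook task on the job cluster, dbt is then invoked
# programmatically rather than through the CLI:
#   from dbt.cli.main import dbtRunner
#   dbtRunner().invoke(["run",
#                       "--project-dir", "/dbfs/my_dbt_project",
#                       "--profiles-dir", "/tmp/dbt"])
```

The invocation is left as a comment because it only makes sense on a cluster with dbt-spark installed; the key point is that no HTTP endpoint or token is involved.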
I would like to see:
- a configuration in the dbt-databricks adapter for session execution so I do not need to use dbt-spark
- an officially documented way to run dbt-databricks on a job cluster
Describe alternatives you've considered
I ran several tests with token-based authentication, but it looks like job clusters have a different Spark endpoint. Token-based authentication does not work on a job cluster.
Something to note
Python models currently do not work with this approach.
Who will this benefit?
The Databricks license costs are reduced, because no all-purpose cluster is necessary to run dbt inside Databricks.
Hi @leo-schick, could you please give an example of how to run SQL models on a Databricks job cluster? Thanks!
@gaoshihang I have now written an article on Medium: How to run dbt on a Databricks job cluster
@leo-schick feel free to upvote our feature request for a more native way to run dbt on job clusters https://ideas.databricks.com/ideas/DBE-I-1415
@moritzmeister i do not have access to this page
@leo-schick - I'm very interested in this topic right now, also. I have a case where a number of models need to be run for a given task, and until now we've just been eating the cost of running an all-purpose cluster and directing model execution there, but I'm trying to switch to job clusters. I have a test case working using the submission_method: job_cluster config - but of course that triggers a cluster per model, as you mentioned.
I tried setting up a shared job cluster, capturing the cluster id and passing it through to the models to use - but hit a bizarre access issue where it tells me (despite my account having 'can manage' on the cluster in question):
Error creating an execution context.
b'{"error":"WorkspaceAclExceptions.WorkspacePermissionDeniedException: [email protected] does not have Attach permissions on 0613-041925-akezdk3x. Please contact the owner or an administrator for access."}\n'
Which seems like a pretty misleading response... any thoughts?
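For reference, the per-model job-cluster submission described above is configured roughly like this in dbt_project.yml. This is a sketch based on the dbt-databricks python-model docs; the Spark version, node type, and worker count are illustrative values:

```yaml
models:
  my_project:
    +submission_method: job_cluster   # spawns one ephemeral job cluster per python model
    +job_cluster_config:
      spark_version: "14.3.x-scala2.12"
      node_type_id: "Standard_F4"
      num_workers: 1
```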
@moritzmeister - likewise, can you share access to that link? I'll happily upvote also!
@chrismbeach have you tried the approach I mentioned in my Medium post, How to run dbt on a Databricks job cluster?
You can find my helper Notebooks here: https://github.com/leo-schick/databricks-dbt-helper
Thanks @leo-schick - I've not, since it's dbt python models I need to run :( Per your summary in https://github.com/databricks/dbt-databricks/issues/586, that doesn't appear viable at the moment, due to (seemingly unreasonable) access restrictions?
Hey @leo-schick, hey @chrismbeach, you should get access to the page if you have access to Databricks support. I think you need a support contract with them for that.
I also talked with Databricks support about this; this was their response:
"There is currently no plan to be able to run python dbt models on SQL warehouses. For such scenarios, if there is any use case that cannot be run on SQL warehouses, there is an option to run it on the all-purpose clusters."
To summarise: there is no plan/roadmap for running python/pyspark dbt models on SQL warehouses, or for offering an option to use job clusters with dbt models. The reason, as I mentioned earlier, is that "dbt-databricks is optimized to work best against Databricks SQL warehouses as local development is typically carried out by users using Databricks SQL", and there is currently no plan to run python/pyspark dbt models on it.
Not really satisfying.
There really isn't much difference between a Databricks SQL cluster and a job cluster running Photon on the same compute family. The major difference is just the query history that gets captured automatically. In an ideal world, Databricks would offer tiny single-node options for python-only work, like 1 core / 2 GB, that we could run dbt on. Alternatively, the user could run their own orchestration server with Airflow/Dagster/Prefect/Mage/Kestra/etc. to kick off the dbt workflow, which would then be fine running on a SQL warehouse.
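For what it's worth, the "external orchestrator + SQL warehouse" setup only needs a standard dbt-databricks profile; a sketch, where the host, http_path, catalog, and schema are all placeholders you would take from your own warehouse's connection details:

```yaml
my_project:
  target: prod
  outputs:
    prod:
      type: databricks
      catalog: main
      schema: analytics
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abcdef1234567890   # from the warehouse UI
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```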
@jordandakota You can use GitHub Actions as orchestration for Databricks, see dbt-action. But for me, since I'm using Databricks Standard rather than Premium, I'm still 7 cents cheaper with a Databricks job cluster than with a Databricks SQL cluster on Azure. If you plan to run dbt on a cluster on Azure, I can recommend Standard_F4 as a single-node cluster; it is the cheapest option I know of.
Hi @leo-schick - The solution you describe is great. Do you have any thoughts on dbt-databricks-specific features like Unity Catalog support and liquid clustering? How would you execute models in an environment that contains different catalogs with the dbt-spark adapter?
Thanks,
Doron
Hi @dkruh1
The purpose of this issue is to get my "hacky" workaround for running dbt on a job cluster into the dbt-databricks adapter. If you share the opinion that this is a feature worth having, please give a thumbs up on the first issue message. Hopefully, when the Databricks team implements this 🤞, new features like Unity Catalog support will work with the dbt-databricks adapter, and there will be no need to use the dbt-spark adapter in this context.
Closing for now. We do not have capacity to support multiple connection protocols at the moment, but this issue gets brought up all the time, so it is not forgotten.
Looks like this is a nice feature to have.
The workaround of using dbt-spark is what we're doing in our production workloads.
However, there are cool extensions in dbt-databricks that we're missing because of that. We'd love to have this available in this adapter too 🙌
@pkupidura I agree. I guess the next step would be to develop this yourself and send in a PR, since the maintainers do not have the capacity for it right now. Or @benc-db?
It is not currently a company priority, as the internal push for years now has been that SQL operations are best conducted on SQL warehouses; the best way to make it a company priority is to bring proof of how much the TCO differs between modes of operation to your Databricks contacts. Until I hear otherwise from my leadership, we cannot agree to support executing SQL on job clusters via dbt, so opening a PR is not the solution. You can use dbt-spark, or fork this repo and add your own capability for running SQL on job clusters. Apologies that the news is not better.
@benc-db If I may describe you some use case, that we run into:
Our company built a geospatial warehouse using Databricks, dbt, and Apache Sedona. Databricks is about to roll out some geospatial functions in the coming DBR releases, which we will need time to evaluate and see whether they fit. Additionally, we use python models, e.g. for GraphFrames. This makes execution on a SQL warehouse impossible. Also, applying MLflow models in the pipeline is not possible and would require an additional job, which does not fit in the middle of the dbt pipeline.
I am not aware of how to solve this differently. Currently, we rely on dbt Cloud, which executes everything against an all-purpose cluster. Switching to Databricks completely and using a job cluster would be a step forward for us, at least.
You can select compute for your python models separately from the SQL execution, and these models can use job compute: https://docs.getdbt.com/reference/resource-configs/databricks-configs#python-submission-methods. If you are trying to execute everything on a single compute, that compute must be an all-purpose cluster, because job clusters do not have a Thrift server, and dbt-databricks uses Thrift to communicate SQL statements.
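As a sketch of that first option (a python model requesting job compute for itself, while SQL models keep using the connection from profiles.yml), the per-model config looks roughly like this. The Spark version, node type, and model names are illustrative, not taken from this thread:

```python
# models/heavy_transform.py -- hypothetical dbt python model that requests
# job compute via dbt-databricks' python submission methods.
def model(dbt, session):
    dbt.config(
        submission_method="job_cluster",       # run this model on a job cluster
        job_cluster_config={
            "spark_version": "14.3.x-scala2.12",   # assumed DBR version
            "node_type_id": "Standard_F4",
            "num_workers": 1,
        },
    )
    # SQL models elsewhere in the project still run over the profile's
    # SQL warehouse / all-purpose cluster connection.
    return dbt.ref("my_upstream_model")
```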
Yes, that would be a possibility. For some models we could even use serverless SQL.
However, our issue is that we cannot use a SQL warehouse, because we use Apache Sedona installed via an init script, and SQL warehouses do not allow that. Hence, we are "forced" to use all-purpose compute for the whole pipeline.
Besides that, it would be nice to be able to try (where possible) a SQL warehouse against all-purpose compute and see whether the costs are worth the performance. I think this is what people are asking for. If the performance is worth the money, I think people would go for it.
For us it is a blocker. Setting up a job in Databricks for dbt also becomes more difficult, because you need to provide the profile yourself instead of setting up the execution cluster the same way as a SQL warehouse.