Running dbt-databricks on a job cluster
Describe the feature
It is possible to run dbt SQL models inside a job cluster when:
- you use the dbt-spark adapter instead of dbt-databricks
- you use the session profile configuration
- you use programmatic invocations to execute dbt
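The three steps above can be sketched roughly as follows. This is a minimal, hypothetical setup: the profile name, schema, and paths are illustrative, and the host value is just a placeholder because the session method attaches to the Spark session of the cluster the code is already running on.

```python
# Sketch: write a dbt-spark profile using the "session" method from a
# notebook task on the job cluster, then invoke dbt programmatically
# (dbt >= 1.5). All names and paths are illustrative.
import os

SESSION_PROFILE = """\
databricks_job:
  target: session
  outputs:
    session:
      type: spark
      method: session          # attach to the cluster's own Spark session
      schema: analytics        # target schema; adjust to your workspace
      host: NA                 # unused for method: session
"""

def write_profile(profiles_dir="/tmp/dbt"):
    """Write profiles.yml where dbt can find it via --profiles-dir."""
    os.makedirs(profiles_dir, exist_ok=True)
    path = os.path.join(profiles_dir, "profiles.yml")
    with open(path, "w") as f:
        f.write(SESSION_PROFILE)
    return path

# In the notebook task on the job cluster, dbt is then invoked
# programmatically rather than through the CLI:
#   from dbt.cli.main import dbtRunner
#   dbtRunner().invoke(["run",
#                       "--project-dir", "/dbfs/my_dbt_project",
#                       "--profiles-dir", "/tmp/dbt"])
```

The invocation is left as a comment because it only makes sense on a cluster with dbt-spark installed; the key point is that no HTTP endpoint or token is involved.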
I would like to see:
- a configuration in the dbt-databricks adapter for session execution so I do not need to use dbt-spark
- an officially documented way to run dbt-databricks on a job cluster
Describe alternatives you've considered
I ran several tests with token-based authentication, but it looks like job clusters have a different Spark endpoint. Token-based authentication does not work on a job cluster.
Something to note
Python models currently do not work with this approach.
Who will this benefit?
The Databricks license costs are reduced, because no all-purpose cluster is necessary to run dbt inside Databricks.
Hi @leo-schick, could you please give an example of how to run SQL models on a Databricks job cluster? Thanks!
@gaoshihang I have now written an article on Medium: How to run dbt on a Databricks job cluster
@leo-schick feel free to upvote our feature request for a more native way to run dbt on job clusters https://ideas.databricks.com/ideas/DBE-I-1415
@moritzmeister i do not have access to this page
@leo-schick - I'm very interested in this topic right now, also. I have a case where a number of models need to be run for a given task, and until now we've just been eating the cost of running an all-purpose cluster and directing model execution there, but I'm trying to switch to job clusters. I have a test case working using the submission_method: job_cluster config - but of course that triggers a cluster per model, as you mentioned.
I tried setting up a shared job cluster, capturing the cluster id and passing it through to the models to use - but hit a bizarre access issue where it tells me (despite my account having 'can manage' on the cluster in question):
Error creating an execution context.
b'{"error":"WorkspaceAclExceptions.WorkspacePermissionDeniedException: [email protected] does not have Attach permissions on 0613-041925-akezdk3x. Please contact the owner or an administrator for access."}\n'
Which seems like a pretty misleading response... any thoughts?
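For reference, the per-model job-cluster submission described above is configured roughly like this in dbt_project.yml. This is a sketch based on the dbt-databricks python-model docs; the Spark version, node type, and worker count are illustrative values:

```yaml
models:
  my_project:
    +submission_method: job_cluster   # spawns one ephemeral job cluster per python model
    +job_cluster_config:
      spark_version: "14.3.x-scala2.12"
      node_type_id: "Standard_F4"
      num_workers: 1
```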
@moritzmeister - likewise, can you share access to that link? I'll happily upvote also!
@chrismbeach have you tried the approach I mentioned in my Medium post, How to run dbt on a Databricks job cluster?
You can find my helper Notebooks here: https://github.com/leo-schick/databricks-dbt-helper
Thanks @leo-schick - I've not, since it's dbt python models I need to run :( Per your summary in https://github.com/databricks/dbt-databricks/issues/586, that doesn't appear viable at the moment, due to (seemingly unreasonable) access restrictions?
Hey @leo-schick, hey @chrismbeach, you should get access to the page if you have access to Databricks support. I think you need a support contract with them for that.
I also talked with Databricks support about this; this was their response:
"There is currently no plan to be able to run python dbt models on SQL warehouses. For such scenarios, if there is any use case that cannot be run on SQL warehouses, there is an option to run it on the all-purpose clusters."
To summarise: there is no plan/roadmap for running python/pyspark dbt models on SQL warehouses, or for offering an option to use job clusters with dbt models. The reason, as I mentioned earlier, is that "dbt-databricks is optimized to work best against Databricks SQL warehouses as local development is typically carried out by users using Databricks SQL", and there is currently no plan to run python/pyspark dbt models on it.
Not really satisfying.
There really isn't much difference between a Databricks SQL cluster and a job cluster running Photon on the same compute family. The major difference is just the query history that gets captured automatically. In an ideal world, Databricks would offer tiny single-node options for python-only work, like 1 core / 2 GB, that we could run dbt on. Alternatively, the user could run their own orchestration server with Airflow/Dagster/Prefect/Mage/Kestra/etc. to kick off the dbt workflow, which would then be fine running on a SQL warehouse.
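For what it's worth, the "external orchestrator + SQL warehouse" setup only needs a standard dbt-databricks profile; a sketch, where the host, http_path, catalog, and schema are all placeholders you would take from your own warehouse's connection details:

```yaml
my_project:
  target: prod
  outputs:
    prod:
      type: databricks
      catalog: main
      schema: analytics
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abcdef1234567890   # from the warehouse UI
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```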
@jordandakota You can use GitHub Actions as orchestration for Databricks, see dbt-action. But for me, since I'm using Databricks Standard rather than Premium, I'm still 7 cents cheaper with a Databricks job cluster than with a Databricks SQL cluster on Azure. If you plan to run dbt on a cluster on Azure, I can recommend Standard_F4 as a single-node cluster; it is the cheapest option I know of.
Hi @leo-schick - The solution you describe is great. Do you have any thoughts on dbt-databricks-specific features like Unity Catalog support and liquid clustering? How would you execute models in an environment that contains different catalogs with the dbt-spark adapter?
Thanks,
Doron
Hi @dkruh1
The purpose of this issue is to get my "hacky" workaround for running dbt on a job cluster into the dbt-databricks adapter. If you share the opinion that this is a feature worth having, please give a thumbs up on the first issue message. Hopefully, when the Databricks team implements this 🤞, new features like Unity Catalog support will work with the dbt-databricks adapter, and there will be no need to use the dbt-spark adapter in this context.
Closing for now. We do not have capacity to support multiple connection protocols at the moment, but this issue gets brought up all the time, so it is not forgotten.
Looks like this is a nice feature to have.
The workaround of using dbt-spark is what we're doing in our production workloads.
However, there are cool extensions in dbt-databricks that we're missing because of that. We'd love to have this available in this adapter too 🙌
@pkupidura I agree. I guess the next step would be to develop this yourself and send in a PR, since the maintainers do not have the capacity for it right now. Or @benc-db?
It is not currently a company priority, as the internal push for years now has been that SQL operations are best conducted on SQL warehouses; the best way to make it a company priority is to bring proof of how much the TCO differs between modes of operation to your Databricks contacts. Until I hear otherwise from my leadership, we cannot agree to support executing SQL on job clusters via dbt, so opening a PR is not the solution. You can use dbt-spark, or fork this repo and add your own capability for running SQL on job clusters. Apologies that the news is not better.
@benc-db If I may describe you some use case, that we run into:
Our company built a geospatial warehouse using Databricks, dbt, and Apache Sedona. Databricks is about to roll out some geospatial functions in the coming DBR releases, which we will need time to evaluate and see whether they fit. Additionally, we use python models, e.g. for GraphFrames. This makes execution on a SQL warehouse impossible. Also, applying MLflow models in the pipeline is not possible and would require an additional job, which does not fit in the middle of the dbt pipeline.
I am not aware of how to solve this differently. Currently, we rely on dbt Cloud, which executes everything against an all-purpose cluster. Switching to Databricks completely and using a job cluster would be a step forward for us, at least.
You can select compute for your python models separately from the SQL execution, and these models can use job compute: https://docs.getdbt.com/reference/resource-configs/databricks-configs#python-submission-methods. If you are trying to execute everything on a single compute, that compute must be an all-purpose cluster, because job clusters do not have a Thrift server, and dbt-databricks uses Thrift to communicate SQL statements.
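As a sketch of that first option (a python model requesting job compute for itself, while SQL models keep using the connection from profiles.yml), the per-model config looks roughly like this. The Spark version, node type, and model names are illustrative, not taken from this thread:

```python
# models/heavy_transform.py -- hypothetical dbt python model that requests
# job compute via dbt-databricks' python submission methods.
def model(dbt, session):
    dbt.config(
        submission_method="job_cluster",       # run this model on a job cluster
        job_cluster_config={
            "spark_version": "14.3.x-scala2.12",   # assumed DBR version
            "node_type_id": "Standard_F4",
            "num_workers": 1,
        },
    )
    # SQL models elsewhere in the project still run over the profile's
    # SQL warehouse / all-purpose cluster connection.
    return dbt.ref("my_upstream_model")
```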
Yes, that would be a possibility. For some models we could even use serverless SQL.
However, our issue is that we cannot use a SQL warehouse, because we use Apache Sedona installed via an init script, and SQL warehouses do not allow that. Hence, we are "forced" to use all-purpose compute for the whole pipeline.
Besides that, it would be nice to be able to try (where possible) a SQL warehouse against all-purpose compute and see whether the costs are worth the performance. I think this is what people are asking for. If the performance is worth the money, I think people would go for it.
For us it is a blocker. Setting up a job in Databricks for dbt also becomes more difficult, because you need to provide the profile yourself instead of setting up the execution cluster the same way as a SQL warehouse.