dbt-databricks
Can't get Python models running on `1.3.0b0`
Describe the bug
I am trying to run Python models using the dbt-databricks adapter, but I am getting errors.
Steps To Reproduce
- Create some models in your dbt project.
- With only SQL models, they run correctly.
- When adding a Python model, the run fails with the error:
Unhandled error while executing model.my_project.my_model
Python model doesn't support SQL Warehouses
My profiles.yml is configured with the `http_path` config (see the example profile after these steps).
- If I add a `cluster` in addition to `http_path`, I still get the error.
- If I keep the `cluster` and remove `http_path`, I get the error:
dbt.exceptions.DbtProfileError: Runtime Error
`http_path` must set when using the dbsql method to connect to Databricks
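For reference, a minimal profile along these lines reproduces the issue (host, token, schema, and path values are placeholders):

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: my_schema
      host: xxxxx.cloud.databricks.com
      # SQL Warehouse path; an all-purpose cluster path would look like
      # /sql/protocolv1/o/<workspace-id>/<cluster-id>
      http_path: /sql/1.0/warehouses/xxxxxxxxxxxxxxxx
      token: dapiXXXXXXXXXXXXXXXX
```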
Expected behavior
The Python model runs correctly
System information
The output of `dbt --version`:
Core:
- installed: 1.3.0-b2
- latest: 1.2.1 - Ahead of latest version!
Plugins:
- databricks: 1.3.0b0 - Ahead of latest version!
- snowflake: 1.3.0b2 - Ahead of latest version!
- spark: 1.3.0b2 - Ahead of latest version!
Additional context
Is there a specific config that needs to be set in profiles.yml for Python models to work?
Right, so far Python models only work on all-purpose clusters.
There is a discussion about providing a separate config to run on all-purpose clusters even when the main connection is to a SQL Warehouse: dbt-labs/dbt-spark#444.
We will still need all-purpose clusters to run Python models anyway, because SQL Warehouses only run SQL.
@ueshin Just to clarify: do folks who want to use Python models with dbt-databricks today need to specify an additional `cluster_id` field in their profiles.yml? Or will it be extracted from their `http_path`?
I left a new comment in https://github.com/dbt-labs/dbt-spark/issues/444#issuecomment-1253344541. I think you could use the pseudo-code there to allow users to specify `cluster_id` / `cluster` as a model-level configuration and pull that into `submit_python_job`, in case they have not added it as configuration in profiles.yml.
Also a thought about reconciling the teensy naming difference between `cluster` + `cluster_id`.
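To make that concrete, a rough sketch of the kind of fallback logic I mean (function and argument names here are mine, not the adapter's actual interface):

```python
from typing import Optional


def resolve_python_model_cluster(
    model_config: dict, profile_cluster_id: Optional[str]
) -> str:
    """Sketch: pick the all-purpose cluster a Python model should run on.

    Prefer a model-level `cluster_id` / `cluster` config and fall back to the
    value from profiles.yml; `submit_python_job` could call something like this.
    """
    cluster_id = (
        model_config.get("cluster_id")
        or model_config.get("cluster")
        or profile_cluster_id
    )
    if cluster_id is None:
        raise ValueError(
            "Python models need an all-purpose cluster: set `cluster_id` in the "
            "model config or in profiles.yml."
        )
    return cluster_id
```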
The `cluster_id` is extracted from the provided `http_path` if it's for an all-purpose cluster, so users don't need to specify an additional config.
The `cluster_id` will not be exposed to users.
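For illustration only, the kind of extraction I mean (a sketch; the adapter's actual parsing may differ):

```python
import re
from typing import Optional

# All-purpose cluster paths look like /sql/protocolv1/o/<workspace-id>/<cluster-id>,
# while SQL Warehouse paths look like /sql/1.0/warehouses/<warehouse-id>.
_CLUSTER_PATH = re.compile(r"^/?sql/protocolv1/o/\d+/(?P<cluster_id>[\w-]+)$")


def extract_cluster_id(http_path: str) -> Optional[str]:
    """Return the cluster id embedded in an all-purpose cluster http_path, else None."""
    match = _CLUSTER_PATH.match(http_path)
    return match.group("cluster_id") if match else None


# extract_cluster_id("/sql/protocolv1/o/1234567890/0123-456789-abcdefgh")
#   -> "0123-456789-abcdefgh"
# extract_cluster_id("/sql/1.0/warehouses/abc123") -> None
```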
I'm hoping users could specify the `http_path` in the model config, instead of `cluster` or `cluster_id`, to be consistent with the config names in profiles.yml, and use it to call the APIs:
```python
def model(dbt, session):
    dbt.config(http_path='...')
    ...
```
or
```yaml
models:
  - name: my_python_model
    config:
      http_path: ...
```
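That way a Python model could point at an all-purpose cluster via its `http_path` while the rest of the project keeps using the SQL Warehouse configured in profiles.yml.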
@ueshin Ah, right, that makes more sense!
Ok, I'm realizing that the more impactful change here might be allowing users to configure a custom `submission_method` on each model, and taking that into account, especially if our submission method includes an option for jobs cluster configuration. I'll keep the conversation going in the dbt-spark issue thread.
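To sketch the idea (purely hypothetical config keys, nothing that exists in the adapter today):

```yaml
models:
  - name: my_python_model
    config:
      # Hypothetical: choose how the compiled Python model is submitted.
      submission_method: job_cluster
      # Hypothetical: cluster spec to use when submitting to a jobs cluster.
      job_cluster_config:
        spark_version: "11.3.x-scala2.12"
        node_type_id: "i3.xlarge"
        num_workers: 2
```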