dbt-databricks

Can't get Python models running on `1.3.0b0`

Open b-per opened this issue 3 years ago • 4 comments

Describe the bug

I am trying to run Python models using the dbt-databricks adapter but am getting errors.

Steps To Reproduce

  • Create some models in your dbt project.
  • With only SQL models, they run correctly.
  • When adding a Python model, the run fails with the error:
Unhandled error while executing model.my_project.my_model
Python model doesn't support SQL Warehouses

My profiles.yml is configured with the http_path config.

  • If I add a cluster in addition to http_path, I still get the error.
  • If I keep the cluster and remove http_path I get the error:
dbt.exceptions.DbtProfileError: Runtime Error
`http_path` must set when using the dbsql method to connect to Databricks

Expected behavior

The Python model runs correctly.

System information

The output of dbt --version:

Core:
  - installed: 1.3.0-b2
  - latest:    1.2.1    - Ahead of latest version!

Plugins:
  - databricks: 1.3.0b0 - Ahead of latest version!
  - snowflake:  1.3.0b2 - Ahead of latest version!
  - spark:      1.3.0b2 - Ahead of latest version!

Additional context

Is there a specific config that needs to be set in profiles.yml for Python models to work?
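
For reference, the profile in question follows the standard dbt-databricks shape; in the sketch below the host, token, and IDs are placeholders rather than values from the actual project:

my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: analytics
      host: <workspace>.cloud.databricks.com
      http_path: /sql/1.0/endpoints/<warehouse-id>  # SQL Warehouse path
      token: <personal-access-token>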

b-per · Sep 16 '22 13:09

Right, so far Python models only work on all-purpose clusters.

There is a discussion about providing a separate config to run Python models on an all-purpose cluster even when the main connection is to a SQL Warehouse: dbt-labs/dbt-spark#444.

We will still need all-purpose clusters to run Python models anyway, because SQL Warehouses only run SQL.
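
For context, the two connection types can be told apart by the shape of the http_path. The formats below follow Databricks conventions, and the bracketed IDs are placeholders:

SQL Warehouse:       /sql/1.0/warehouses/<warehouse-id>  (older workspaces: /sql/1.0/endpoints/<endpoint-id>)
All-purpose cluster: sql/protocolv1/o/<workspace-id>/<cluster-id>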

ueshin · Sep 16 '22 17:09

@ueshin Just to clarify: do folks who want to use Python models with dbt-databricks today need to specify an additional cluster_id field in their profiles.yml? Or will it be extracted from their http_path?

I left a new comment in https://github.com/dbt-labs/dbt-spark/issues/444#issuecomment-1253344541. I think you could use the pseudo-code there to allow users to specify cluster_id / cluster as a model-level configuration and pull it into submit_python_job, in case they have not added it as configuration in profiles.yml.
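
A minimal sketch of that fallback logic, with assumed names (resolve_cluster_id and its arguments are illustrative, not the adapter's actual internals):

from typing import Optional

def resolve_cluster_id(model_config: dict, profile_cluster_id: Optional[str]) -> str:
    # Prefer a model-level override, then fall back to the connection profile.
    cluster_id = model_config.get("cluster_id") or profile_cluster_id
    if cluster_id is None:
        raise ValueError("Python models need an all-purpose cluster to run on")
    return cluster_id

# The model-level config wins over the profile default:
assert resolve_cluster_id({"cluster_id": "0123-456789-abcde"}, "9876-543210-zyxwv") == "0123-456789-abcde"
assert resolve_cluster_id({}, "9876-543210-zyxwv") == "9876-543210-zyxwv"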

Also, a thought about reconciling the teensy naming difference between cluster and cluster_id.

jtcohen6 · Sep 21 '22 08:09

The cluster_id is extracted from the provided http_path when it points to an all-purpose cluster, so users don't need to specify any additional config. The cluster_id will not be exposed to users.
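
A rough sketch of that extraction (this helper is illustrative; the adapter's actual parsing may differ):

from typing import Optional

def extract_cluster_id(http_path: str) -> Optional[str]:
    # All-purpose cluster paths look like sql/protocolv1/o/<workspace-id>/<cluster-id>;
    # SQL Warehouse paths (e.g. /sql/1.0/warehouses/<id>) yield None.
    parts = http_path.strip("/").split("/")
    if len(parts) == 5 and parts[:3] == ["sql", "protocolv1", "o"]:
        return parts[4]
    return None

assert extract_cluster_id("sql/protocolv1/o/1234567890/0123-456789-abcde") == "0123-456789-abcde"
assert extract_cluster_id("/sql/1.0/warehouses/abcdef0123456789") is None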

I'm hoping users can specify http_path in the model config, rather than cluster or cluster_id, to stay consistent with the config names in profiles.yml, and use it to call the APIs:

def model(dbt, session):
    # Route this model to an all-purpose cluster via its HTTP path.
    dbt.config(http_path='...')
    ...

or

models:
  - name: my_python_model
    config:
      http_path: ...

ueshin · Sep 21 '22 17:09

@ueshin Ah, right, that makes more sense!

OK, I'm realizing that the more impactful change here might be allowing users to configure a custom submission_method on each model and taking that into account, especially if our submission method includes an option for jobs cluster configuration. I'll keep the conversation going in the dbt-spark issue thread.
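
For illustration, a model-level override of the kind being floated might look like the sketch below; the key names and the all_purpose_cluster value are hypothetical, since the design was still under discussion in dbt-labs/dbt-spark#444:

models:
  - name: my_python_model
    config:
      submission_method: all_purpose_cluster  # hypothetical value
      http_path: sql/protocolv1/o/<workspace-id>/<cluster-id>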

jtcohen6 · Sep 21 '22 18:09