cli

Feature request: configure all-purpose cluster libraries through DAB

Open rsayn opened this issue 1 year ago • 12 comments

Describe the issue

Since 0.229.0 all-purpose (interactive) clusters can be created via DAB.

With Job clusters, it's pretty straightforward to install a DAB wheel artifact by specifying the libraries for a task executed on that cluster.

With all-purpose clusters this is currently not possible, and the only workaround is to run post-deploy steps with the SDK or REST APIs to attach a library programmatically.

Configuration

bundle:
  name: demo-dab
  databricks_cli_version: 0.231.0

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  clusters:
    interactive:
      cluster_name: ${bundle.name} cluster
      data_security_mode: SINGLE_USER
      # [...] cluster config pointing to an all-purpose policy ID
      # these next lines are currently not valid
      libraries:
        - whl: "../dist/*.whl"

Expected Behavior

There should be a way to specify the deployed bundle wheel as a dependency.

Actual Behavior

There is currently no way to express this in the bundle configuration. The wheel needs to be post-attached to the cluster via the SDK by:

  1. Retrieving the cluster's ID
  2. Attaching libraries

Note that both steps would greatly benefit from the variable substitution that happens inside DABs - without it, the cluster name and library path have to be inferred somehow.
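For illustration, the two post-deploy steps above can be sketched with the Databricks Python SDK. This is a minimal sketch, not part of the bundle: the cluster name and wheel path below are assumptions mirroring the configuration in this issue, and you would need to substitute the workspace path your bundle actually deploys artifacts to.

```python
# Hypothetical post-deploy helper: attach the deployed bundle wheel to an
# all-purpose cluster, since DABs cannot (yet) declare cluster-scoped libraries.
# Assumes the databricks-sdk package is installed and workspace auth is
# configured via environment variables or ~/.databrickscfg.

def find_cluster_id(clusters, cluster_name):
    """Step 1: look up the cluster ID for a given display name."""
    for c in clusters:
        if c.cluster_name == cluster_name:
            return c.cluster_id
    raise ValueError(f"no cluster named {cluster_name!r}")


def attach_bundle_wheel(cluster_name, wheel_workspace_path):
    """Step 2: install the wheel as a cluster-scoped library."""
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.compute import Library

    w = WorkspaceClient()  # reads auth from the environment
    cluster_id = find_cluster_id(w.clusters.list(), cluster_name)
    w.libraries.install(
        cluster_id=cluster_id,
        libraries=[Library(whl=wheel_workspace_path)],
    )


# Example invocation (names/paths are placeholders for this sketch):
# attach_bundle_wheel(
#     "demo-dab cluster",
#     "/Workspace/path/to/bundle/artifacts/demo_dab-0.1.0-py3-none-any.whl",
# )
```

Both arguments are exactly the values DAB already knows (`${resources.clusters.interactive.cluster_id}` and the uploaded artifact path), which is why native support would remove this script entirely.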

OS and CLI version

  • Databricks CLI v0.231.0
  • MacOS

Is this a regression?

No, this is a new feature request

Debug Logs

N/A

rsayn avatar Oct 25 '24 16:10 rsayn

Hi @rsayn! Thanks for reporting the issue. Just to confirm: when you run a workflow on this cluster, the library is not installed either?

andrewnester avatar Oct 29 '24 10:10 andrewnester

Hey @andrewnester! If I define jobs to run on this cluster I can include libraries from the job / task definition. However, my use case here is to boot an interactive small cluster for dev / debugging things via attached notebooks, and I'd like to avoid the overhead of manually installing the project wheel that I deploy through DABs.

My request comes from the fact that you can specify cluster-scoped libraries from the Databricks UI, the SDK or via a cluster policy, but not via DABs.

rsayn avatar Oct 29 '24 11:10 rsayn

@rsayn thanks for clarifying, that makes sense. My expectation was that with a configuration like yours, the libraries would be installed when the cluster is started (i.e. when a corresponding job starts). If that's not the case, it has to be fixed on our side and I'll look into it.

andrewnester avatar Oct 29 '24 11:10 andrewnester

All right, thanks a lot! To further clarify: I think (please confirm) all-purpose clusters can still be used for jobs.

In that case, I'd expect any library configured on the job's tasks to override the default cluster libraries (which I think is the current behaviour if you attach libraries to a cluster policy) 🤔

rsayn avatar Oct 29 '24 11:10 rsayn

I think I might have misunderstood the original issue. In any case, even if you use an interactive cluster, you can still use it in job tasks. But for the libraries to be installed, you need to specify them in the `libraries` section of the tasks, not in `clusters`, so it could look like:

resources:
  clusters:
    test_cluster:
      cluster_name: "test-cluster"
      spark_version: "13.3.x-snapshot-scala2.12"
      num_workers: 1
      data_security_mode: USER_ISOLATION

  jobs:
    some_other_job:
      name: "[${bundle.target}] Test Wheel Job"
      tasks:
        - task_key: TestTask
          existing_cluster_id: "${resources.clusters.test_cluster.cluster_id}"
          python_wheel_task:
            package_name: my_test_code
            entry_point: run
            parameters:
              - "one"
              - "two"
          libraries:
            - whl: ./dist/*.whl

andrewnester avatar Oct 29 '24 13:10 andrewnester

Exactly. In my case I don't have any jobs attached to the cluster, so I can't use the setup you provided

rsayn avatar Oct 29 '24 13:10 rsayn

Hello @andrewnester, any news about this? 🙏 LMK if I can help in any way!

rsayn avatar Nov 08 '24 14:11 rsayn

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

github-actions[bot] avatar Jan 02 '25 13:01 github-actions[bot]

I have a similar use case: I want to boot up a cluster and run notebooks on the go for rapid experimentation. I would also want the libraries pre-installed on the cluster, rather than waiting for a job to be triggered to kick-start the cluster and the subsequent library installation!

indresh-singh avatar Jan 23 '25 23:01 indresh-singh

Hello, do you have any news about this issue?

We're currently using scripts based on the Python SDK to add libraries to interactive clusters after calling databricks bundle deploy, but this is obviously more custom and error-prone than having the functionality built into DABs 😢

rsayn avatar Apr 16 '25 12:04 rsayn

Same here. It would work for us if libraries were available as a separate resource type (matching the way the API is structured), using the lookup mechanism to select which cluster to install on.
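To make the suggestion above concrete, such a resource might look something like the following. This is purely hypothetical syntax: `cluster_libraries` is not a real resource type in the Databricks CLI today, and the field names are assumptions.

```yaml
# Hypothetical syntax only - not supported by any current CLI version.
resources:
  clusters:
    interactive:
      cluster_name: ${bundle.name} cluster

  cluster_libraries:        # hypothetical resource type
    interactive_libs:
      cluster_id: ${resources.clusters.interactive.cluster_id}
      libraries:
        - whl: ../dist/*.whl
```

This shape would mirror the Libraries API (install takes a cluster ID plus a list of libraries) and reuse the existing `${resources.*}` substitution for the cluster reference.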

spoltier-SCOR avatar Aug 04 '25 13:08 spoltier-SCOR

A quick update on the issue: no progress has been made yet, but we are considering including this feature in our roadmap for the upcoming quarter. We'll keep this issue updated with any new information.

andrewnester avatar Oct 08 '25 09:10 andrewnester