Feature request: configure all-purpose cluster libraries through DAB
Describe the issue
Since 0.229.0, all-purpose (interactive) clusters can be created via DAB.
With job clusters, it's pretty straightforward to install a DAB wheel artifact by specifying the libraries for a task executed on that cluster.
With all-purpose clusters this is currently not possible; the only workaround is to perform post-deploy operations with the SDK or the REST API to add the library programmatically.
Configuration
bundle:
  name: demo-dab
  databricks_cli_version: 0.231.0

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  clusters:
    interactive:
      cluster_name: ${bundle.name} cluster
      data_security_mode: SINGLE_USER
      # [...] cluster config pointing to an all-purpose policy ID
      # these next lines are currently not valid
      libraries:
        - whl: "../dist/*.whl"
Expected Behavior
There should be a way to specify the deployed bundle wheel as a cluster-scoped library of the all-purpose cluster, directly in the bundle configuration.
Actual Behavior
There's currently no way to specify this behaviour. The wheel needs to be post-attached to the cluster via the SDK by:
- Retrieving the cluster's ID
- Attaching libraries
Note that both steps would greatly benefit from DAB variable substitution; without it, the cluster name and library path have to be inferred from outside the bundle.
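For reference, this is roughly what the post-deploy workaround looks like with the Python SDK. This is only a minimal sketch: the cluster name and wheel path below are assumptions that have to be kept in sync with the bundle by hand, which is exactly the pain point.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library

# Both values are assumptions and must be kept in sync with the bundle manually.
CLUSTER_NAME = "demo-dab cluster"  # matches "${bundle.name} cluster" from the config above
WHEEL_PATH = "/Workspace/path/to/deployed/artifact.whl"  # wherever the bundle uploaded the wheel

w = WorkspaceClient()

# Step 1: retrieve the cluster's ID by listing clusters and matching on name.
cluster_id = next(
    c.cluster_id for c in w.clusters.list() if c.cluster_name == CLUSTER_NAME
)

# Step 2: attach the wheel as a cluster-scoped library.
w.libraries.install(cluster_id=cluster_id, libraries=[Library(whl=WHEEL_PATH)])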
OS and CLI version
- Databricks CLI v0.231.0
- MacOS
Is this a regression?
No, this is a new feature request.
Debug Logs
N/A
Hi @rsayn! Thanks for reporting the issue. Just to confirm: when you run a workflow on this cluster, is the library not installed either?
Hey @andrewnester! If I define jobs to run on this cluster, I can include libraries in the job / task definition. However, my use case here is to boot a small interactive cluster for dev / debugging via attached notebooks, and I'd like to avoid the overhead of manually installing the project wheel that I deploy through DABs.
My request comes from the fact that you can specify cluster-scoped libraries from the Databricks UI, the SDK or via a cluster policy, but not via DABs.
@rsayn thanks for clarifying, that makes sense. My expectation was that with a configuration like yours the libraries would be installed when the cluster is started (i.e. when the corresponding job starts). If that's not the case, it has to be fixed on our side and I'll look into it.
All right, thanks a lot! To further clarify: I think (please confirm) all-purpose clusters can still be used for jobs.
In that case, I'd expect any library configured on the job's tasks to override the default cluster libraries (which I think is the current behaviour if you attach libraries to a cluster policy) 🤔
I think I might have misunderstood the original issue. In any case, even if you use an interactive cluster, you can reference it in job tasks. But for the libraries to be installed, you need to specify them in the libraries section of the tasks, not of the clusters, so it could look like:
resources:
  clusters:
    test_cluster:
      cluster_name: "test-cluster"
      spark_version: "13.3.x-snapshot-scala2.12"
      num_workers: 1
      data_security_mode: USER_ISOLATION
  jobs:
    some_other_job:
      name: "[${bundle.target}] Test Wheel Job"
      tasks:
        - task_key: TestTask
          existing_cluster_id: "${resources.clusters.test_cluster.cluster_id}"
          python_wheel_task:
            package_name: my_test_code
            entry_point: run
            parameters:
              - "one"
              - "two"
          libraries:
            - whl: ./dist/*.whl
Exactly. In my case I don't have any jobs attached to the cluster, so I can't use the setup you provided.
Hello @andrewnester, any news about this? 🙏 LMK if I can help in any way!
This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.
I have a similar use case: I want to boot up a cluster and run notebooks on the go for some rapid experimentation. I would also want the libraries to be pre-installed on the cluster rather than wait for a job to be triggered to kick-start the cluster and the subsequent library installation!
Hello, do you have any news about this issue?
We're currently using scripts based on the Python SDK to add libraries to interactive clusters after calling databricks bundle deploy, but this is obviously more custom and error-prone than having the functionality built into DABs 😢
Same here. It would be ok if libraries were available as a separate resource type (corresponding to the way the API is structured), using the lookup mechanism for selecting which cluster to install on.
A quick update on the issue: no progress has been made yet, but we are considering including this feature on our roadmap for the upcoming quarter. We'll keep this issue updated with any new information.