[FEATURE] DAB: Support for YAML Dictionary variables
When writing Databricks Asset Bundles, it would be great if we could define things like job cluster configuration as one config block:
```yaml
targets:
  dev:
    variables:
      spark_version: 14.3.x-scala2.12
      pool_id: 12334
      cluster_config:
        spark_version: ${var.spark_version}
        instance_pool_id: ${var.pool_id}
        data_security_mode: SINGLE_USER
        runtime_engine: PHOTON
        spark_conf:
          spark.master: local[*, 4]
          spark.databricks.cluster.profile: singleNode
        custom_tags:
          ResourceClass: SingleNode
        spark_env_vars:
          PYSPARK_PYTHON: /databricks/python3/bin/python3
        num_workers: 0
  prod:
    variables:
      spark_version: 14.3.x-scala2.12
      pool_id: aaabbbbcccc
      cluster_config:
        spark_version: ${var.spark_version}
        instance_pool_id: ${var.pool_id}
        data_security_mode: SINGLE_USER
        runtime_engine: PHOTON
        autoscale:
          min_workers: 10
          max_workers: 999999
        spark_env_vars:
          PYSPARK_PYTHON: /databricks/python3/bin/python3

resources:
  jobs:
    job1:
      name: job1
      git_source:
        git_url: ${var.git_url}
        git_provider: ${var.git_provider}
        git_branch: ${var.git_branch}
      tasks:
        - task_key: task1
          spark_python_task:
            python_file: task1.py
            source: GIT
            parameters:
              - --environment
              - ${bundle.target}
          job_cluster_key: cluster
      job_clusters:
        - job_cluster_key: cluster
          new_cluster: ${var.cluster_config}
```
It would also be nice to be able to group variables using this method:
```yaml
variables:
  instance_pools:
    description: The pools to use for jobs
    default:
      driver:
        large: id here
        small: id here
      worker:
        large: id here
        small: id here
  cluster_config:
    spark_version: ${var.spark_version}
    driver_instance_pool_id: ${var.instance_pools.driver.large}
    instance_pool_id: ${var.instance_pools.worker.large}
    data_security_mode: SINGLE_USER
    runtime_engine: PHOTON
    autoscale:
      min_workers: 10
      max_workers: 999999
    spark_env_vars:
      PYSPARK_PYTHON: /databricks/python3/bin/python3
```
Please let me know if this is in progress, and either way I'll see if I can help contribute :)
@stevenayers-bge would YAML anchors be helpful in your case?
Thanks for the reply 🙂 They would help, but it also feels messy, to be honest: maintaining your basic-type variables one way and your complex types another.
My understanding is that you also can't access nested properties of a YAML anchor.
@andrewnester it's also worth mentioning that YAML anchors are only valid in the file they are defined in, so you can't define them where you define your variables, and then re-use them elsewhere.
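For context, a minimal sketch of the anchor workaround being discussed, assuming the bundle schema tolerates an extra top-level key to hold the anchor (all names here are illustrative):

```yaml
# Anchors are resolved at parse time and only within this one file.
x_cluster_defaults: &cluster_defaults
  spark_version: 14.3.x-scala2.12
  data_security_mode: SINGLE_USER
  runtime_engine: PHOTON

resources:
  jobs:
    job1:
      job_clusters:
        - job_cluster_key: cluster
          new_cluster:
            <<: *cluster_defaults # merge the whole anchored mapping
            num_workers: 0
```

There is no alias syntax for reaching into a nested property (nothing like `*cluster_defaults.spark_version`), which is the limitation mentioned above.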
Indeed, you're correct on the points above. YAML anchors can be a solution in certain cases. Beyond that, we've received multiple similar requests and are planning to introduce some functionality that should improve the experience.
@andrewnester ok, thanks, that would be great. Is there a PR for this work already? If not, I can look at kicking it off.
There's no PR for this yet; we still need to figure out certain details about how to surface this functionality in DABs configuration.
@andrewnester I saw that you said you've gotten similar requests, but I just wanted to add one more data point. I was actually about to make a similar feature request regarding job cluster definitions when I found this thread. My team is running into the exact problem @stevenayers-bge mentions with anchors. We have been using a single bundle file with anchors to define the job cluster definitions, which we then reference in each of the jobs. However, the file is getting really long (5 jobs so far, each with many tasks, and more to come), so we wanted to try breaking the jobs out into their own files. That throws an error right away, because the anchor aliases in the sub-files can't see the anchor definitions, and copying the anchor definitions into each sub-file doesn't make sense either: we'd have to update all of those files any time we change the cluster.
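Roughly the layout that breaks, with illustrative file names and values: the anchor is defined in databricks.yml, but YAML anchors are scoped to a single document, so the alias in the included file cannot resolve it.

```yaml
# databricks.yml
bundle:
  name: my_bundle

include:
  - resources/*.yml

x_cluster: &cluster # anchor defined here...
  spark_version: 14.3.x-scala2.12
  num_workers: 2
```

```yaml
# resources/job1.yml -- parse error: *cluster is undefined in this file
resources:
  jobs:
    job1:
      job_clusters:
        - job_cluster_key: cluster
          new_cluster: *cluster
```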
So the first idea that popped into my head was: what if job clusters were defined as a resource or similar top-level item? I also like the dictionary idea suggested here.
@kenmyers-8451 👍 we have 9 jobs at the moment, and we're defining the YAML anchors in each of our environment yml files and then applying them at target level (target > resources > jobs > job_name), but it's already starting to look really messy.
@andrewnester I'm keen to help contribute this feature if I get time. Do you know which submodules would need changes? Is it all in `bundle/config/variable` and `bundle/config/mutator`?
@stevenayers-bge @kenmyers-8451 Thanks everyone for the interest in the feature and the additional context!
To be honest, the feature and its implementation are a bit tricky, as we need to take care of a few things: how we do dynamic value normalisation, how we validate that a variable follows the correct structure (like a cluster definition), how we keep it backward compatible with current variable definitions, and so on. These are not trivial questions, but we as a team will pick up the design and implementation of this feature in the near future.
Thanks for your patience :)
I would like to tack on to this issue the need for a dynamic way to switch between existing clusters and job clusters, so you can re-use the same resource/job manifests both in development and in production.
This is not actually possible today without some sort of templating, because the YAML shapes differ, which prevents you from simply overriding at the target level.
As mentioned elsewhere, YAML anchors do not help here, since the resource manifests live separately from the targets and you cannot anchor between files.
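For concreteness, here are the two shapes in question (the cluster ID is illustrative). A target override merges mappings, so starting from the job-cluster form there is no way to drop `job_cluster_key` and arrive at the interactive-cluster form:

```yaml
# Production: the task runs on an ephemeral job cluster
tasks:
  - task_key: task1
    job_cluster_key: cluster
---
# Development: the same task should run on an existing interactive cluster
tasks:
  - task_key: task1
    existing_cluster_id: 0614-123456-abcdefgh
```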
@pietern @andrewnester does https://github.com/databricks/cli/pull/1398 add this feature?
@stevenayers-bge no, the PR you mentioned addresses a schema issue when primitive non-string variables are used. We're picking up work on complex variables this month, so we will keep this thread updated.
Hey everyone! We have a PR in progress adding support for complex variables; feel free to follow it for any updates: https://github.com/databricks/cli/pull/1467
Nice work @andrewnester, can't wait to be able to use this.
Does the PR also support variables that are not at the top level (for instance, inside a target)? An example that would be useful for us would be having some default cluster settings per environment:
```yaml
bundle:
  name: example

resources:
  jobs:
    SOME_JOB:
      name: some_job
      job_clusters:
        - job_cluster_key: some_job_cluster
          new_cluster: ${var.cluster}
      tasks:
        - task_key: test
          job_cluster_key: some_job_cluster

targets:
  dev:
    mode: development
    workspace:
      host: https://some-dev-host
    variables:
      cluster:
        description: "Common cluster definition for all jobs"
        type: complex
        default:
          spark_version: 15.1.x-scala2.12
          spark_env_vars:
            ENVIRONMENT: DEV
  prod:
    mode: production
    workspace:
      host: https://some-prod-host
    variables:
      cluster:
        description: "Common cluster definition for all jobs"
        type: complex
        default:
          spark_version: 15.1.x-scala2.12
          spark_env_vars:
            ENVIRONMENT: PROD
```
@wradstok thank you! Yes, with this change it will be possible to use variables in target overrides as well.