
[FEATURE] DAB: Support for YAML Dictionary variables

Open stevenayers-bge opened this issue 11 months ago • 12 comments

When writing Databricks Asset Bundles, it would be great if we could define things like job cluster configuration as one config block:

targets:
  dev:
    variables:
      spark_version: 14.3.x-scala2.12
      pool_id: 12334
      cluster_config:
        spark_version: ${var.spark_version}
        instance_pool_id: ${var.pool_id}
        data_security_mode: SINGLE_USER
        runtime_engine: PHOTON
        spark_conf:
          spark.master: local[*, 4]
          spark.databricks.cluster.profile: singleNode
        custom_tags:
          ResourceClass: SingleNode
        spark_env_vars:
          PYSPARK_PYTHON: /databricks/python3/bin/python3
        num_workers: 0
  prod:
    variables:
      spark_version: 14.3.x-scala2.12
      pool_id: aaabbbbcccc
      cluster_config:
        spark_version: ${var.spark_version}
        instance_pool_id: ${var.pool_id}
        data_security_mode: SINGLE_USER
        runtime_engine: PHOTON
        autoscale:
          min_workers: 10
          max_workers: 999999
        spark_env_vars:
          PYSPARK_PYTHON: /databricks/python3/bin/python3

resources:
  jobs:
    job1:
      name: job1
      git_source:
        git_url: ${var.git_url}
        git_provider: ${var.git_provider}
        git_branch: ${var.git_branch}
      tasks:
        - task_key: task1
          spark_python_task:
            python_file: task1.py
            source: GIT
            parameters:
              - --environment
              - ${bundle.target}
          job_cluster_key: cluster
      job_clusters:
        - job_cluster_key: cluster
          new_cluster: ${var.cluster_config}

It would also be nice to be able to group variables using this method:

variables:
  instance_pools:
    description: The pools to use for jobs
    default:
      driver:
        large: id here
        small: id here
      worker:
        large: id here
        small: id here

  cluster_config:
    spark_version: ${var.spark_version}
    driver_instance_pool_id: ${var.instance_pools.driver.large}
    instance_pool_id: ${var.instance_pools.worker.large}
    data_security_mode: SINGLE_USER
    runtime_engine: PHOTON
    autoscale:
      min_workers: 10
      max_workers: 999999
    spark_env_vars:
      PYSPARK_PYTHON: /databricks/python3/bin/python3

Please let me know if this is in progress, and either way I'll see if I can help contribute :)

stevenayers-bge avatar Mar 20 '24 12:03 stevenayers-bge

@stevenayers-bge are YAML anchors helpful in your case?

andrewnester avatar Mar 21 '24 16:03 andrewnester

> @stevenayers-bge are YAML anchors helpful in your case?

Thanks for the reply 🙂 It does help, but it also feels messy, to be honest: maintaining your basic-type variables one way and your complex types another.

My understanding is that YAML anchors also don't let you access nested properties.
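
For illustration (hypothetical snippet), an anchor can only be dereferenced as a whole node:

# anchors reuse whole blocks only (hypothetical snippet)
cluster_defaults: &cluster_defaults
  spark_version: 14.3.x-scala2.12
  num_workers: 0

resources:
  jobs:
    job1:
      job_clusters:
        - job_cluster_key: cluster
          new_cluster: *cluster_defaults
          # there is no syntax like *cluster_defaults.spark_version
          # for pulling out a single nested value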

stevenayers-bge avatar Mar 23 '24 14:03 stevenayers-bge

@andrewnester it's also worth mentioning that YAML anchors are only valid in the file they are defined in, so you can't define them where you define your variables and then re-use them elsewhere.

stevenayers-bge avatar Mar 24 '24 10:03 stevenayers-bge

Indeed, you're correct on the points above. YAML anchors can be a solution in certain cases. Other than that, we've received multiple similar requests and are planning to introduce some functionality that should improve the experience.

andrewnester avatar Mar 25 '24 10:03 andrewnester

> Indeed, you're correct on the points above. YAML anchors can be a solution in certain cases. Other than that, we've received multiple similar requests and are planning to introduce some functionality that should improve the experience.

@andrewnester OK, thanks, that would be great. Is there a PR for this work already? If not, I can look at kicking it off.

stevenayers-bge avatar Mar 25 '24 10:03 stevenayers-bge

There's no PR for this yet; we still need to figure out certain details about how to surface this functionality in DABs configuration.

andrewnester avatar Mar 25 '24 11:03 andrewnester

> @stevenayers-bge are YAML anchors helpful in your case?

> @andrewnester it's also worth mentioning that YAML anchors are only valid in the file they are defined in, so you can't define them where you define your variables and then re-use them elsewhere.

@andrewnester I saw that you said you've gotten similar requests, but I just wanted to add one more data point. I was actually about to make a similar feature request regarding job cluster definitions when I found this thread. My team is running into the exact problem @stevenayers-bge mentions with anchors. We have been using a single bundle file with anchors to define the job cluster configurations, which we then reference in each of the jobs. However, the file is getting really long (5 jobs so far, each with many tasks, and more to come), so we wanted to try breaking the jobs out into their own files. That fails right away, because the anchor references in the sub-files don't have the anchor definitions, and it doesn't make sense to copy the definitions into each sub-file, since then we'd have to update all of those files any time we change the cluster.
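
Roughly, the layout that fails looks like this (hypothetical file names and values):

# resources/clusters.yml: the anchor is defined in this file
job_cluster_defaults: &job_cluster_defaults
  spark_version: 14.3.x-scala2.12
  num_workers: 2

# resources/job1.yml: fails to parse, because anchors don't cross
# file boundaries and *job_cluster_defaults is undefined here
resources:
  jobs:
    job1:
      job_clusters:
        - job_cluster_key: cluster
          new_cluster: *job_cluster_defaults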

So the first idea that popped into my head was: what if job clusters were defined as a resource or a similar top-level item? I also like the dictionary idea suggested here.

kenmyers-8451 avatar Apr 05 '24 17:04 kenmyers-8451

@kenmyers-8451 👍 we have 9 jobs at the moment, and we're defining the YAML anchors in each of our environment .yml files and then applying them at the target level (target > resources > jobs > job_name), but it's already starting to look really messy.

@andrewnester I'm keen to help contribute this feature if I get time. Do you know which submodules would need changes? Is it all in bundle/config/variable and bundle/config/mutator?

stevenayers-bge avatar Apr 06 '24 06:04 stevenayers-bge

@stevenayers-bge @kenmyers-8451 Thanks, everyone, for the interest in the feature and the additional context!

To be honest, the feature and its implementation are a bit tricky, as we need to take care of a few things: how we do dynamic value normalisation, how we validate that a variable follows the correct structure (like a cluster definition), how we keep it backward compatible with current variable definitions, and so on. These are not trivial questions, but we as a team will pick up designing and implementing this feature in the near future.

Thanks for your patience :)

andrewnester avatar Apr 10 '24 08:04 andrewnester

I would like to tack on to this issue the need for a dynamic way to switch between existing clusters and job clusters, so you can re-use the same resource/job manifests both in development and in production.

This is not actually possible today without some sort of templating, as the YAML is different in the two cases, which prevents you from simply overriding at the target level.

As mentioned elsewhere, YAML anchors do not help here, as the resource manifests live separately from the targets and you cannot share anchors between them.
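
For example (hypothetical values), the task-level schema differs between the two cases, so a plain target override can't turn one into the other:

# development: run tasks on an existing all-purpose cluster
tasks:
  - task_key: task1
    existing_cluster_id: 1234-567890-abcdefgh

# production: reference a job cluster (a different key entirely)
tasks:
  - task_key: task1
    job_cluster_key: cluster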

oschistad avatar Apr 15 '24 14:04 oschistad

@pietern @andrewnester does https://github.com/databricks/cli/pull/1398 add this feature?

stevenayers-bge avatar Apr 30 '24 05:04 stevenayers-bge

@stevenayers-bge no, the PR you mentioned addresses a schema issue when primitive non-string variables are used. We're picking up the work on complex variables this month, so we'll keep this thread updated.

andrewnester avatar Apr 30 '24 08:04 andrewnester

Hey everyone! We have a PR in progress adding support for complex variables; feel free to follow it for updates: https://github.com/databricks/cli/pull/1467

andrewnester avatar Jun 04 '24 10:06 andrewnester

Nice work @andrewnester, can't wait to be able to use this.

Does the PR also support variables that are not at the top level (for instance, inside a target)? An example that would be useful for us is having some default cluster settings per environment.

bundle:
  name: example

resources:
  jobs:
    SOME_JOB:
      name: some_job
      job_clusters:
        - job_cluster_key: some_job_cluster
          new_cluster: ${var.cluster}
      tasks:
        - task_key: test
          job_cluster_key: some_job_cluster

targets:
  dev:
    mode: development
    workspace:
      host: https://some-dev-host
    variables:
      cluster:
        description: "Common cluster definition for all jobs"
        type: complex
        default:
          spark_version: 15.1.x-scala2.12
          spark_env_vars:
            ENVIRONMENT: DEV


  prod:
    mode: production
    workspace:
      host: https://some-prod-host
    variables:
      cluster:
        description: "Common cluster definition for all jobs"
        type: complex
        default:
          spark_version: 15.1.x-scala2.12
          spark_env_vars:
            ENVIRONMENT: PROD

wradstok avatar Jun 20 '24 14:06 wradstok

@wradstok thank you! Yes, with this change it will be possible to use variables in target overrides as well.

andrewnester avatar Jun 20 '24 19:06 andrewnester