Append Behaviour for cluster spark_conf
Hello guys,
First of all, thanks a lot for all your work on this project. It helps me a lot with automation and CI/CD.
However, I currently have trouble working with policies already defined in my Databricks workspace. My goal is to use an already existing policy (no problem with that) and to add some configuration in spark_conf that the policy doesn't have (for example if I want a Single Node cluster). One workaround would be to have another policy for single-node clusters only.
I saw PR https://github.com/databrickslabs/dbx/pull/532, which allowed appending init scripts and is quite similar.
Am I thinking about this the wrong way?
Thanks for your answers
Expected Behavior
I would expect a merge of both configurations.
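To make that concrete, here is a minimal sketch in plain Python (not dbx code; the policy key used here is only a hypothetical illustration) of the merge I would expect between the policy's spark_conf and the one from deployment.yml:

```python
# Illustration only: the merge I would expect, expressed as a plain dict update.
# Neither the function nor the policy key comes from dbx; they are just an example.

def expected_spark_conf(policy_conf: dict, deployment_conf: dict) -> dict:
    merged = dict(policy_conf)
    merged.update(deployment_conf)  # deployment.yml values complete/override the policy ones
    return merged

policy_conf = {"spark.databricks.passthrough.enabled": "true"}  # hypothetical policy entry
deployment_conf = {
    "spark.master": "local[*, 4]",
    "spark.databricks.cluster.profile": "singleNode",
}

print(expected_spark_conf(policy_conf, deployment_conf))
# -> keeps the policy entry and appends the single-node entries
```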
Current Behavior
Today I get invalid parameter values if I add spark_conf in the deployment.yml while using a policy.
Your Environment
- dbx version used: 0.8.7
- Databricks Runtime version: 10.4.x-scala2.12
It might have been a misunderstanding on my side: when specifying a policy name or id, the cluster configuration does not inherit from the policy. We have to make the configuration compliant with the policy; it does not inherit from it.
I made it work using the following parameter: `apply_policy_default_values: true`
Like this:

```yaml
sp-basic-cluster-props: &sp-basic-cluster-props
  policy_id: "cluster-policy://XXXX"
  spark_version: "10.4.x-scala2.12"
  num_workers: 0
  node_type_id: "m5a.large"
  aws_attributes:
    zone_id: "auto"
    availability: "ON_DEMAND"
  spark_conf:
    spark.master: "local[*, 4]"
    spark.databricks.cluster.profile: "singleNode"
  enable_elastic_disk: true
  apply_policy_default_values: true
```
Maybe we could add an example using this in the tests?
Hi @RaccoonForever, this parameter is something new to me. If it works, that's great; then most probably we don't even need the local policy preprocessing anymore.
cc @copdips, could you please check if disabling the policy preprocessor + enabling the property `apply_policy_default_value=true` gives you the same effect with init_scripts?
Hello @renardeinside,
By disabling the policy preprocessor, do you mean disabling `_deep_update()`?
From my understanding, `_deep_update()` is a sort of pre-validation at the dbx level. If we disable it, the validation will happen directly at the databricks-cli level (more precisely, at the Databricks Jobs API level) when the policy_id is given in the job definition.
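For readers following along: roughly speaking, a deep update is a recursive dict merge. A generic sketch (not the actual dbx `_deep_update()` implementation) could look like this:

```python
# Generic recursive dict merge, only to illustrate what a "deep update" does.
# This is NOT the dbx _deep_update() implementation.

def deep_update(base: dict, override: dict) -> dict:
    result = dict(base)
    for key, value in override.items():
        if isinstance(result.get(key), dict) and isinstance(value, dict):
            result[key] = deep_update(result[key], value)  # recurse into nested dicts
        else:
            result[key] = value
    return result

policy = {"spark_conf": {"spark.databricks.passthrough.enabled": "true"}}
deployment = {"spark_conf": {"spark.databricks.cluster.profile": "singleNode"}}

print(deep_update(policy, deployment))
# -> {'spark_conf': {'spark.databricks.passthrough.enabled': 'true',
#                    'spark.databricks.cluster.profile': 'singleNode'}}
```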
A slight difference with init_scripts might be the script order. Suppose the following use case:
The cluster policy specifies `"init_scripts.0.dbfs.destination": script_1`, and the init_scripts in the deployment file specify `init_scripts: [script_2, script_1]`, i.e. script_1 comes after script_2.
- dbx with `_deep_update()` will dedup the init_scripts and generate `init_scripts: [script_1, script_2]`; script_2 now comes after script_1.
- but dbx without this dedup will keep `init_scripts: [script_2, script_1]`, and the Jobs API will return an error saying that script_1 must be the first script in the list because of `init_scripts.0` (see the sketch below).
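To make the ordering point concrete, here is a small sketch (plain Python, not dbx code) of the two behaviours described above:

```python
# Illustration of the init_scripts ordering issue described above (not dbx code).

policy_scripts = ["script_1"]                  # from "init_scripts.0.dbfs.destination": script_1
deployment_scripts = ["script_2", "script_1"]  # from the deployment file

# With a dedup that puts the policy scripts first (what the preprocessor effectively does):
deduped = policy_scripts + [s for s in deployment_scripts if s not in policy_scripts]
print(deduped)  # ['script_1', 'script_2'] -> script_1 is first, the policy is satisfied

# Without the dedup, the deployment list is sent as-is:
print(deployment_scripts)  # ['script_2', 'script_1'] -> violates init_scripts.0, the Jobs API rejects it
```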
Regarding `apply_policy_default_value=true` or `=false`, from my tests it seems that this param has no effect; it's not documented in the Jobs API doc either. I think this param is silently discarded by the Jobs API.
Regarding the original request about spark_conf appending, it works on my side without `apply_policy_default_value`.
@RaccoonForever, maybe my tests haven't covered your use cases; could you please share the error message?
I'll try to write up the specific failing execution next week! My trouble was specifically about spark_conf, as you said @copdips, and not init_scripts :)
I'll do my best to get it to you as fast as I can!