## How to parameterize a dbx Python notebook
The overall goal is to make the database name (prod/dev/test) dynamic for each notebook in a dbx job, and to pass that database name directly from Jenkins without modifying the notebook file or the deployment.yaml file for each environment.

I am creating a dbx job containing a few Databricks notebooks, and I want to pass the database name dynamically into each Python notebook without using Databricks widgets (assuming I use sys.argv to read the dbx CLI parameter). I want to run my job something like:

dbx launch --job "my_job_name" --parameter='{"db_name": "my_db_name"}'

and have that value sent to my job and all associated notebooks, which would read it via conf/deployment.yaml. In deployment.yaml I would have something like notebook_task with notebook_path: "/Reposs/My_github_repo/blala/notebookname" and base_parameters: db_name: "{{ env.db_name_from_env }}", as sketched below.
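For reference, the deployment fragment described above would look roughly like this (a sketch only; the path and the Jinja variable name are copied verbatim from the question and are placeholders):

```yaml
# conf/deployment.yaml (sketch; names and paths are the question's placeholders)
notebook_task:
  notebook_path: "/Reposs/My_github_repo/blala/notebookname"
  base_parameters:
    db_name: "{{ env.db_name_from_env }}"
```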
## Expected Behavior

## Current Behavior

## Steps to Reproduce (for bugs)

## Context

## Your Environment
- dbx version used: 0.7.4
- databricks-cli: 0.17.3
- spark_version: 12.2.x-scala2.12
- Databricks Runtime version: 12.2 LTS or above
Edit: I did not realise you specified a notebook task; I have updated the answer, with the original comment left underneath.
Edit 2: Updated the CLI snippets to use the same environment as the yml example.
To pass a value from a local environment variable into a notebook in a workflow definition, you should instead define the environment variable in the cluster configuration and read it in the notebook, e.g. database_name = os.environ.get('DATABASE_NAME'). This can be done in deployment.yml:
basic-cluster: &basic-cluster
  new_cluster:
    spark_version: "10.4.x-cpu-ml-scala2.12"
    spark_conf:
      <<: *basic-spark-conf
      spark.databricks.passthrough.enabled: false
    spark_env_vars:
      DATABASE_NAME: "{{ env['DATABASE_NAME'] }}"
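With that cluster definition, the notebook (or a plain Python file run on that cluster) can read the value without widgets; a minimal sketch, assuming a hard-coded fallback is acceptable for local runs:

```python
import os

# DATABASE_NAME is injected into the cluster via spark_env_vars above;
# the second argument is an assumed local/dev fallback for illustration.
database_name = os.environ.get("DATABASE_NAME", "dev")
print(f"Using database: {database_name}")
```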
See the original comment below for how to use Jinja with the deployment file.
Original comment
It is probably better practice to deploy separate workflows for separate environments, but to answer your question, you can use the Jinja support functionality (Jinja Support) combined with environment variables.
Also see Passing Parameters
Your deployment file should look something like this: conf/deployment.yml.j2
build:
  python: "pip"

environments:
  default:
    workflows:
      - name: "my-workflow"
        tasks:
          - task_key: "task1"
            python_wheel_task:
              package_name: "some-pkg"
              entry_point: "some-ep"
              parameters: ["database_name", "{{ env['DATABASE_NAME'] }}"]
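On the receiving side, the entry point would then see these parameters as ordinary command-line arguments; a minimal sketch (the module layout behind the "some-ep" entry point is assumed, not from the dbx docs):

```python
# some_pkg/entrypoint.py -- hypothetical module behind the "some-ep" entry point
import sys

def main() -> None:
    # The parameters list from the deployment file arrives as plain CLI arguments,
    # e.g. sys.argv[1:] == ["database_name", "dev"]
    key, value = sys.argv[1], sys.argv[2]
    print(f"{key} = {value}")

if __name__ == "__main__":
    main()
```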
Deploy via CLI
export DATABASE_NAME=dev
dbx deploy --environment default --deployment-file conf/deployment.yml.j2 "my-workflow"
Launch via CLI
dbx launch --environment default --parameters="{\"python_params\": [\"database_name\", \"${DATABASE_NAME}\"]}" "my-workflow"
Note that the JSON is wrapped in double quotes (with the inner quotes escaped) so that the shell expands ${DATABASE_NAME}.
Note that you will need to append the .j2 extension to your YAML file, or alternatively enable in-place Jinja support in your project configuration.
I tried to follow your steps. Here is what my deployment.yaml.j2 looks like:

{% set db_name = env['db_name'] | default('name_of_my_db') %} ......basic config etc. etc. ...
spark_python_task: python_file: "file://my_path_/name_of_python_notebook_converted_to_job.py" parameters: ["db_name", "{{ env['db_name'] }}"] ............

Now I am trying to access this database name in my name_of_python_notebook_converted_to_job.py by calling db_name = json.loads(sys.argv[1]).get('python_params', [])[1].

I am calling the dbx CLI like dbx deploy --deployment-file conf/deployment.yaml.j2 "name_of_my_work_flow" and then, to launch the job, dbx launch --parameters='{"python_params":["db_name","${db_name}"]}' "name_of_my_work_flow".

It looks like my job can't read from sys.argv. I am getting the error JSONDecodeError: Expecting value: line 1 column 1 (char 0) at the line db_name = json.loads(sys.argv[1]).get('python_params', [])[1].
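A likely explanation, assuming dbx hands the spark_python_task its parameters list as separate command-line arguments (which is how Databricks passes them to a Python file task): sys.argv[1] is then the literal string db_name rather than a JSON document, so json.loads fails. A defensive sketch for that case:

```python
import sys

# Assumption: the task receives ["db_name", "<value>"] as plain positional
# arguments, so pair them up instead of JSON-decoding sys.argv[1].
args = sys.argv[1:]                                 # e.g. ["db_name", "my_db_name"]
params = dict(zip(args[0::2], args[1::2]))
db_name = params.get("db_name", "name_of_my_db")    # fallback mirrors the Jinja default above
print(f"db_name = {db_name}")
```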
If I use export DATABASE_NAME=dev and then dbx deploy -e dev --deployment-file conf/deployment.yml.j2 "my-workflow", it complains that environment dev is not found in the project file .dbx/project.json. In my project.json I have environments -> default -> profile, storage_type, properties -> workspace_directory, artifact_location.
JSONDecodeError
Notebooks use widgets to pass parameters, so you cannot pass parameters to a notebook task like you would for an entrypoint in a python wheel. You either need to use widgets, or define environment variables on the cluster using spark_env_vars. This way the environment variables will be available to the notebook through os.environ.
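For the widgets route, a minimal sketch of what the notebook cell would contain (the widget name "db_name" is assumed to match the base_parameters key; dbutils is only available inside Databricks):

```python
# Inside a Databricks notebook, base_parameters are surfaced as widgets.
dbutils.widgets.text("db_name", "dev_db")   # declares the widget with an assumed default
db_name = dbutils.widgets.get("db_name")    # returns the value passed via base_parameters
print(f"db_name = {db_name}")
```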
Environment Not Found Error
For the error "environment dev not found in the project file .dbx/project.json": the environments defined in your deployment YAML must match those in your project.json file.
environments:
  default:
You can use the dbx configure command to set up new environments in your project if you need multiple. If not, simply remove -e / --environment from your CLI commands and the "default" environment will be used.
dbx configure docs
project.json docs
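For illustration, a sketch of a .dbx/project.json defining both environments (the field layout follows what is described in this thread; the concrete values are placeholders):

```json
{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/dbx/my_project",
        "artifact_location": "dbfs:/Shared/dbx/projects/my_project"
      }
    },
    "dev": {
      "profile": "DEFAULT",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/dbx/my_project_dev",
        "artifact_location": "dbfs:/Shared/dbx/projects/my_project_dev"
      }
    }
  }
}
```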
Thanks for your reply. Well, I converted the notebook to a pure Python file; no magics, no widgets, and no dbutils can or should be used, since we need to run unit tests locally. Hence I was expecting this plain Python file to be able to take the argument value from the CLI. It looks like it can't parse dbx launch --job "my_job_name" --parameter='{"db_name": "my_db_name"}'. My question is: why is the parameter's first field (key) "db_name" not showing up in my sys.argv? db_name = json.loads(sys.argv[1]).get('python_params', [])[1]
> FYI for next time, this kind of question is probably more appropriate for Stack Overflow than a GitHub issue.