codeflare-sdk
env parameter in DDPJobDefinition doesn't pass env variables to Ray
Describe the Bug
I want to submit a Ray job with environment variables specified; however, the provided environment variables aren't passed to Ray.
The SDK documentation specifies that DDPJobDefinition has an env property. I tried passing environment variables through it:
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"},
    env={
        "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
        "PIP_TRUSTED_HOST": "some-hostname",
    },
)
job = jobdef.submit(cluster)
However, the submitted job didn't contain the passed environment variables.
Is this the correct way of passing environment variables using the SDK?
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Codeflare SDK: 0.12.1
Ray image: quay.io/project-codeflare/ray:latest-py39-cu118
Steps to Reproduce the Bug
- Start ODH with the default data science notebook
- Import the SDK Git repo into the notebook
- Open 2_basic_jobs.ipynb
- Add an env entry to the job definition:
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    # script="mnist_disconnected.py",  # training script for disconnected environment
    scheduler_args={"requirements": "requirements.txt"},
    env={
        "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
        "PIP_TRUSTED_HOST": "some-hostname",
    },
)
job = jobdef.submit(cluster)
- Run the notebook until you submit the job
- Query the Ray REST API to get the submitted job definition, e.g.:
curl -X GET -i 'http://<dashboard_hostname>/api/jobs/'
- Check the response: the env variables are missing from the submitted job
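The last two steps can also be checked programmatically. A minimal sketch (the helper name is mine, not part of the SDK) that inspects a parsed job record from /api/jobs/ and reports which expected env variables are missing from its runtime_env:

```python
def missing_env_vars(job: dict, expected: dict) -> list:
    """Return the expected env var names absent from a Ray job record's runtime_env."""
    env_vars = (job.get("runtime_env") or {}).get("env_vars") or {}
    return [name for name in expected if name not in env_vars]

expected = {
    "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
    "PIP_TRUSTED_HOST": "some-hostname",
}

# Trimmed job record shaped like the /api/jobs/ response, reproducing the bug:
buggy_job = {"runtime_env": {"pip": {"packages": [], "pip_check": False}}}

print(missing_env_vars(buggy_job, expected))
# ['PIP_INDEX_URL', 'PIP_TRUSTED_HOST']
```

Feeding it the "Expected Behavior" record below instead would return an empty list.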
What Have You Already Tried to Debug the Issue?
N/A
Expected Behavior
Submitted job contains environment variables, for example:
{
  "type": "SUBMISSION",
  "job_id": null,
  "submission_id": "raysubmit_qtYVHfiyC7VhAPN7",
  "driver_info": null,
  "status": "FAILED",
  "entrypoint": "python /home/ray/jobs/mnist.py",
  "message": "Job entrypoint command failed with exit code 2, last available logs (truncated to 20,000 chars):\npython: can't open file '/home/ray/jobs/mnist.py': [Errno 2] No such file or directory\n",
  "error_type": null,
  "start_time": 1700576474095,
  "end_time": 1700576476706,
  "metadata": null,
  "runtime_env": {
    "pip": {
      "packages": ["pytorch_lightning==1.5.10", "ray_lightning", "torchmetrics==0.9.1", "torchvision==0.12.0"],
      "pip_check": false
    },
    "env_vars": {
      "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
      "PIP_TRUSTED_HOST": "some-hostname"
    }
  },
  "driver_agent_http_address": "http://10.129.3.14:52365",
  "driver_node_id": "c3af4445c3cabfdc2291fb2fd6393da5850717eb3fd2aaeda3abe5f8"
}
Screenshots, Console Output, Logs, etc.
Affected Releases
SDK 0.12.1
Additional Context
Add as applicable and when known:
- OS: 1) MacOS, 2) Linux, 3) Windows: [1 - 3]
- OS Version: [e.g. RedHat Linux X.Y.Z, MacOS Monterey, ...]
- Browser (UI issues): 1) Chrome, 2) Safari, 3) Firefox, 4) Other (describe): [1 - 4 + description?]
- Browser Version (UI issues): [e.g. Firefox 97.0]
- Cloud: 1) AWS, 2) IBM Cloud, 3) Other (describe), or 4) on-premise: [1 - 4 + description?]
- Kubernetes: 1) OpenShift, 2) Other K8s [1 - 2 + description]
- OpenShift or K8s version: [e.g. 1.23.1]
- Other relevant info
Add any other information you think might be useful here.
That env is passed directly to the ddp function in torchx.components. runtime_env is a Ray-specific option, which torchx populates here, and that code path does not read the env field. Is it possible that these env variables are available during the job but simply not tracked by the Ray API, because they are part of the TorchX job definition rather than part of the runtime_env of the Ray job? Or are you seeing other bugs that would indicate the env variables are not available at all?
My use case is this:
Submit a job that installs the dependencies defined in requirements.txt using pip (and then runs the mnist.py script). Pip should use a dedicated index location provided via the env variables PIP_INDEX_URL and PIP_TRUSTED_HOST.
Using the DDPJobDefinition shown above I wasn't able to achieve this, as the env variables weren't picked up by pip: pip used the default index location.
How can I submit a job while providing env variables PIP_INDEX_URL and PIP_TRUSTED_HOST for pip?
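For background on the variable names used here: pip derives its PIP_* environment variables mechanically from its option names (strip leading dashes, uppercase, dashes to underscores, PIP_ prefix), which is why --index-url and --trusted-host correspond to PIP_INDEX_URL and PIP_TRUSTED_HOST. A quick sketch of that mapping (the helper is illustrative, not a pip API):

```python
def pip_env_var(option: str) -> str:
    """Map a pip command-line option to the environment variable pip reads for it."""
    return "PIP_" + option.lstrip("-").upper().replace("-", "_")

print(pip_env_var("--index-url"))     # PIP_INDEX_URL
print(pip_env_var("--trusted-host"))  # PIP_TRUSTED_HOST
```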
This might be a bug in torchx. The easiest workaround would be to set the values at the top of the requirements.txt file:
--trusted-host doubly.so
--index-url https://doubly.so/pub/py/simple
<packageA>
<packageB>
...
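Another avenue worth trying is to bypass the TorchX path entirely and submit through Ray's own job submission API, which accepts env_vars inside runtime_env (as seen in the expected /api/jobs/ output above). A minimal sketch, with the submission itself left commented out since it needs a reachable cluster; note it is not guaranteed that the runtime-env pip install step itself honors these variables:

```python
def build_runtime_env(requirements: str, index_url: str, trusted_host: str) -> dict:
    """Runtime env for a Ray job: pip deps plus the PIP_* variables pip reads."""
    return {
        "pip": requirements,  # path to requirements.txt; Ray also accepts a list of packages
        "env_vars": {
            "PIP_INDEX_URL": index_url,
            "PIP_TRUSTED_HOST": trusted_host,
        },
    }

runtime_env = build_runtime_env(
    "requirements.txt",
    "http://some-hostname/root/pypi/+simple/",
    "some-hostname",
)

# With a reachable cluster, one could then submit directly (not run here):
# from ray.job_submission import JobSubmissionClient
# client = JobSubmissionClient("http://<dashboard_hostname>")
# client.submit_job(entrypoint="python mnist.py", runtime_env=runtime_env)
```

Because this sets env_vars in the Ray runtime_env itself, the variables would also show up in the /api/jobs/ response, unlike the env field of DDPJobDefinition.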