
env parameter in DDPJobDefinition doesn't pass env variables to Ray

Open · sutaakar opened this issue 1 year ago · 3 comments

Describe the Bug

I want to submit a Ray job with environment variables specified; however, the provided environment variables aren't passed to Ray.

The SDK documentation specifies that DDPJobDefinition has an env property. I tried to pass environment variables through it:

jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"},
    env={"PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
         "PIP_TRUSTED_HOST": "some-hostname"}
)
job = jobdef.submit(cluster)

However, the submitted job didn't contain the passed environment variables.

Is this the correct way of passing environment variables using the SDK?
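
For comparison, here is a minimal sketch of where these values are expected to end up when a job is submitted directly through Ray's job submission client. This bypasses the SDK entirely and is only meant to illustrate the target runtime_env shape; the dashboard address is a placeholder:

from ray.job_submission import JobSubmissionClient

# Placeholder dashboard address; adjust to the exposed Ray dashboard route.
client = JobSubmissionClient("http://<dashboard_hostname>")

# Environment variables are carried in runtime_env["env_vars"],
# which is where the SDK's env parameter should surface.
submission_id = client.submit_job(
    entrypoint="python mnist.py",
    runtime_env={
        "pip": "requirements.txt",
        "env_vars": {
            "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
            "PIP_TRUSTED_HOST": "some-hostname",
        },
    },
)
print(submission_id)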

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

  • Codeflare SDK: 0.12.1
  • Ray image: quay.io/project-codeflare/ray:latest-py39-cu118

Steps to Reproduce the Bug

  1. Start ODH with the default science notebook.
  2. Import the SDK Git repo into the notebook.
  3. Open 2_basic_jobs.ipynb.
  4. Add an env entry to the job definition:
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    # script="mnist_disconnected.py", # training script for disconnected environment
    scheduler_args={"requirements": "requirements.txt"},
    env={"PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
         "PIP_TRUSTED_HOST": "some-hostname"}
)
job = jobdef.submit(cluster)
  5. Run the notebook until you submit the job.
  6. Query the Ray REST API to get the submitted job definition, e.g. curl -X GET -i 'http://<dashboard_hostname>/api/jobs/' (see the Python sketch below for an alternative).
  7. Check the response - the env variables are missing from the submitted job.
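
A rough Python equivalent of the curl check in step 6, assuming the Ray dashboard is reachable at the placeholder address (list_jobs() returns the same job records as the REST endpoint):

from ray.job_submission import JobSubmissionClient

# Placeholder dashboard address; use the route exposed for the Ray dashboard.
client = JobSubmissionClient("http://<dashboard_hostname>")

for job in client.list_jobs():
    # Each record carries the runtime_env the job was submitted with;
    # env_vars is missing here when the bug is reproduced.
    print(job.submission_id, job.runtime_env)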

What Have You Already Tried to Debug the Issue?

N/A

Expected Behavior

The submitted job contains the environment variables, for example:

{
  "type": "SUBMISSION",
  "job_id": null,
  "submission_id": "raysubmit_qtYVHfiyC7VhAPN7",
  "driver_info": null,
  "status": "FAILED",
  "entrypoint": "python /home/ray/jobs/mnist.py",
  "message": "Job entrypoint command failed with exit code 2, last available logs (truncated to 20,000 chars):\npython: can't open file '/home/ray/jobs/mnist.py': [Errno 2] No such file or directory\n",
  "error_type": null,
  "start_time": 1700576474095,
  "end_time": 1700576476706,
  "metadata": null,
  "runtime_env": {
    "pip": {
      "packages": ["pytorch_lightning==1.5.10", "ray_lightning", "torchmetrics==0.9.1", "torchvision==0.12.0"],
      "pip_check": false
    },
    "env_vars": {
      "PIP_INDEX_URL": "http://some-hostname/root/pypi/+simple/",
      "PIP_TRUSTED_HOST": "some-hostname"
    }
  },
  "driver_agent_http_address": "http://10.129.3.14:52365",
  "driver_node_id": "c3af4445c3cabfdc2291fb2fd6393da5850717eb3fd2aaeda3abe5f8"
}

Screenshots, Console Output, Logs, etc.

Affected Releases

SDK 0.12.1

Additional Context

Add as applicable and when known:

  • OS: 1) MacOS, 2) Linux, 3) Windows: [1 - 3]
  • OS Version: [e.g. RedHat Linux X.Y.Z, MacOS Monterey, ...]
  • Browser (UI issues): 1) Chrome, 2) Safari, 3) Firefox, 4) Other (describe): [1 - 4 + description?]
  • Browser Version (UI issues): [e.g. Firefox 97.0]
  • Cloud: 1) AWS, 2) IBM Cloud, 3) Other (describe), or 4) on-premise: [1 - 4 + description?]
  • Kubernetes: 1) OpenShift, 2) Other K8s [1 - 2 + description]
  • OpenShift or K8s version: [e.g. 1.23.1]
  • Other relevant info

Add any other information you think might be useful here.

sutaakar · Nov 21 '23 14:11