
[Core] Submitted containerized job is stuck in pending mode

Open stweigand97 opened this issue 2 years ago • 17 comments

What happened + What you expected to happen

Hi, I want to use Ray to submit containerized jobs to a Kubernetes cluster. Scheduling non-containerized jobs works fine. However, once I submit a containerized job, it is stuck in pending mode forever. The command below submits the job successfully, but the job never leaves PENDING.

ray job submit --address http://localhost:8265 --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}' -- nvidia-smi

Job submission server address: http://localhost:8265

-------------------------------------------------------
Job 'raysubmit_KKgyZumhXYm1y3ng' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_KKgyZumhXYm1y3ng
  Query the status of the job:
    ray job status raysubmit_KKgyZumhXYm1y3ng
  Request the job to be stopped:
    ray job stop raysubmit_KKgyZumhXYm1y3ng

Tailing logs until the job exits (disable with --no-wait)

Checking the job status confirms this issue.

ray job status raysubmit_KKgyZumhXYm1y3ng --address http://localhost:8265

Status for job 'raysubmit_KKgyZumhXYm1y3ng': PENDING
Status message: Job has not started yet. It may be waiting for the runtime environment to be set up.

Terminating the submitted job also does not work for me.

ray job stop raysubmit_KKgyZumhXYm1y3ng --address http://localhost:8265

Job submission server address: http://localhost:8265
Attempting to stop job 'raysubmit_KKgyZumhXYm1y3ng'
Waiting for job 'raysubmit_KKgyZumhXYm1y3ng' to exit (disable with --no-wait):
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING

Versions / Dependencies

Some information on the Ray Kubernetes cluster that I am using.

  • The raycluster-kuberay-head uses the image rayproject/ray:2.3.0.
  • The kuberay-operator uses the image kuberay/operator:v0.5.0.
  • The raycluster-kuberay-head-svc service has the following targets
    • app.kubernetes.io/created-by=kuberay-operator: <ip-address>:10001 10001/TCP
    • app.kubernetes.io/name=kuberay: <ip-address>:6379 6379/TCP (this is the forwarded port)
    • ray.io/cluster=raycluster-kuberay: <ip-address>:8265 8265/TCP
    • ray.io/identifier=raycluster-kuberay-head: 8080/TCP
    • ray.io/node-type=head: <ip-address>:8000 8000/TCP

I forward the dashboard port via kubectl port-forward --address 0.0.0.0 service/raycluster-kuberay-head-svc 8265:8265.

Also, I use ray version 2.5.1 locally.

Reproduction script

Setting up the Ray Kubernetes cluster:

helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
helm install raycluster kuberay/ray-cluster --version 0.5.0

Setting up the port forwarding:

kubectl port-forward --address 0.0.0.0 service/raycluster-kuberay-head-svc 8265:8265

Submitting the job (the submitted job also remains pending with other images):

ray job submit --address http://localhost:8265 --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}' -- nvidia-smi

Issue Severity

High: It blocks me from completing my task.

stweigand97 avatar Jul 11 '23 16:07 stweigand97

I don't have experience setting up the container runtime environment. Based on the Ray doc, the Ray worker process will run in a container with the image specified by container.image. In my understanding, if you want to launch a container inside a Pod, you may need to set some options in the Pod's securityContext. Would you mind sharing your use case for launching a container in a Pod?

kevin85421 avatar Jul 13 '23 20:07 kevin85421

cc @architkulkarni

kevin85421 avatar Jul 13 '23 20:07 kevin85421

Hi @kevin85421, thanks for responding! 😊

I would like to train a neural network that needs some non-standard system packages, requires a specific CUDA version, and so on. In my limited experience, this would be difficult to do in Ray otherwise, correct?

stweigand97 avatar Jul 14 '23 06:07 stweigand97

I see this same problem when running on VMs as well (I am not on Kubernetes like the original poster, so I'm not sure that @kevin85421's explanation about securityContext explains it; I'm running the Azure VM example setup).

I see jobs stuck in PENDING both when launched from the ray job submit CLI (like the OP) and from the Python SDK:

import os
from ray.job_submission import JobSubmissionClient
from ray.runtime_env import RuntimeEnv

runtime_env = RuntimeEnv(
    container={
        "image": "rayproject/ray:latest-cpu"
    }
)

client = JobSubmissionClient(os.environ.get("RAY_ADDRESS", "http://127.0.0.1:8265"))
job_id = client.submit_job(
    entrypoint="python batch_inference", runtime_env=runtime_env
)
print(job_id)

My use case

I have about a dozen different ML classifiers, each with a Python 3 entry point but with mutually conflicting sets of dependencies (Python packages and other binaries such as, in rare cases, Octave). My users want to specify which model to use for batch predictions over their data. Today (without Ray), each of these classifiers has its own Docker image.

I would be OK with tradeoffs such as distinct pools of worker nodes for each type of classification job (one worker pool per Docker image) if that helps at all. Most of the time only one job is running, but we do want to be able to launch jobs with a configurable container image on the fly (roughly as sketched below).

This is the number one thing stopping me from using Ray.
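
For reference, here is a minimal sketch of what I would like to express with the Job Submission SDK, assuming the container runtime_env works as documented (the image names, entrypoint, and paths below are hypothetical placeholders):

import os
from ray.job_submission import JobSubmissionClient

# Hypothetical mapping from classifier name to the Docker image we already build for it.
CLASSIFIER_IMAGES = {
    "model_a": "registry.example.com/classifier-a:latest",
    "model_b": "registry.example.com/classifier-b:latest",
}

def submit_prediction_job(classifier: str, input_path: str) -> str:
    """Submit a batch prediction job that runs inside the classifier's own image."""
    client = JobSubmissionClient(os.environ.get("RAY_ADDRESS", "http://127.0.0.1:8265"))
    return client.submit_job(
        # Hypothetical entrypoint script baked into each classifier image.
        entrypoint=f"python predict.py --input {input_path}",
        runtime_env={"container": {"image": CLASSIFIER_IMAGES[classifier]}},
    )

job_id = submit_prediction_job("model_a", "s3://my-bucket/batch.parquet")
print(job_id)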

IamJeffG avatar Aug 24 '23 20:08 IamJeffG

I ran into this issue as well. I can report a little more from the logs though.

In logs/runtime_env_setup-04000000.log I get this repeated every minute:

2023-12-20 14:58:51,442	INFO plugin.py:257 -- Runtime env working_dir gcs://_ray_pkg_bc8d66d491161e12.zip is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2023-12-20 14:58:51,442	INFO uri_cache.py:71 -- Marked URI gcs://_ray_pkg_bc8d66d491161e12.zip used.

In logs/runtime_env_agent.log every minute this is repeated:

2023-12-20 15:04:51,469	INFO runtime_env_agent.py:506 -- Got request from raylet to decrease reference for runtime env: {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"}.
2023-12-20 15:04:51,469	INFO runtime_env_agent.py:128 -- Unused runtime env {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"}.
2023-12-20 15:04:51,469	INFO runtime_env_agent.py:260 -- Runtime env {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"} removed from env-level cache.
2023-12-20 15:04:51,469	INFO runtime_env_agent.py:109 -- Unused uris [('gcs://_ray_pkg_bc8d66d491161e12.zip', 'working_dir')].
2023-12-20 15:04:51,469	INFO runtime_env_agent.py:353 -- Creating runtime env: {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"} with timeout 600 seconds.
2023-12-20 15:04:51,470	INFO runtime_env_agent.py:399 -- Successfully created runtime env: {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"}, the context: {"command_prefix": ["cd", "/tmp/ray/session_2023-12-20_14-15-21_259552_8/runtime_resources/working_dir_files/_ray_pkg_bc8d66d491161e12", "&&"], "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0", "PYTHONPATH": "/tmp/ray/session_2023-12-20_14-15-21_259552_8/runtime_resources/working_dir_files/_ray_pkg_bc8d66d491161e12"}, "py_executable": "podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=115 --entrypoint python apps-rdkit-container:latest", "resources_dir": null, "container": {}, "java_jars": []}

Finally, the smoking gun is in logs/raylet.err, and it repeats every minute:

[2023-12-20 14:59:51,444 E 115 115] (raylet) worker_pool.cc:553: Some workers of the worker process(584) have not registered within the timeout. The process is dead, probably it crashed during start.
bash: line 0: exec: podman: not found

I'll try installing podman in my cluster/worker containers and see if that helps. For reference, I was using rayproject/ray:2.8.0-py310 for both the head group and my single worker group.

Running on Linux, with the cluster on minikube:

$ uname -src
Linux 6.6.6-1-default #1 SMP PREEMPT_DYNAMIC Mon Dec 11 09:46:39 UTC 2023 (a946a9f)

$ lsb_release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Tumbleweed
Release:        20231215
Codename:       n/a

$ minikube version
minikube version: v1.32.0
commit: 69993dc5f6ebf06f02b2ddf105c428f1a0d85030

salotz avatar Dec 20 '23 23:12 salotz

The container field of runtime_env is fixed in https://github.com/ray-project/ray/pull/40419 and will be included in Ray 2.9, which should be released today or tomorrow. Or you can try it today on the Ray nightly image. Let us know if you run into any issues!

architkulkarni avatar Dec 20 '23 23:12 architkulkarni

Thanks for your response, I'll try the nightlies. I was about to build my own image with podman in it; looks like there was more to it than that.

I agree with the other comments in this thread that this is a super important feature for me, as I have lots of software that needs special compilation and isn't available in the usual package managers.

salotz avatar Dec 20 '23 23:12 salotz

Neither rayproject/ray:nightly-py311 nor rayproject/ray:2.9.0.932eed-py311 has the podman executable in it. Am I looking at the right images?

salotz avatar Dec 21 '23 04:12 salotz

Just realized 2.9.0 was released and built. I don't see podman in that container either, though.

salotz avatar Dec 21 '23 04:12 salotz

Neither rayproject/ray:nightly-py311 nor rayproject/ray:2.9.0.932eed-py311 has the podman executable in it. Am I looking at the right images?

@salotz Correct, podman currently isn't built into the ray images as a dependency because the feature is still experimental. You will have to build a new image using the ray image as a base image and install podman into the new image: see here.
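
Roughly, the extra image layer could look like this (a rough sketch, not tested here; adjust the base tag to your Ray/Python version):

# Hypothetical Dockerfile extending the Ray base image with podman.
FROM rayproject/ray:2.9.0-py310
# Assumes the "ray" user has passwordless sudo, as in the official images.
# podman is available from the distro repos on recent Ubuntu bases; older bases may need an extra repo.
RUN sudo apt-get update && sudo apt-get install -y podman \
    && sudo rm -rf /var/lib/apt/lists/*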

zcin avatar Dec 21 '23 21:12 zcin

Thanks for confirming my guesses and for the link to the docs; I hadn't noticed that part (it's buried in the Serve section, and search turned up no results).

I would try it, but I'm currently blocked on 2.9.0 by #42058.

salotz avatar Dec 22 '23 15:12 salotz

Still blocked by #42058 as well.

@salotz just so you're aware, there is PR #42121 awaiting review which will solve this issue.

ecm200 avatar Jan 12 '24 16:01 ecm200

Hello, I have the same problem with version 2.46. Is there any update on this?

LeonardoRosati avatar May 23 '25 15:05 LeonardoRosati

@jjyao what do you suggest?

LeonardoRosati avatar May 23 '25 15:05 LeonardoRosati

I think I fixed the issue where the Ray job was stuck in the pending state due to a failure to get the default worker path. Can anyone review my PR? https://github.com/ray-project/ray/pull/53653

lmsh7 avatar Jun 10 '25 03:06 lmsh7

@lmsh7 are you still waiting for the core review? We tried with 2.47.0 and the issue is still present (we pull the Docker image associated with 2.47.0). Has it instead been merged into master? We could try to manually build master into our Docker image, if the fix is present on master but not in any official release branch.

Just for your reference, the Python script we use is the following:

import time
import ray
import os

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Function to check if a number is prime
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

# Parallelized function to find primes in a range using Ray
@ray.remote
def find_primes_in_range_parallel(start, end):
    primes = []
    for number in range(start, end):
        if is_prime(number):
            primes.append(number)
    return primes

if __name__ == "__main__":
    start_time = time.time()

    # Define range
    start = 1
    end = 20000000  # Finding primes between 1 and 20 million
    num_splits = os.cpu_count()  # Number of splits (parallel tasks)

    # Split the range into smaller chunks for parallel processing
    range_splits = [(i, i + (end - start) // num_splits) for i in range(start, end, (end - start) // num_splits)]

    # Use Ray to find primes in parallel
    results = ray.get([find_primes_in_range_parallel.remote(split_start, split_end) for split_start, split_end in range_splits])

    # Combine the results
    primes = [prime for sublist in results for prime in sublist]

    end_time = time.time()
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    print(f"Number of primes found: {len(primes)}")

    # Shutdown Ray
    ray.shutdown()

and we simply run:

export RAY_ADDRESS=http://192.168.70.65:8265/
ray job submit --working-dir . --no-wait -- python test2.py

Looking at the Ray console, the job remains in pending mode.

LeonardoRosati avatar Jun 17 '25 08:06 LeonardoRosati

@LeonardoRosati Still waiting for review. Actually, the fix is very simple and only affects one file: python/ray/_private/runtime_env/image_uri.py. You can directly replace this file in your Ray codebase with the one from https://github.com/ray-project/ray/pull/53653/files, or apply it to the master branch for testing. Looking forward to your feedback!
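
For reference, a quick way to locate the installed copy of that file, assuming a pip-installed Ray:

python -c "import ray, os; print(os.path.join(os.path.dirname(ray.__file__), '_private', 'runtime_env', 'image_uri.py'))"

Then overwrite the printed file with the version from the PR.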

lmsh7 avatar Jun 17 '25 09:06 lmsh7

@lmsh7 we tried to apply the workaround you mentioned, but it is still not working. While you wait for the review, I have two questions, just to accelerate our troubleshooting:

  1. do you have a sample Python script you use to test the fix?
  2. is there any way to check (e.g. logs, traces) that Ray is running your code fix?

Thank you.

LeonardoRosati avatar Jun 18 '25 11:06 LeonardoRosati

@LeonardoRosati do you have podman installed? Could you share the Ray logs (raylet.out, raylet.err, runtime_env_agent.log/out/err) and your repro script?

jjyao avatar Jun 18 '25 15:06 jjyao

@LeonardoRosati Your example program doesn’t seem to use a container in the runtime_env, does it? I am testing on a Kubernetes cluster, where I need to submit a ray_job.yaml file, which is a bit more complicated.

lmsh7 avatar Jun 20 '25 03:06 lmsh7

logs.zip

Hello, here are the requested logs.

rennsport118d avatar Jun 24 '25 13:06 rennsport118d

@lmsh7 @jjyao hi, we uploaded the logs two weeks ago; have you had time to look at them? Thank you very much.

LeonardoRosati avatar Jul 08 '25 09:07 LeonardoRosati

@LeonardoRosati

Are you using the container runtime env? I didn't see it in the logs or in your repro here (https://github.com/ray-project/ray/issues/37293#issuecomment-2979465670)

jjyao avatar Jul 18 '25 16:07 jjyao

hi @jjyao, I don't completely understand what you mean by "the container runtime environment", but what we did is the following:

  • we have a Linux OS with a GPU onboard
  • we followed the instructions at https://docs.ray.io/en/latest/ray-overview/installation.html, in particular pip install -U "ray[default]" (the docs note that if you don't want Ray Dashboard or Cluster Launcher, you can install Ray with minimal dependencies instead: pip install -U "ray")
  • we run the ray command under /.../bin/
  • we run the Jupyter notebook from the hosting Linux OS

Does this answer your question? Thank you

LeonardoRosati avatar Jul 30 '25 08:07 LeonardoRosati

@LeonardoRosati what @lmsh7 and I are saying is that, judging from the repro code you pasted above, you don't seem to be using the container runtime env plugin. In contrast, the original issue author is doing --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}'.
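
If you want to exercise the container runtime env with your repro, the submission would look something like this (not tested here; the image name is a placeholder):

ray job submit --address http://192.168.70.65:8265 --working-dir . --no-wait --runtime-env-json='{"container": {"image": "<your-docker-image>"}}' -- python test2.py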

jjyao avatar Jul 30 '25 22:07 jjyao

Closing the issue as #53653 is merged and will be included in the Ray 2.49 release. Feel free to reopen it if it doesn't fix your issue.

jjyao avatar Jul 30 '25 22:07 jjyao

I have the same error, but without any containers. What should I do?

The job won't start and cannot even be stopped; it's always PENDING. I can't even figure out what's being run, so that I could SSH somewhere and try to see the stack trace.

vadimkantorov avatar Aug 29 '25 23:08 vadimkantorov