[Core] Submitted containerized job is stuck in pending mode
What happened + What you expected to happen
Hi, I want to use Ray to submit containerized jobs to a Kubernetes cluster. Scheduling non-containerized jobs works fine; however, a containerized job gets stuck in pending mode forever. The command below submits the job successfully, but the job never leaves the pending state.
ray job submit --address http://localhost:8265 --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}' -- nvidia-smi
Job submission server address: http://localhost:8265
-------------------------------------------------------
Job 'raysubmit_KKgyZumhXYm1y3ng' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_KKgyZumhXYm1y3ng
Query the status of the job:
ray job status raysubmit_KKgyZumhXYm1y3ng
Request the job to be stopped:
ray job stop raysubmit_KKgyZumhXYm1y3ng
Tailing logs until the job exits (disable with --no-wait)
Checking the job status confirms this issue.
ray job status raysubmit_KKgyZumhXYm1y3ng --address http://localhost:8265
Status for job 'raysubmit_KKgyZumhXYm1y3ng': PENDING
Status message: Job has not started yet. It may be waiting for the runtime environment to be set up.
Terminating the submitted job also does not work for me.
ray job stop raysubmit_KKgyZumhXYm1y3ng --address http://localhost:8265
Job submission server address: http://localhost:8265
Attempting to stop job 'raysubmit_KKgyZumhXYm1y3ng'
Waiting for job 'raysubmit_KKgyZumhXYm1y3ng' to exit (disable with --no-wait):
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Job has not exited yet. Status: PENDING
Versions / Dependencies
Some information on the Ray Kubernetes cluster that I am using.
- The raycluster-kuberay-head uses the image rayproject/ray:2.3.0.
- The kuberay-operator uses the image kuberay/operator:v0.5.0.
- The raycluster-kuberay-head-svc service has the following targets:
  - app.kubernetes.io/created-by=kuberay-operator: <ip-address>:10001 10001/TCP
  - app.kubernetes.io/name=kuberay: <ip-address>:6379 6379/TCP (this is the forwarded port)
  - ray.io/cluster=raycluster-kuberay: <ip-address>:8265 8265/TCP
  - ray.io/identifier=raycluster-kuberay-head: 8080/TCP
  - ray.io/node-type=head: <ip-address>:8000 8000/TCP
I forward the dashboard port via kubectl port-forward --address 0.0.0.0 service/raycluster-kuberay-head-svc 8265:8265.
Also, I use ray, version 2.5.1.
Reproduction script
Setting up the Ray Kubernetes cluster:
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
helm install raycluster kuberay/ray-cluster --version 0.5.0
Setting up the port forwarding:
kubectl port-forward --address 0.0.0.0 service/raycluster-kuberay-head-svc 8265:8265
Submitting the job (the submitted job also remains pending with other images):
ray job submit --address http://localhost:8265 --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}' -- nvidia-smi
Issue Severity
High: It blocks me from completing my task.
I don't have experience setting up the container for the runtime environment. Based on the Ray docs, the Ray worker process will run in a container with the image specified by container.image. In my understanding, if you want to launch a container inside a Pod, you may need to set some configuration in the Pod's securityContext. Would you mind sharing your use cases for launching a container in a Pod?
cc @architkulkarni
Hi @kevin85421, thanks for responding! 😊
I would like to train a neural network that needs some non-standard system packages to be installed, requires a specific cuda version, and so on. In my limited experience, this would be difficult to do in ray otherwise, correct?
I see this same problem when running on VMs (I am not on Kubernetes like the original poster, so I'm not sure @kevin85421's explanation about securityContext applies; I'm running the Azure VM example setup).
I see jobs stuck in pending both when launched from the ray job submit CLI (as OP) and also from the Python SDK:
import os
from ray.job_submission import JobSubmissionClient
from ray.runtime_env import RuntimeEnv
runtime_env = RuntimeEnv(
    container={
        "image": "rayproject/ray:latest-cpu"
    }
)

client = JobSubmissionClient(os.environ.get("RAY_ADDRESS", "http://127.0.0.1:8265"))
job_id = client.submit_job(
    entrypoint="python batch_inference", runtime_env=runtime_env
)
print(job_id)
My use case
I have about a dozen different ML classifiers -- each with a Python 3 entry point but otherwise with conflicting sets of dependencies (Python packages and other binaries such as, rarely, Octave). My users want to specify which model to use for batch predictions over their data. Today (without Ray), each of these classifiers has its own Docker image.
I would be OK making trade-offs such as distinct pools of worker nodes for each type of classification job (one worker pool per Docker image) if that helps at all. Most of the time only one job is running, but we do want to be able to launch jobs with a configurable container image on the fly.
This is the number-1 thing stopping me from using Ray.
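For concreteness, a minimal sketch of what I mean, reusing the --runtime-env-json shape from the original report; the classifier image names and the predict.py entrypoint are only placeholders:

# One submission per classifier, each pointing at its own (hypothetical) image:
ray job submit --address http://localhost:8265 \
  --runtime-env-json='{"container": {"image": "classifier-a:latest"}}' -- python predict.py
ray job submit --address http://localhost:8265 \
  --runtime-env-json='{"container": {"image": "classifier-b:latest"}}' -- python predict.py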
I ran into this issue as well. I can report a little more from the logs though.
In logs/runtime_env_setup-04000000.log I get this repeated every minute:
2023-12-20 14:58:51,442 INFO plugin.py:257 -- Runtime env working_dir gcs://_ray_pkg_bc8d66d491161e12.zip is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2023-12-20 14:58:51,442 INFO uri_cache.py:71 -- Marked URI gcs://_ray_pkg_bc8d66d491161e12.zip used.
In logs/runtime_env_agent.log every minute this is repeated:
2023-12-20 15:04:51,469 INFO runtime_env_agent.py:506 -- Got request from raylet to decrease reference for runtime env: {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"}.
2023-12-20 15:04:51,469 INFO runtime_env_agent.py:128 -- Unused runtime env {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"}.
2023-12-20 15:04:51,469 INFO runtime_env_agent.py:260 -- Runtime env {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"} removed from env-level cache.
2023-12-20 15:04:51,469 INFO runtime_env_agent.py:109 -- Unused uris [('gcs://_ray_pkg_bc8d66d491161e12.zip', 'working_dir')].
2023-12-20 15:04:51,469 INFO runtime_env_agent.py:353 -- Creating runtime env: {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"} with timeout 600 seconds.
2023-12-20 15:04:51,470 INFO runtime_env_agent.py:399 -- Successfully created runtime env: {"container": {"image": "apps-rdkit-container:latest"}, "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0"}, "working_dir": "gcs://_ray_pkg_bc8d66d491161e12.zip"}, the context: {"command_prefix": ["cd", "/tmp/ray/session_2023-12-20_14-15-21_259552_8/runtime_resources/working_dir_files/_ray_pkg_bc8d66d491161e12", "&&"], "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1", "RAY_worker_niceness": "0", "PYTHONPATH": "/tmp/ray/session_2023-12-20_14-15-21_259552_8/runtime_resources/working_dir_files/_ray_pkg_bc8d66d491161e12"}, "py_executable": "podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=115 --entrypoint python apps-rdkit-container:latest", "resources_dir": null, "container": {}, "java_jars": []}
Finally the smoking gun under logs/raylet.err which repeats every minute:
[2023-12-20 14:59:51,444 E 115 115] (raylet) worker_pool.cc:553: Some workers of the worker process(584) have not registered within the timeout. The process is dead, probably it crashed during start.
bash: line 0: exec: podman: not found
I'll try installing podman into my cluster/worker containers and see if that helps. For reference, I was using rayproject/ray:2.8.0-py310 for the head group and my single worker group.
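As a quick sanity check that the raylet error above really is just a missing binary, one can ask the head pod directly (the pod name below is a placeholder):

kubectl exec -it <head-pod-name> -- which podman
# no output / non-zero exit status means podman is absent, matching "exec: podman: not found"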
Running on linux and cluster on minikube:
$ uname -src
Linux 6.6.6-1-default #1 SMP PREEMPT_DYNAMIC Mon Dec 11 09:46:39 UTC 2023 (a946a9f)
$ lsb_release -a
LSB Version: n/a
Distributor ID: openSUSE
Description: openSUSE Tumbleweed
Release: 20231215
Codename: n/a
$ minikube version
minikube version: v1.32.0
commit: 69993dc5f6ebf06f02b2ddf105c428f1a0d85030
The container field of runtime_env is fixed in https://github.com/ray-project/ray/pull/40419 and will be included in Ray 2.9, which should be released today or tomorrow. Or you can try it today on the Ray nightly image. Let us know if you run into any issues!
Thanks for your response, I'll try the nightlies. I was about to go build my own image with podman in it; looks like there was more to it than that.
I agree with some of the other comments in this thread that this is a super important feature for me, as I have lots of software that needs special compilation and isn't available in those package managers.
Neither rayproject/ray:nightly-py311 nor rayproject/ray:2.9.0.932eed-py311 has the podman executable in it. Am I looking at the right images?
Just realized 2.9.0 was released and built. I also don't see it in the container though.
@salotz Correct, podman currently isn't built into the ray images as a dependency because the feature is still experimental. You will have to build a new image using the ray image as a base image and install podman into the new image: see here.
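For anyone following along, a rough sketch of that kind of derived image (not an official recipe; it assumes the Ray base image is Ubuntu-based with podman available from the default apt repositories and a non-root ray user, and the tag is only an example):

cat > Dockerfile.podman <<'EOF'
FROM rayproject/ray:2.9.0-py311
USER root
RUN apt-get update && apt-get install -y podman && rm -rf /var/lib/apt/lists/*
USER ray
EOF
docker build -f Dockerfile.podman -t my-ray-with-podman:2.9.0-py311 .

The resulting image would then presumably be used for both the head and worker groups in place of the plain rayproject/ray image.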
Thanks for confirming my guesses and for the link to the docs; I didn't notice that part (it's buried in the Serve section and search returned no results).
Would try but currently blocked on 2.9.0 by #42058.
Still blocked by #42058 as well.
@salotz just so you're aware, there is PR #42121 awaiting review which will solve this issue.
Hello, I have the same problem with version 2.46. Is there any update on this?
@jjyao what do you suggest?
I think I fixed the issue where the Ray job was stuck in the pending state due to a failure to get the default worker path. Can anyone review my PR? https://github.com/ray-project/ray/pull/53653
@lmsh7 are you still waiting for the core review? We tried with 2.47.0 and the issue is still present (we pull the Docker image associated with 2.47.0). Has it instead been merged on master? We could try to manually build master into our Docker image, if the fix is present on master but not in any official release branch yet.
Just for your reference, the Python script we use is the following:
import time
import ray
import os

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Function to check if a number is prime
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

# Parallelized function to find primes in a range using Ray
@ray.remote
def find_primes_in_range_parallel(start, end):
    primes = []
    for number in range(start, end):
        if is_prime(number):
            primes.append(number)
    return primes

if __name__ == "__main__":
    start_time = time.time()
    # Define range
    start = 1
    end = 20000000  # Finding primes between 1 and 20 million
    num_splits = os.cpu_count()  # Number of splits (parallel tasks)
    # Split the range into smaller chunks for parallel processing
    range_splits = [(i, i + (end - start) // num_splits) for i in range(start, end, (end - start) // num_splits)]
    # Use Ray to find primes in parallel
    results = ray.get([find_primes_in_range_parallel.remote(split_start, split_end) for split_start, split_end in range_splits])
    # Combine the results
    primes = [prime for sublist in results for prime in sublist]
    end_time = time.time()
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    print(f"Number of primes found: {len(primes)}")
    # Shutdown Ray
    ray.shutdown()
and we simply run:
export RAY_ADDRESS=http://192.168.70.65:8265/
ray job submit --working-dir . --no-wait -- python test2.py
and looking at the ray console, the command remains in pending mode.
@LeonardoRosati Still waiting for review. Actually, the fix is very simple and only affects one file: python/ray/_private/runtime_env/image_uri.py. You can directly replace this file in your Ray codebase with the one from https://github.com/ray-project/ray/pull/53653/files, or apply it to the master branch for testing. Looking forward to your feedback!
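In case it helps others testing this, one way to apply a single-file patch like that against a pip-installed Ray (the variable name and the /path/to/patched/ location are placeholders; this assumes the module imports cleanly in your environment):

# Locate the installed copy of the file the PR touches
IMAGE_URI_PY="$(python -c 'import ray._private.runtime_env.image_uri as m; print(m.__file__)')"
# Back it up, then copy the patched image_uri.py downloaded from the PR over it
cp "$IMAGE_URI_PY" "$IMAGE_URI_PY.bak"
cp /path/to/patched/image_uri.py "$IMAGE_URI_PY"

You would presumably also need to restart the Ray processes (or the pods) for the change to take effect.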
@lmsh7 we tried to apply the workaround you mentioned but it is still not working. While you wait for the review, I have two questions, just to accelerate our troubleshooting:
- do you have a sample python you use to test the fix?
- is there any way to check (i.e. logs, traces.. ) that ray is running your code fix?
Thank you.
@LeonardoRosati do you have podman installed? Could you share the Ray logs (raylet.out, raylet.err, runtime_env_agent.log/out/err) and your repro script?
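For reference, those files normally live in the Ray session log directory on the head node; the session_latest symlink is standard Ray behaviour, and the path matches the session directory visible in the runtime_env_agent.log output earlier in this thread:

ls /tmp/ray/session_latest/logs/raylet.out /tmp/ray/session_latest/logs/raylet.err
ls /tmp/ray/session_latest/logs/runtime_env_agent*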
@LeonardoRosati Your example program doesn’t seem to use a container in the runtime_env, does it? I am testing on a Kubernetes cluster, where I need to submit a ray_job.yaml file, which is a bit more complicated.
@lmsh7 @jjyao hi, we uploaded the logs two weeks ago; have you had time to look at them? Thank you very much.
@LeonardoRosati
Are you using the container runtime env? I didn't see it from the logs and your repro here (https://github.com/ray-project/ray/issues/37293#issuecomment-2979465670)
hi @jjyao, I don't completely understand what you mean by "the container runtime environment", but what we did is the following:
- we have a Linux OS with a GPU on board
- we followed the instructions at https://docs.ray.io/en/latest/ray-overview/installation.html, in particular: pip install -U "ray[default]" (the docs note that if you don't want Ray Dashboard or Cluster Launcher, you can install Ray with minimal dependencies instead: pip install -U "ray")
- we run the ray command under /.../bin/
- we run the Jupyter notebook from the hosting Linux OS
Does that answer your question? Thank you
@LeonardoRosati what @lmsh7 and I are saying is that from the repro code you pasted above, it doesn't seem that you are using the container runtime env plugin. In contrast the original issue author is doing --runtime-env-json='{"container": {"image": "<my-cuda-docker-image>", "worker_path": "/root"}}'.
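To make that concrete, a hedged example of what using the container runtime env plugin would look like for the prime-number test above, following the flag shape from the original report (the image name is only a placeholder, and it requires podman to be available on the nodes, as discussed earlier in this thread):

export RAY_ADDRESS=http://192.168.70.65:8265/
ray job submit --working-dir . --no-wait \
  --runtime-env-json='{"container": {"image": "<your-image>"}}' -- python test2.py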
Closing the issue as #53653 is merged and will be included in the Ray 2.49 release. Feel free to reopen it if that doesn't fix your issue.
I have the same error - but without any containers. What to do?
The job won't start and cannot even be stopped; it is always pending. I can't even tell what is being run, which would let me SSH somewhere and try to see a stack trace.