[Bug] RayJob falsely marked as "Running" when driver fails
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Observed Behavior
When the driver pod fails, the RayJob remains marked as Running even after the underlying Ray job completes.
Expected Behavior
We expect the Kubernetes RayJob object to reflect the status of the Ray job as seen on the head node. In this case, we expect the RayJob to be marked as complete once the underlying Ray job finishes successfully.
Reproduction script
Steps to reproduce
1. Create a training RayJob
Create a long-running training job by creating a RayJob on a Kubernetes cluster.
kubectl apply -f rayjob.yaml
The job used to reproduce this issue can be found here.
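For reference, a minimal sketch of such a RayJob follows. This is not the linked manifest; the API version, image tag, and resources are assumptions, and only the entrypoint matches what appears in the outputs below.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # assumed image tag
            resources:
              requests:
                cpu: "1"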
2. Delete the driver pod
Wait for the Ray job status to reach the “RUNNING” state, then delete the driver pod.
kubectl delete pods rayjob-sample-qrppt
This step simulates a failure of the driver pod, which could happen for multiple reasons such as node failure, network interruption, etc. A way to locate the pod without hardcoding its generated suffix is sketched below.
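The pod name above includes a generated suffix that differs per run. One way to find and delete the submitter pod (a sketch; it assumes the submitter Job is named after the RayJob and relies on the standard job-name label that Kubernetes attaches to pods created by a Job):

# Find the submitter pod created by the Kubernetes Job rayjob-sample
kubectl get pods -l job-name=rayjob-sample
# Delete it to simulate a driver failure
kubectl delete pods -l job-name=rayjob-sample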
3. Observe the job status in the head node
Get a shell into the head node (for example with kubectl exec, as sketched below) and check the status of the job. The job is still in the RUNNING state.
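A sketch of getting the shell (the head pod name is illustrative; find yours with kubectl get pods):

# Open an interactive shell in the head pod
kubectl exec -it rayjob-sample-raycluster-wn68n-head -- bash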
> ray list jobs
======== List: 2024-05-13 02:06:20.693535 ========
Stats:
------------------------------
Total: 1
Table:
------------------------------
    JOB_ID    SUBMISSION_ID        ENTRYPOINT                                TYPE        STATUS   MESSAGE                    ERROR_TYPE  DRIVER_INFO
 0  02000000  rayjob-sample-8b9r6  python /home/ray/samples/sample_code.py  SUBMISSION  RUNNING  Job is currently running.              id: '02000000'
                                                                                                                                        node_ip_address: 10.244.0.11
4. Wait for the Ray job to complete
Inside the head node, keep checking until the job completes. Once it does, the status will look similar to the following snippet (a polling one-liner is sketched after the output):
> ray list jobs
======== List: 2024-05-13 02:33:21.384251 ========
Stats:
------------------------------
Total: 1
Table:
------------------------------
    JOB_ID    SUBMISSION_ID        ENTRYPOINT                                TYPE        STATUS     MESSAGE                     ERROR_TYPE  DRIVER_INFO
 0  02000000  rayjob-sample-8b9r6  python /home/ray/samples/sample_code.py  SUBMISSION  SUCCEEDED  Job finished successfully.              id: '02000000'
                                                                                                                                           node_ip_address: 10.244.0.11
                                                                                                                                           pid: '1688'
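A sketch of polling for completion from inside the head node (assumes the submission ID shown above; the exact ray job status output format may vary by Ray version):

# Re-check every 30 seconds while the job is still RUNNING
while ray job status rayjob-sample-8b9r6 | grep -q RUNNING; do
  sleep 30
done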
5. Check the RayJob
Exit the head node shell and check the RayJob status using kubectl.
> kubectl get rayjob
NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
rayjob-sample   RUNNING      Failed              2024-05-13T08:53:33Z   2024-05-13T09:05:18Z   40m
Note that the RayJob is still in the RUNNING state while its deployment status is “Failed”. A way to read the underlying status fields directly is sketched below.
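The two columns come from the RayJob status subresource; a sketch of reading them directly (assuming the fields are named jobStatus and jobDeploymentStatus, as in recent KubeRay releases):

# Print the raw fields behind the JOB STATUS and DEPLOYMENT STATUS columns
kubectl get rayjob rayjob-sample -o jsonpath='{.status.jobStatus}{"\n"}{.status.jobDeploymentStatus}{"\n"}'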
Anything else
Driver logs
2024-05-13 02:04:53,418 INFO cli.py:36 -- Job submission server address: http://rayjob-sample-raycluster-wn68n-head-svc.default.svc.cluster.local:8265
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
return cli()
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli.py", line 272, in submit
job_id = client.submit_job(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/sdk.py", line 254, in submit_job
self._raise_error(r)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
raise RuntimeError(
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
resp = await job_agent_client.submit_job_internal(submit_request)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_head.py", line 80, in submit_job_internal
await self._raise_error(resp)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_head.py", line 68, in _raise_error
raise RuntimeError(f"Request failed with status code {status}: {error_text}.")
RuntimeError: Request failed with status code 400: Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_agent.py", line 45, in submit_job
submission_id = await self.get_job_manager().submit_job(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_manager.py", line 945, in submit_job
raise ValueError(
ValueError: Job with submission_id rayjob-sample-8b9r6 already exists. Please use a different submission_id.
.
.
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
I will open a PR to change the ray job submit behavior. Currently, if we use the same submission ID for multiple ray job submit commands, only the first one succeeds while all subsequent attempts fail immediately. I will modify the behavior so that subsequent ray job submit commands can tail the logs of the running Ray job instead of failing directly.
https://github.com/ray-project/ray/pull/45498
Sorry for the delay on this patch! I plan to revisit https://github.com/ray-project/ray/pull/45498 this week
There is some pushback from the Ray community on https://github.com/ray-project/ray/pull/45498. The KubeRay community has several possible solutions. Deferring this to v1.3.0.
The KubeRay community has several possible solutions
@kevin85421 do you have any details you can share here?
We are working on a lightweight submitter in KubeRay. Also, there is a PoC of just using ray job submit --no-wait + grep + ray job logs --follow to work around the duplicated submission error (sketched below): https://github.com/ray-project/kuberay/pull/2579
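A sketch of that workaround pattern (the submission ID and entrypoint are illustrative):

# Submit without waiting; if this submission ID was already used, the command
# fails with an "already exists" error, which grep can detect
ray job submit --no-wait --submission-id rayjob-sample-8b9r6 -- \
    python /home/ray/samples/sample_code.py 2>&1 | grep "already exists" \
    && echo "Job already submitted; attaching to the existing job"
# Either way, tail the logs of the running job
ray job logs --follow rayjob-sample-8b9r6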
Closed by https://github.com/ray-project/kuberay/pull/2579