[Bug] RayJob should surface errors with underlying RayCluster
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When you create a RayJob that exceeds the resource quota, instead of surfacing this error, the RayJob's status stays at Initializing:
```yaml
status:
  jobDeploymentStatus: Initializing
  jobId: rayjob-test2-krlp9
  rayClusterName: rayjob-test2-raycluster-bsh5z
  rayClusterStatus:
    desiredCPU: "0"
    desiredGPU: "0"
    desiredMemory: "0"
    desiredTPU: "0"
    head: {}
```
However, the RayCluster status correctly reflects that the cluster has failed due to a resource quota issue:
```yaml
status:
  desiredCPU: "0"
  desiredGPU: "0"
  desiredMemory: "0"
  desiredTPU: "0"
  head: {}
  reason: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'
  state: failed
```
It would be great if the RayJob status could also accurately reflect the failed state of the underlying RayCluster, such as by setting the jobDeploymentStatus to Failed and the message to the error message.
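For illustration only (the exact shape is up for discussion; I'm just reusing the error message from the RayCluster above), the RayJob status could then look something like:

```yaml
status:
  jobDeploymentStatus: Failed
  message: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'
  jobId: rayjob-test2-krlp9
  rayClusterName: rayjob-test2-raycluster-bsh5z
```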
Reproduction script
I can contribute an integration test for this, but it would be too long to include here.
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Honestly, I don't think KubeRay should handle and expose K8s Pod errors. You can think of RayCluster as equivalent to multiple ReplicaSets. ReplicaSetStatus doesn't include "Pod failure" in its status. Maybe we can introduce a new conditions field to handle Pod-level observability. Currently, the RayCluster state includes Failed, which is quite undefined and makes the state machine rather messy. I am planning to refactor the RayCluster status soon. If you are interested, we can work on it together, or you can provide feedback on my design document.
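As a rough sketch of what I mean (the condition type and fields below are placeholders, not a final design), the RayCluster status could carry something like:

```yaml
status:
  conditions:
  - type: PodProvisioning      # placeholder name, not an actual KubeRay condition type
    status: "False"
    reason: FailedCreate
    message: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
      exceeded quota: low-resource-quota, ...'
```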
I'd be happy to help out here if help is needed @han-steve!
@MadhavJivrajani Great! I will let you know when I have a doc.
Thanks for the response. I agree that the status state machine can get messy with pod failure statuses. An alternative would be to use the Conditions field to reflect errors in the underlying cluster. For example, ReplicaSet and Deployment use a Condition to inform the user that pods failed to scale up due to a resource quota error (see the example at the end of this comment). They also produce events that can easily be seen with `kubectl describe`.
Our goal is to surface the underlying error to the user so they know if a job is pending or stuck due to resource quota errors. If there's no plan to surface these conditions, we'll query the associated ray cluster for this info to show to the user. Thanks again for taking a look!
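For reference, when a ReplicaSet cannot create pods because of a quota, the controller sets a condition roughly like this (the pod name here is made up, but ReplicaFailure and FailedCreate are the type and reason Kubernetes actually uses):

```yaml
status:
  conditions:
  - type: ReplicaFailure
    status: "True"
    reason: FailedCreate
    message: 'pods "example-rs-abc12" is forbidden: exceeded quota: low-resource-quota, ...'
```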
I have already worked on a document. I will let you know when it is ready for review.
cc @MortalHappiness
Hi @han-steve @MadhavJivrajani,
I have scheduled a meeting for the RayCluster status improvement work stream on July 10, 8:30-8:55 AM PT. You can add the following Google calendar to subscribe to events for the Ray / KubeRay open-source community.
Thanks for the hard work! I'm currently querying the RayCluster CRD for error messages such as resource quota issues. Would love to see what improvements you make, and I'll try to make it to the sync!
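(In case it's useful to anyone else: a command along the lines of `kubectl get raycluster rayjob-test2-raycluster-bsh5z -o jsonpath='{.status.reason}'` pulls the quota error out of the example above.)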
Hi @kevin85421, we have a similar requirement where we want to expose the errors encountered by Ray pods to users. The main reason is that some of these errors can be self-served by the users of the Ray jobs without further involvement or debugging. Please let me know if you have already published the doc or if there are any meeting notes from this. Thanks.