
[Bug] RayJob should surface errors with underlying RayCluster

Open han-steve opened this issue 1 year ago • 6 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When you create a RayJob that exceeds the resource quota, the RayJob's status stays at Initializing instead of surfacing the error:

status:
  jobDeploymentStatus: Initializing
  jobId: rayjob-test2-krlp9
  rayClusterName: rayjob-test2-raycluster-bsh5z
  rayClusterStatus:
    desiredCPU: "0"
    desiredGPU: "0"
    desiredMemory: "0"
    desiredTPU: "0"
    head: {}

However, the RayCluster status correctly reflects the fact that the cluster has failed due to a resource quota issue:

status:
  desiredCPU: "0"
  desiredGPU: "0"
  desiredMemory: "0"
  desiredTPU: "0"
  head: {}
  reason: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'
  state: failed

It would be great if the RayJob status could also accurately reflect the failed state of the underlying RayCluster, for example by setting jobDeploymentStatus to Failed and the message field to the error message.
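For illustration, something along these lines would work for us (a sketch only; a top-level message field like this is a suggestion, not necessarily part of the existing RayJob API):

status:
  jobDeploymentStatus: Failed
  jobId: rayjob-test2-krlp9
  rayClusterName: rayjob-test2-raycluster-bsh5z
  message: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'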

Reproduction script

I can contribute an integration test for this, but it would be too long to include here.
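As a rough manual reproduction (a sketch; the quota values, image tag, and RayJob spec below are illustrative rather than the exact manifests I used), create a tight ResourceQuota in the namespace and then a RayJob whose pods request more than the quota allows:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-resource-quota
spec:
  hard:
    limits.cpu: 100m
    limits.memory: 100Mi
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-test2
spec:
  entrypoint: python -c "import ray; ray.init()"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0   # illustrative image tag
              resources:
                limits:
                  cpu: 200m
                  memory: 256Mi
    workerGroupSpecs:
      - groupName: small-wg
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  limits:
                    cpu: 200m
                    memory: 256Mi

The RayCluster pods are rejected by quota admission, the RayCluster state goes to failed, and the RayJob stays in Initializing as shown above.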

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

han-steve avatar Jun 07 '24 06:06 han-steve

Honestly, I don't think KubeRay should handle and expose K8s Pod errors. You can think of RayCluster as equivalent to multiple ReplicaSets. ReplicaSetStatus doesn't include "Pod failure" in its status. Maybe we can introduce a new conditions field to handle Pod-level observability. Currently, the RayCluster state includes Failed, which is quite undefined and makes the state machine rather messy. I am planning to refactor the RayCluster status soon. If you are interested, we can work on it together, or you can provide feedback on my design document.
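For example, a Pod-level condition on the RayCluster status could look roughly like this (a sketch only; the type and reason names are hypothetical, not an existing API):

conditions:
  - type: PodProvisioning        # hypothetical condition type
    status: "False"
    reason: FailedCreate
    message: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
      exceeded quota: low-resource-quota, ...'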

kevin85421 avatar Jun 08 '24 05:06 kevin85421

I'd be happy to help out here if help is needed @han-steve!

MadhavJivrajani avatar Jun 08 '24 11:06 MadhavJivrajani

@MadhavJivrajani Great! I will let you know when I have a doc.

kevin85421 avatar Jun 09 '24 05:06 kevin85421

Thanks for the response. I agree that the status state machine can get messy with Pod failure statuses. An alternative would be to use the Conditions field to reflect errors in the underlying cluster. For example, ReplicaSet and Deployment use a Condition to inform the user that Pods failed to scale up due to a resource quota error. They also emit events that can easily be seen with kubectl describe.
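For reference, the ReplicaSet controller reports quota failures with a condition of roughly this shape (the pod and quota names below are placeholders):

conditions:
  - type: ReplicaFailure
    status: "True"
    reason: FailedCreate
    message: 'pods "example-rs-abcde" is forbidden: exceeded quota: example-quota, ...'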

Our goal is to surface the underlying error to the user so they know if a job is pending or stuck due to resource quota errors. If there's no plan to surface these conditions, we'll query the associated ray cluster for this info to show to the user. Thanks again for taking a look!

han-steve avatar Jun 15 '24 04:06 han-steve

I have already worked on a document. I will let you know when it is ready for review.

kevin85421 avatar Jun 17 '24 02:06 kevin85421

cc @MortalHappiness

kevin85421 avatar Jun 19 '24 01:06 kevin85421

Hi @han-steve @MadhavJivrajani,

I have scheduled a meeting for the RayCluster status improvement work stream on July 10, 8:30 - 8:55 AM PT. You can add the following Google calendar to subscribe to events for the Ray / KubeRay open-source community.

https://calendar.google.com/calendar/u/0?cid=Y19iZWIwYTUxZDQyZTczMTFmZWFmYTY5YjZiOTY1NjAxMTQ3ZTEzOTAxZWE0ZGU5YzA1NjFlZWQ5OTljY2FiOWM4QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20

kevin85421 avatar Jul 07 '24 22:07 kevin85421

Thanks for the hard work! I'm currently querying the RayCluster CRD for error messages such as resource quota issues. Would love to see what improvements you make, and I'll try to make it to the sync!

han-steve avatar Jul 29 '24 21:07 han-steve

hi @kevin85421 we have a similar requirement where we want to expose the errors encountered by Ray Pods to the users. The main reason is that some of these errors can be self-served by the users of the Ray jobs without further involvement or debugging. Please let me know if you have already published the doc or if there are any meeting notes from this. Thanks.

ghost avatar Aug 20 '24 23:08 ghost