[Feature][ray-operator] Make pod creation errors accessible

Open DmitriGekhtman opened this issue 3 years ago • 1 comments

Search before asking

  • [X] I have searched the issues and found no similar feature request.

Description

Right now, if the operator fails to create a head or worker pod for any reason, neither the failure nor its cause is surfaced anywhere except the operator logs. The only thing you can see is status.state = failed in the RayCluster CR.

Typically, we don't want end-users to have to dig through operator logs. We should surface pod creation issues in a user-accessible way.

Use case

Observability into failure of pod creation, which can happen for a variety of reasons.

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

DmitriGekhtman avatar Sep 29 '22 00:09 DmitriGekhtman

cc @davidxia, re: surfacing quota issues

DmitriGekhtman avatar Sep 29 '22 00:09 DmitriGekhtman

@DmitriGekhtman thanks for creating this issue.

> We should surface pod creation issues in a user-accessible way.

One specific example of this is when a K8s ResourceQuota is exhausted. The kuberay-operator logs show an error like

2022-10-17T18:10:34.669Z ERROR controller.raycluster-controller Reconciler error {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "name": "andreasd-cluster1", "namespace": "podampray", "error": "pods \"andreasd-cluster1-head-rs8wr\" is forbidden: exceeded quota: quota, requested: limits.cpu=12,requests.cpu=12, used: limits.cpu=996,requests.cpu=996, limited: limits.cpu=1k,requests.cpu=1k"}
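Until the error is surfaced on the resource, the log line is the only source of the reason. As a rough sketch of how to pull it out, assuming the log format is exactly as shown above (message text followed by a single JSON object), something like the following Python works. The function name `extract_reconciler_error` is made up for illustration and is not part of KubeRay.

```python
import json


def extract_reconciler_error(log_line: str) -> dict:
    """Return the structured fields of a 'Reconciler error' log line.

    Assumes the structured payload is the JSON object that starts at the
    first '{' in the line, as in the example log line from this thread.
    """
    start = log_line.index("{")
    return json.loads(log_line[start:])


# The log line quoted above, split across string literals for readability.
line = (
    '2022-10-17T18:10:34.669Z ERROR controller.raycluster-controller Reconciler error '
    '{"reconciler group": "ray.io", "reconciler kind": "RayCluster", '
    '"name": "andreasd-cluster1", "namespace": "podampray", '
    '"error": "pods \\"andreasd-cluster1-head-rs8wr\\" is forbidden: exceeded quota: quota, '
    'requested: limits.cpu=12,requests.cpu=12, used: limits.cpu=996,requests.cpu=996, '
    'limited: limits.cpu=1k,requests.cpu=1k"}'
)

fields = extract_reconciler_error(line)
print(fields["namespace"], fields["name"])
print(fields["error"])
```

This only helps someone who can already read the operator logs, which is the limitation this issue is about.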

The RayCluster resource only says the following. It should provide more info.

kubectl -n podampray get rayclusters andreasd-cluster1
NAME                AGE
andreasd-cluster1   39m

kubectl --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n podampray get rayclusters andreasd-cluster1 -o json | jq .status
{
  "availableWorkerReplicas": 6,
  "desiredWorkerReplicas": 5,
  "lastUpdateTime": "2022-10-17T18:46:21Z",
  "maxWorkerReplicas": 11,
  "minWorkerReplicas": 5,
  "state": "failed"
}

kubectl --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n podampray describe rayclusters andreasd-cluster1

...

Status:
  State:  failed
Events:
  Type    Reason   Age   From                   Message
  ----    ------   ----  ----                   -------
  Normal  Created  35m   raycluster-controller  Created service andreasd-cluster1-head-svc

davidxia avatar Oct 17 '22 18:10 davidxia

Seems like a natural and important addition to the raycluster controller's status-reporting logic. cc @Jeffwan @akanso

DmitriGekhtman avatar Oct 17 '22 18:10 DmitriGekhtman

@DmitriGekhtman, how about setting a status.reason on the RayCluster with the error message? I'm stubbing out a PR since I have time.

kubectl --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n podampray get rayclusters andreasd-cluster1 -o yaml

...
status:
  state: failed
  reason: pods "andreasd-cluster1-head-rs8wr" is forbidden: exceeded quota: quota, requested: limits.cpu=12,requests.cpu=12, used: limits.cpu=996,requests.cpu=996, limited: limits.cpu=1k,requests.cpu=1k
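The idea can be sketched in a few lines. This is a language-neutral illustration written in Python, not the operator's actual code (the operator is written in Go): `RayClusterStatus`, `reconcile_pods`, and `quota_exhausted` are hypothetical names, and the error message is abbreviated from the one above.

```python
from dataclasses import dataclass


@dataclass
class RayClusterStatus:
    # Hypothetical stand-in for the status block of a RayCluster resource.
    state: str = ""
    reason: str = ""


def reconcile_pods(status: RayClusterStatus, create_pod) -> bool:
    """Try to create a pod; on failure, record the cause in the status."""
    try:
        create_pod()
    except Exception as err:
        # Copy the error text into the status so users can see it
        # without access to the operator logs.
        status.state = "failed"
        status.reason = str(err)
        return False
    # Clear a stale reason once pod creation succeeds.
    status.reason = ""
    return True


def quota_exhausted():
    # Simulates the API server rejecting the pod (message abbreviated).
    raise RuntimeError(
        'pods "andreasd-cluster1-head-rs8wr" is forbidden: exceeded quota: quota'
    )


status = RayClusterStatus()
ok = reconcile_pods(status, quota_exhausted)
print(ok, status.state)
print(status.reason)
```

Clearing the reason on success is a design choice of this sketch, so that a message from an earlier failure does not linger after the cluster recovers.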

davidxia avatar Oct 17 '22 19:10 davidxia

Thanks for the PR draft! Will take a look.

DmitriGekhtman avatar Oct 17 '22 19:10 DmitriGekhtman