kuberay
[Feature][ray-operator] Make pod creation errors accessible
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
Right now, if the operator fails to create a head or worker pod for any reason, neither the failure nor its cause is surfaced anywhere except in the operator logs. The only thing you can see is status.state = failed in the RayCluster CR.
Typically, we don't want end-users to have to dig through operator logs. We should surface pod creation issues in a user-accessible way.
Use case
Observability into pod creation failures, which can happen for a variety of reasons.
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
cc @davidxia, re: surfacing quota issues
@DmitriGekhtman thanks for creating this issue.
We should surface pod creation issues in a user-accessible way.
One specific example of this is when a K8s ResourceQuota is exhausted. The kuberay-operator logs show an error like
2022-10-17T18:10:34.669Z ERROR controller.raycluster-controller Reconciler error {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "name": "andreasd-cluster1", "namespace": "podampray", "error": "pods \"andreasd-cluster1-head-rs8wr\" is forbidden: exceeded quota: quota, requested: limits.cpu=12,requests.cpu=12, used: limits.cpu=996,requests.cpu=996, limited: limits.cpu=1k,requests.cpu=1k"}
The RayCluster resource only says the following. It should provide more info.
kubectl -n podampray get rayclusters andreasd-cluster1
NAME AGE
andreasd-cluster1 39m
kubectl --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n podampray get rayclusters andreasd-cluster1 -o json | jq .status
{
  "availableWorkerReplicas": 6,
  "desiredWorkerReplicas": 5,
  "lastUpdateTime": "2022-10-17T18:46:21Z",
  "maxWorkerReplicas": 11,
  "minWorkerReplicas": 5,
  "state": "failed"
}
kubectl --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n podampray describe rayclusters andreasd-cluster1
...
Status:
  State:  failed
Events:
  Type    Reason   Age  From                   Message
  ----    ------   ---  ----                   -------
  Normal  Created  35m  raycluster-controller  Created service andreasd-cluster1-head-svc
Seems like a natural and important addition to the raycluster controller's status-reporting logic. cc @Jeffwan @akanso
@DmitriGekhtman, how about setting a status.reason on the RayCluster with the error message? I'm stubbing out a PR since I have time.
kubectl --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n podampray get rayclusters andreasd-cluster1 -o yaml
...
status:
  state: failed
  reason: pods "andreasd-cluster1-head-rs8wr" is forbidden: exceeded quota: quota, requested: limits.cpu=12,requests.cpu=12, used: limits.cpu=996,requests.cpu=996, limited: limits.cpu=1k,requests.cpu=1k
Thanks for the PR draft! Will take a look.