[Regression from v1.1] RayCluster Condition should surface resource quota issue correctly
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
In v1.1, when a RayCluster cannot create worker Pods due to a resource quota issue, it would have the following status:
status:
  ...
  reason: 'pods "rayjob-test2-raycluster-bsh5z-worker-small-wg-mh9v2" is forbidden:
    exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi,
    used: limits.cpu=0,limits.memory=0, limited: limits.cpu=100m,limits.memory=107374182400m'
  state: failed
However, in v1.2, it simply says
status:
  ...
  lastUpdateTime: "2024-09-05T23:43:05Z"
  maxWorkerReplicas: 1
  minWorkerReplicas: 1
  observedGeneration: 1
  state: ready
  stateTransitionTimes:
    ready: "2024-09-05T23:43:05Z"
First, the state should not be ready. According to the doc,
When all Pods in the RayCluster are running for the first time, the RayCluster CR’s Status.State will transition to ready.
But not all pods in the RayCluster are running.
Second, the resource quota error should be added as a condition. The design doc alludes to emulating ReplicaSet conditions, which include a condition type for resource quota failures. Right now, the only place to find this error is in the operator logs:
..."error":"FailedCreateWorkerPod pods \"rayjob-test2-raycluster-x7242-small-wg-worker-rbkqx\" is forbidden: exceeded quota: low-resource-quota, requested: limits.cpu=200m,limits.memory=256Mi, ...
This makes it impossible for users to self-serve and debug this error.
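For illustration, a condition modeled on ReplicaSet's ReplicaFailure would let users self-serve. A rough sketch of what the status could surface (the type and reason names below mirror ReplicaSet and are only an assumption about what KubeRay might use):
status:
  conditions:
  - type: ReplicaFailure       # name borrowed from ReplicaSet; the exact KubeRay type is an assumption
    status: "True"
    reason: FailedCreate
    message: 'pods "..." is forbidden: exceeded quota: low-resource-quota, ...'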
As mentioned in https://github.com/ray-project/kuberay/issues/2182, our current way of surfacing this error to the user when deploying a Ray Job is to query the RayCluster separately for the error:
// Fetch the RayCluster and surface its failure reason, if any.
rayCluster, err := r.rayClient.RayClusters(namespace).
	Get(ctx, clusterName, metav1.GetOptions{})
if err != nil {
	return "", err
}
if rayCluster.Status.State == rayv1api.Failed {
	return rayCluster.Status.Reason, nil
}
However, the v1.2 update breaks this logic, so we cannot upgrade to v1.2 yet.
Reproduction script
Create a resource quota:
resourceQuota := corev1.ResourceQuota{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "low-resource-quota",
		Namespace: tCtx.namespace,
	},
	Spec: corev1.ResourceQuotaSpec{
		Hard: corev1.ResourceList{
			corev1.ResourceLimitsCPU:    resource.MustParse(".1"),
			corev1.ResourceLimitsMemory: resource.MustParse(".1Gi"),
		},
	},
}
_, err := tCtx.k8sClient.CoreV1().
	ResourceQuotas(tCtx.namespace).
	Create(tCtx.ctx, &resourceQuota, metav1.CreateOptions{})
Deploy a Ray job:
const sampleJobResponseJSON = `{"name":"rayjob-test2", "namespace":"kubeflow-ml", "user":"3cp0", "entrypoint":"python -V", "clusterSpec":{"headGroupSpec":{"computeTemplate":"default-template", "image":"rayproject/ray:2.9.0", "serviceType":"NodePort", "rayStartParams":{"dashboard-host":"0.0.0.0"}, "environment":{}, "labels":{"sidecar.istio.io/inject":"false"}}, "workerGroupSpec":[{"groupName":"small-wg", "computeTemplate":"default-template", "image":"rayproject/ray:2.9.0", "replicas":1, "minReplicas":1, "maxReplicas":1, "rayStartParams":{"metrics-export-port":"8080"}, "environment":{}, "labels":{"sidecar.istio.io/inject":"false"}}]}, "createdAt":"2024-05-20T21:33:03Z"}`
Where the compute template is defined as
configMap := corev1.ConfigMap{
	ObjectMeta: metav1.ObjectMeta{
		Name:      tCtx.computeTemplate,
		Namespace: tCtx.namespace,
		Labels: map[string]string{
			"ray.io/compute-template": tCtx.computeTemplate,
			"ray.io/config-type":      "compute-template",
		},
	},
	Data: map[string]string{
		"cpu":             ".1",
		"gpu":             "0",
		"gpu_accelerator": "",
		"memory":          ".2",
		"name":            tCtx.computeTemplate,
		"namespace":       tCtx.namespace,
	},
}
_, err := tCtx.k8sClient.CoreV1().
	ConfigMaps(tCtx.namespace).
	Create(tCtx.ctx, &configMap, metav1.CreateOptions{})
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
@kevin85421 @rueian @MortalHappiness I think this regression was introduced in https://github.com/ray-project/kuberay/pull/2249?
The RayCluster status changes called out in the v1.2 release notes are guarded by a new feature gate, RayClusterStatusConditions. To my understanding, the existing APIs for .status.state and .status.reason should be unchanged unless you enable the new feature gate. This seems like a possible bug introduced when we refactored how reconcile errors are surfaced in https://github.com/ray-project/kuberay/pull/2249
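For reference, my understanding is that the gate is toggled on the operator itself. A minimal sketch of the operator container args, assuming the --feature-gates flag wiring added in v1.2 (the exact flag spelling is my assumption, please double-check against the operator's flags):
# Hedged sketch: enable the gate on the kuberay-operator container (flag name assumed)
args:
- --feature-gates=RayClusterStatusConditions=true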
Hi, thanks for taking a look. I wasn't aware that there's a feature gate. Can you tell me more? And for more context, the state machine used to fail with the reason field set when there's a resource quota issue. The new state machine doesn't seem to surface the error at all.
@han-steve Could you provide more detailed steps on how to reproduce this issue? I tried to reproduce the error, but it’s difficult to do so with the partial Go code you provided. For instance, I’m unsure what the value of tCtx is, and it's challenging to reconstruct sampleJob just by looking at the JSON response.
The most effective way for us to reproduce the issue would be a single YAML file that we can apply with kubectl easily, something like the following:
apiVersion: v1
kind: ResourceQuota
(other fields...)
---
apiVersion: ray.io/v1
kind: RayJob
(other fields...)
---
apiVersion: v1
kind: ConfigMap
(other fields...)
Additionally, I searched through the codebase and found that the "ray.io/compute-template" annotation only appears in the apiserver module, which hasn’t been maintained for a long time. Therefore, I’m unsure if this issue pertains solely to the apiserver module.
cc @andrewsykim @kevin85421 @rueian
Hi, apologies for the confusion. The reproduction code follows the style of the apiserver integration test suite. Here's a YAML reproduction:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-resource-quota
spec:
  hard:
    limits.cpu: 100m
    limits.memory: 107374182400m
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: rayjob-test2-raycluster-jd6g6
  namespace: test-namespace
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: NodePort
    template:
      metadata:
        annotations:
          ray.io/compute-image: rayproject/ray:2.9.0
          ray.io/compute-template: test-anemone
        labels:
          sidecar.istio.io/inject: "false"
      spec:
        containers:
        - env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: rayproject/ray:2.9.0
          imagePullPolicy: IfNotPresent
          name: ray-head
          ports:
          - containerPort: 6379
            name: redis
            protocol: TCP
          - containerPort: 10001
            name: head
            protocol: TCP
          - containerPort: 8265
            name: dashboard
            protocol: TCP
          - containerPort: 8080
            name: metrics
            protocol: TCP
          resources:
            limits:
              cpu: "0"
              memory: "0"
            requests:
              cpu: "0"
              memory: "0"
  rayVersion: 2.32.0
  workerGroupSpecs:
  - groupName: small-wg
    maxReplicas: 1
    minReplicas: 1
    numOfHosts: 1
    rayStartParams:
      metrics-export-port: "8080"
    replicas: 1
    scaleStrategy: {}
    template:
      metadata:
        annotations:
          ray.io/compute-image: rayproject/ray:2.9.0
          ray.io/compute-template: test-anemone
        labels:
          sidecar.istio.io/inject: "false"
      spec:
        containers:
        - env:
          - name: RAY_DISABLE_DOCKER_CPU_WARNING
            value: "1"
          - name: TYPE
            value: worker
          - name: CPU_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: requests.cpu
          - name: CPU_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: limits.cpu
          - name: MEMORY_REQUESTS
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: requests.cpu
          - name: MEMORY_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: ray-worker
                divisor: "0"
                resource: limits.cpu
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          image: rayproject/ray:2.9.0
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - ray stop
          name: ray-worker
          ports:
          - containerPort: 80
            protocol: TCP
          resources:
            limits:
              cpu: "0"
              memory: "0"
            requests:
              cpu: "0"
              memory: "0"
It seems to be introduced by #2258 (https://github.com/ray-project/kuberay/pull/2258/files#diff-72ecc3ca405f1e828187748d4f1ec8160bccffa2a4f84a364cd7a94a78e1adb9L1152-L1157).
@kevin85421 Do you mean #2249 or #2258? You said 2249 but the link you provided is 2258.
@MortalHappiness sorry, I am referring to #2258.
@rueian should we address this for v1.3?
Thanks @andrewsykim for the reminder about this issue. Yes, we should address it. In this case, the cluster should not be marked as ready. However, with https://github.com/ray-project/kuberay/pull/2258, we ignore the reconcileErr and always assign rayv1.Ready to the CR when utils.CheckAllPodsRunning(ctx, runtimePods) passes, but the Pod creation error actually lives in the reconcileErr. That is why we have this regression.
I think we should fix this by changing:
if utils.CheckAllPodsRunning(ctx, runtimePods) {
	newInstance.Status.State = rayv1.Ready
}
back to
if reconcileErr != nil {
	if !features.Enabled(features.RayClusterStatusConditions) {
		newInstance.Status.State = rayv1.Failed
		newInstance.Status.Reason = reconcileErr.Error()
	}
} else {
	if utils.CheckAllPodsRunning(ctx, runtimePods) {
		newInstance.Status.State = rayv1.Ready
	}
}
To summarize: when the feature gate is disabled, we bring back rayv1.Failed; when it is enabled, we leave .State untouched if there is a reconcileErr.
Does this look good to you @kevin85421, @andrewsykim?
Shouldn't .Status.State behavior be independent of conditions entirely? Generally we want to point users toward the new conditions, but that doesn't mean .status.state behavior should change.
That makes sense. Then we have two choices:
- Since rayv1.Failed is gone in v1.2, we don't bring it back:
if reconcileErr != nil {
	// do nothing
} else {
	if utils.CheckAllPodsRunning(ctx, runtimePods) {
		newInstance.Status.State = rayv1.Ready
	}
}
- We just bring it back anyway:
if reconcileErr != nil {
	newInstance.Status.State = rayv1.Failed
	newInstance.Status.Reason = reconcileErr.Error()
} else {
	if utils.CheckAllPodsRunning(ctx, runtimePods) {
		newInstance.Status.State = rayv1.Ready
	}
}
If we opt for the latter option, it implies a return to the v1.1 behavior. In this case, the .State can transition between Ready and Failed states, whereas in v1.2, the .State remains in the Ready state.
Unless we can do a backport to v1.2.3, maybe it's not worth fixing at this point :(
@han-steve do you still need this regression fixed?
Thanks for the ideas! As long as the conditions field still surfaces the resource quota error, it should be fine. We can read it from the status.
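For example, on our side we could switch the lookup to something along these lines (a rough sketch; the condition type string here is an assumption and would need to match whatever KubeRay settles on):
// Hedged sketch: read the failure from status.conditions instead of status.reason.
// "ReplicaFailure" is an assumed condition type, not a confirmed KubeRay constant.
rayCluster, err := r.rayClient.RayClusters(namespace).
	Get(ctx, clusterName, metav1.GetOptions{})
if err != nil {
	return "", err
}
for _, cond := range rayCluster.Status.Conditions {
	if cond.Type == "ReplicaFailure" && cond.Status == metav1.ConditionTrue {
		return cond.Message, nil
	}
}
return "", nil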
Hi all, I just sent a PR for the first approach I mentioned which leaves the .State untouched and sets the error message to the .Reason field. Please take a look at it to see if it can solve the problem.