[Bug] Worker group pods stuck at initialization

Open Jeffwan opened this issue 4 years ago • 1 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

There are two issues:

  1. Kubernetes actually allows creating resources whose names start with a digit, but our name-check logic replaces the first character with "r" to follow naming conventions. https://github.com/ray-project/kuberay/blob/efbbbe7dd946bb2cde40aa2ffa1cb5093a346b26/ray-operator/controllers/utils/util.go#L33-L44

  2. For workers, the pod name is constructed from cluster name + role + worker group name, which can exceed 63 characters. The script truncates the name, which is expected. But the $RAY_IP injected into the pod to connect to the head svc is incorrect: it is not built with the same truncation logic, so the workers get stuck in the initialization phase.

https://github.com/ray-project/kuberay/blob/efbbbe7dd946bb2cde40aa2ffa1cb5093a346b26/ray-operator/controllers/utils/util.go#L26-L31

The major problem is that we use a UUID as the cluster name, which is too long. But the validation part probably needs better protection.
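
A minimal sketch of the direction for issue 2, not the actual code in util.go: derive the head service name through one helper and reuse that same helper when computing the value injected as $RAY_IP, so the two can never drift apart. The names `sanitizeName`, `headServiceName`, `rayIPForWorker`, the `-head-svc` suffix, and the choice to keep the tail of an over-long name are all assumptions made for illustration.

```go
package main

import "unicode"

// maxLabelLen is the 63-character limit mentioned in this issue.
const maxLabelLen = 63

// sanitizeName is a hypothetical helper. It shortens s to at most budget
// bytes (keeping the tail, which is an assumption) and replaces a leading
// digit with "r". It is not a complete Kubernetes name validator.
func sanitizeName(s string, budget int) string {
	if len(s) > budget {
		s = s[len(s)-budget:]
	}
	if s != "" && unicode.IsDigit(rune(s[0])) {
		s = "r" + s[1:]
	}
	return s
}

// headServiceName builds the (assumed) head service name for a cluster.
func headServiceName(cluster string) string {
	return sanitizeName(cluster+"-head-svc", maxLabelLen)
}

// rayIPForWorker returns the value to inject as RAY_IP into worker pods.
// It calls headServiceName instead of rebuilding the string, so any
// truncation or renaming applied to the service is applied here too.
func rayIPForWorker(cluster string) string {
	return headServiceName(cluster)
}
```

With this shape, a cluster named with a UUID that starts with a digit, or one long enough to be truncated, yields the same string for the service and for $RAY_IP.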

Reproduction script

Create a Ray cluster with a long name and at least one worker group, named small-group.

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Jeffwan, Oct 30 '21 08:10

This is not a critical bug, and we can schedule it for the v0.3.0 release.

Jeffwan, Mar 14 '22 04:03

Is this resolved?

DmitriGekhtman, Dec 09 '22 16:12