[Bug] Worker group pods stuck at initialization
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen

There are two issues:
- Kubernetes actually allows creating resources whose names start with a digit, but our name-check logic replaces the first character with "r" to satisfy naming conventions. https://github.com/ray-project/kuberay/blob/efbbbe7dd946bb2cde40aa2ffa1cb5093a346b26/ray-operator/controllers/utils/util.go#L33-L44
- The worker pod name is constructed from the cluster name + role + worker group name, which is over 63 characters. The code truncates the name, which is expected. However, the $RAY_IP injected into the pod to connect to the head service is incorrect: it is not built with the same logic, so the worker pods get stuck in the initialization phase.
https://github.com/ray-project/kuberay/blob/efbbbe7dd946bb2cde40aa2ffa1cb5093a346b26/ray-operator/controllers/utils/util.go#L26-L31
The main problem is that we use a UUID as the cluster name, which is too long, but the validation logic probably also needs better protection.
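
To make the mismatch concrete, here is a minimal sketch of the failure mode, not the operator's actual code. `normalizeName`, `headServiceName`, `rawHeadServiceName`, the 50-character budget, the front truncation, and the `-head-svc` suffix are all assumptions for illustration. The point is only that the value behind $RAY_IP has to go through the same normalization as the name the head service is created under.

```go
package main

import "unicode"

// maxNameLen is an assumed length budget for this sketch,
// not necessarily the constant used in ray-operator.
const maxNameLen = 50

// normalizeName is a hypothetical stand-in for the operator's name check.
// It keeps the last maxNameLen bytes of an over-long name and replaces a
// leading digit with 'r'. Names are assumed to be ASCII, so slicing by
// byte is safe here.
func normalizeName(s string) string {
	if len(s) > maxNameLen {
		s = s[len(s)-maxNameLen:]
	}
	if s != "" && unicode.IsDigit(rune(s[0])) {
		s = "r" + s[1:]
	}
	return s
}

// headServiceName returns the name the head service would be created
// under in this sketch: the raw name passed through normalizeName.
func headServiceName(cluster string) string {
	return normalizeName(cluster + "-head-svc")
}

// rawHeadServiceName models the bug: the address handed to the worker
// is built from the raw cluster name and skips normalizeName.
func rawHeadServiceName(cluster string) string {
	return cluster + "-head-svc"
}
```

With a UUID-style cluster name that starts with a digit, `headServiceName` and `rawHeadServiceName` return different strings, so a worker given the raw value would try to reach a service that does not exist. Building both values through one shared helper removes that class of mismatch.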
Reproduction script
Create a Ray cluster with a long name and at least one worker node group named small-group.
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
This is not a critical bug, and we can target it for the v0.3.0 release.
Is this resolved?