kuberay icon indicating copy to clipboard operation
kuberay copied to clipboard

[Bug] ray service example will have worker that name is unexpectedly trimmed

Open scarlet25151 opened this issue 3 years ago • 2 comments
trafficstars

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

in the example: https://github.com/ray-project/kuberay/blob/9bb690ec910d13068216a14e1c4aed378a36392d/ray-operator/config/samples/ray_v1alpha1_rayservice.yaml#L4 this may lead to a too-long name problem for worker so in operator therefore it would be trimmed.

NAME                                                      READY   STATUS    RESTARTS         AGE
ervice-sample-raycluster-fs6fl-worker-small-group-25lpg   1/1     Running   0                48m
rayservice-sample-raycluster-fs6fl-head-vmqmz             1/1     Running   0                48m

it would be wired for the names and we expect it to be the correct name prefix just like head pod

Reproduction script

just apply the example in ray-operator/config/sample/ray_v1alpha1_rayservice.yaml

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

scarlet25151 avatar Jul 28 '22 00:07 scarlet25151

I ran into the same issue. btw, are we able to make the name deterministic? Now

  • Head node name is <cluster_name>-raycluster-<random_string>-head-<random_string>
  • Worker name is <cluster_name>-raycluster-<random_string>-worker-small-group-<random_string>

Could we remove the random strings from the name? IIUC, <cluster_name> is already unique, so we don't need to add random strings to it, right? I think the name could be

  • Head Node: <cluster_name>-raycluster-head.
  • Worker Node: <cluster_name>-raycluster-worker-small-group-<0/1/2>

In Flyte, we use the RayJob name to construct a Kubernetes dashboard URL (http://flyte:30082/#/log/{{ .namespace }}/{{ .RayJobName+"-head" }}/pod?namespace={{ .namespace }}). If the pod name is non-deterministic, we should send another request to the API server to get the pod name.

If we have to keep the name non-deterministic, could we add the name of the head node to Raycluster.status?

cc @Jeffwan

pingsutw avatar Jul 28 '22 15:07 pingsutw

Per @scarlet25151's question, the major reason name changed is because of logic here

https://github.com/ray-project/kuberay/blob/0c98cf2797e6e0e96fbc9889509f622373f7d89f/ray-operator/controllers/ray/utils/util.go#L49-L72

DNS-1123 subdomain allows up to 253 char. @akanso Do you know why it's controlled under 64? This seems a very restricted setting. Can we update it and match with kubernetes setting?

For the enhancement @pingsutw suggests, I think

  1. If the non-deterministic cause any failures now, then we should definitely fix that. The underneath reason is we don't want to manage worker in stateful way, stateless makes more sense we don't need to care about the sequence no. if we use exact same name for head, then the new one has to be created after the old one is totally removed. There would be cases in the future to support HA (multi-heads), so that's why it uses random strings for both head and worker now.
  2. Add head pod name to RayCluster.status is reasonable for such cases, we can make it. If you have time, feel free to cut a PR.

BTW, how do you get application logs from RayJob now? http://flyte:30082/#/log/{{ .namespace }}/{{ .RayJobName+"-head" }}/pod?namespace={{ .namespace }} what's the underneath logic?

Jeffwan avatar Jul 29 '22 17:07 Jeffwan

As mentioned by @Jeffwan, this is not a bug. Close this issue.

kevin85421 avatar Dec 09 '22 18:12 kevin85421