Add default Ray node label info to Ray Pod environment
Why are these changes needed?
This PR adds market-type information for different cloud providers to the Ray pod environment based on the provided nodeSelector value. This PR also adds environment variables to pass region and zone information using downward API (https://github.com/kubernetes/kubernetes/pull/127092). These environment variables will be used in Ray core to set default Ray node labels.
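For illustration, a rough sketch (not the operator's exact output) of the kind of worker Pod this produces for a GKE Spot worker group; the nodeSelector key is provider-specific, and the env var names follow the ones discussed later in this thread:
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # provider-specific selector used to infer the market type
  containers:
  - name: ray-worker
    env:
    - name: RAY_NODE_MARKET_TYPE        # inferred from the nodeSelector above
      value: spot
    # region/zone env vars are injected via the downward API support linked above;
    # the exact fieldRef syntax is omitted here.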
I'll add a comment below with my manual test results with propagating topology.kubernetes.io/region and topology.kubernetes.io/zone on a GKE v1.33 alpha cluster.
Related issue number
https://github.com/ray-project/ray/issues/51564
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
- [x] Unit tests
- [x] Manual tests
- [ ] This PR is not tested :(
cc: @MengjinYan
LGTM. @kevin85421 Can you take a look from KubeRay's perspective?
@kevin85421 Bumping this to see if we can include it in 1.4
@ryanaoleary I chatted with @MengjinYan, and my understanding is that this doesn’t need to be included in v1.4.0. Could you sync with @MengjinYan and let me know if I’m mistaken? Thanks!
Synced offline with @MengjinYan and yeah there's no urgency to include this in v1.4.0, we can wait to include it in the next release. My thought was just that it'd be useful to have this functionality in the soonest stable release for testing, but I can just use the nightly image.
This makes sense. I’ll review it for now. We’ll make a best-effort attempt to include this PR, but there’s no guarantee.
cc @andrewsykim, do you have the bandwidth to review this PR? I will give it a final pass after you approve the PR.
@andrewsykim @kevin85421 Follow up on this PR.
cc: @ryanaoleary
Sorry for the delayed response on this, I've been mulling over it a bit since we're early in the v1.5 cycle anyways.
My biggest question is whether we want this behavior guarded by a feature gate initially. I'm a bit hesitant about auto-populating Ray labels based on a combination of Pod labels and node selectors. It doesn't seem like the most intuitive approach for users, and I'm not sure we want to lock ourselves in if we think there's a better approach later. Mainly I'm wondering if we need to make label configuration more explicit and let users opt in to copying a certain set of pod labels.
Just throwing a bunch of ideas out there for us to brainstorm on the right API:
Users can set arbitrary labels, which provides more structure than just overriding --labels in rayStartParams using a comma-delimited list:
workerGroupSpecs:
- replicas: 1
  ...
  groupName: workergroup
  rayStartParams: {}
  rayletLabels:
    foo: bar
    ray.io/market-type: "spot"
Users can opt into copying all pod labels:
workerGroupSpecs:
- replicas: 1
  ...
  groupName: workergroup
  rayStartParams: {}
  populateRayletLabelsFromPod: true
Users can opt into copying a list of labels, which can include regex:
workerGroupSpecs:
- replicas: 1
  ...
  groupName: workergroup
  rayStartParams: {}
  copyPodLabelsToRaylet:
  - topology.kubernetes.io/**
  - foo.bar.io
For the first option, I think rayletLabels does look cleaner to me, but that functionality is already fully supported through the comma-delimited --labels list in rayStartParams. So yeah, I think this could be good to implement just for user clarity, but it's not currently blocking anything.
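For comparison, a rough sketch of the rayStartParams equivalent today (illustrative only; the exact --labels value format may differ between Ray versions):
workerGroupSpecs:
- replicas: 1
  groupName: workergroup
  rayStartParams:
    labels: "foo=bar,ray.io/market-type=spot"   # comma-delimited key=value pairs passed to ray start as --labels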
For the second and third options, I think we could combine the API fields by just adding copyPodLabelsToRaylet and then accepting - * as an argument to copy all labels. I'd also be fine with implementing both fields separately since they'd use the same functionality. What do you think about the above @MengjinYan?
For this PR though, I think a benefit of automatically populating these default variables using downward API is that it lessens the manual intervention required by the user for the default Raylet labels to get set. I'm also not clear on how a user would explicitly pass a field set using downward API. I think a feature gate is a good idea though to avoid adding unused fields for some users, I'll add a features.SetDefaultLabelVars flag around all the code blocks.
Ray Autoscaler can’t know which zone/region a new Ray Pod will be in, because the zone/region is only determined once the Ray Pod is running.
This is true for the env vars set using downward API, but I think for the case where the user specifies a nodeSelector, we can guarantee that the node will run in the specified region/zone/market-type if it scales up. If I add the labels that we set here to the available node types in the autoscaling config, then I think autoscaling fully works here.
I think the current behavior of label selectors with this PR is as follows:
- The Ray Autoscaler will scale nodes based on only the resources (not labels) requested by tasks/actors.
- Nodes will be scaled with those resources and the default Raylet labels will get set based on the downward API.
- The cluster resource scheduler will attempt to schedule using the label_selector in the decorator of the Task/Actor. If the label selector does not match the default labels that get set for that node, it will just remain pending. However, if the node scaled to satisfy the resource request is of the desired market type, region, or zone requested by the label selector, then it will schedule and run.
I think the above flow would be the current behavior with this PR. While it's not ideal since users can't directly autoscale based on these labels, they can still use autoscaling and constrain where their code runs based on the label selectors. Like you mentioned, users can also edit their K8s Pod spec or manually set the labels themselves for more control. I think the main benefit of going forward with the above behavior is the following case:
- A user defines a RayCluster with multiple worker groups with similar resources, but with different region, zone, or market-type labels and node selectors.
- The user scales up several replicas of the worker groups based on some resource requests.
- The user wants to constrain certain parts of their application code to only run on replicas of a specific worker group of X region, zone, or market-type.
What do you think @MengjinYan? I think the next step for me to get this merge ready will be to work on a user guide for K8s like @kevin85421 mentioned and clarify the e2e autoscaling use case, I'll try to put this out by next week. I'll also fix the merge conflict and add the feature gate.
@ryanaoleary thank you for the detailed reply.
If the label selector does not match the default labels that get set for that node, it will just remain pending. However, if the node scaled to satisfy the resource request is of the desired market type, region, or zone requested by the label selector, then it will schedule and run.
We should not surprise users. The Ray Autoscaler should not scale up a Pod if the pending resource demands cannot be scheduled. Asking users to manually set labels in rayStartParams to ensure that new Pods can accommodate resource demands is a much better approach.
My suggestion:
- Add documentation to educate users on how to use label-based scheduling and autoscaling by configuring the RayCluster CR without changing the KubeRay codebase.
- Collect user feedback to ensure that label-based scheduling is truly useful for users and that it addresses real pain points. After that, we can consider updating the KubeRay codebase.
- We should be careful about this behavior. It seems very difficult to determine which zone, region, or market type a new Ray Pod will belong to, since it may be configured in many different ways (nodeSelector, nodeAffinity, taints/tolerations, etc.). Automatically deriving this information without surprising users is quite challenging. In my opinion, asking users to manually set it is a much better idea and reduces a lot of complexity in both KubeRay and the Ray Autoscaler.
cc @rueian @MengjinYan @jjyao Please ensure that label-based autoscaling behaves correctly.
@edoakes @MengjinYan @ryanaoleary and I agreed on the approach to take here.
Well-known default Ray labels will be set using environment variables from Ray. The value is based on the Pod label using downward API. For example:
- env:
  - name: RAY_NODE_MARKET_TYPE
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['ray.io/market-type']
  - name: RAY_NODE_AVAILABILITY_ZONE
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['ray.io/availability-zone']
  - name: RAY_NODE_REGION
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['ray.io/region']
KubeRay users can then set these labels by simply setting the labels in the Pod template. For example:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gpu
spec:
  rayVersion: '2.46.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        labels:
          ray.io/zone: us-east5-a
          ray.io/region: us-east5
          ray.io/market-type: spot
For specific labels like ray.io/market-type, it is the responsibility of the user to also set the correct nodeSelector and this is something we can document. cc @rueian @Future-Outlier
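For example, a hedged sketch of what that pairing could look like on GKE (the nodeSelector key is provider-specific and the group name is made up):
workerGroupSpecs:
- groupName: spot-workers
  rayStartParams: {}
  template:
    metadata:
      labels:
        ray.io/market-type: spot            # read by Ray via the downward API env vars above
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # ensures the Pod actually lands on Spot capacity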
A quick question, does this require code change? cc @andrewsykim @ryanaoleary
https://github.com/ray-project/kuberay/pull/3699#issuecomment-3335930463
I have the same question as @Future-Outlier. I don’t expect any code changes are required in KubeRay at this moment. We should ask users to configure rayStartParams. We should automate something in KubeRay only when sufficient users adopt them and we observe clear usage patterns.
A quick question, does this require code change?
The code change is that we need to add these env vars using downward API in every Ray pod:
- env:
  - name: RAY_NODE_MARKET_TYPE
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['ray.io/market-type']
  - name: RAY_NODE_AVAILABILITY_ZONE
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['ray.io/availability-zone']
  - name: RAY_NODE_REGION
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['ray.io/region']
This should be safe to add to every RayCluster since older versions of Ray will just ignore the env var and the env var will be empty if the label doesn't exist
The code change is that we need to add these env vars using downward API in every Ray pod:
@andrewsykim my point is that:
- Currently, no users are using label-based scheduling. I believe this is a highly useful feature, but we should confirm user adoption before adding additional complexity to KubeRay. By "additional," I mean that users can already configure this directly using rayStartParams. We should automate something in KubeRay only when sufficient users adopt them and we observe clear usage patterns.
- If we want to automate features in the future, I suggest avoiding mechanisms that automatically pass multiple labels into Ray, as this could cause issues. For example:
  - In KubeRay v1.5, we pass labels A, B, and C to Ray.
  - In KubeRay v1.6, we also want to pass label D to Ray.
  - This could lead to unexpected issues for Ray/KubeRay users, making debugging difficult. Additionally, it would require the Ray Autoscaler to implement different logic for handling autoscaling with labels across various KubeRay versions. This will introduce complexity for compatibility between Ray and KubeRay.
My suggestion:
- Write a doc to educate users to set rayStartParams and no code change in KubeRay for now.
- If we actually want to automate something in the future, we should pick a very simple / stable mechanism. For example, we could consider passing all Kubernetes labels with the prefix ray.io/ to Ray, passing all Kubernetes labels to Ray, or passing all labels specified in rayStartParams to Kubernetes pods. This can avoid the compatibility issues between KubeRay versions and Ray Autoscaler. (A rough sketch of the last idea follows below.)
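A rough sketch of that last mirroring idea, purely for illustration and not an agreed design: the operator would copy each key/value from the --labels list onto the generated Pod's metadata.
rayStartParams:
  labels: "ray.io/market-type=spot,ray.io/region=us-east5"
# ...and the generated Pod would get matching metadata:
metadata:
  labels:
    ray.io/market-type: spot
    ray.io/region: us-east5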
Users still have the ability to set labels in rayStartParams; we're not changing anything here.
Write a doc to educate users to set rayStartParams and no code change in KubeRay for now.
Of course, I believe Ryan and Mengjin are working on something already.
For example, we could consider passing all Kubernetes labels with the prefix ray.io/ to Ray, passing all Kubernetes labels to Ray, or passing all labels specified in rayStartParams to Kubernetes pods. This can avoid the compatibility issues between KubeRay versions and Ray Autoscaler.
These are all great ideas, many of which I also suggested in https://github.com/ray-project/kuberay/pull/3699#issuecomment-3192966952. Agreed we should explore these in the future.
The proposal above is as simple as it gets today: it is ONLY to set environment variables for well-known default Ray labels. Keep in mind that Ray is checking for these env vars already. This is a quality-of-life improvement that doesn't put any risk on the future direction of labelling. I don't think we need user feedback for this particular integration; it will only make things easier for KubeRay users without the risk of breaking compatibility in the future. The added complexity is very minimal, see my comment above.
@kevin85421 you missed some of the discussions last week so maybe missing some context. But keep in mind that there are already env vars for default labels, this PR will be updated to only set those https://github.com/ray-project/ray/blob/c6c31c8b797416770db969f1e1ca250095cc9245/python/ray/includes/common.pxi#L151-L155. We still need to design a more generic approach for labelling outside of default labels, which we can address after getting user feedback
If we want to automate features in the future, I suggest avoiding mechanisms that automatically pass multiple labels into Ray, as this could cause issues. For example:
- In KubeRay v1.5, we pass labels A, B, and C to Ray.
- In KubeRay v1.6, we also want to pass label D to Ray.
- This could lead to unexpected issues for Ray/KubeRay users, making debugging difficult. Additionally, it would require the Ray Autoscaler to implement different logic for handling autoscaling with labels across various KubeRay versions. This will introduce complexity for compatibility between Ray and KubeRay.
The issue still exists?
Hmmm.. I don't see why that issue would exist.
Let's wait for @ryanaoleary to update the PR based on our latest discussion and hopefully reviewing the code will make it more obvious if future compatibility issues will be a problem.
https://github.com/ray-project/kuberay/pull/3699#issuecomment-3348245408
Let me add more details for this one.
The Ray Autoscaler must identify the Ray labels associated with a worker group based on the RayCluster CR to scale pods correctly with the associated labels. I will use an example to explain the compatibility issue I am referring to:
- In KubeRay v1.5, we pass labels A, B, and C to Ray.
- To support autoscaling, Ray v2.N.0 checks whether the RayCluster CR specifies labels A, B, and C.
- In KubeRay v1.6, we also pass label D to Ray.
- To support autoscaling, Ray v2.M.0 checks whether the RayCluster CR specifies labels A, B, C, and D.
If KubeRay introduces changes to how Ray labels are populated, the Ray Autoscaler will require corresponding updates.
Would you mind helping me understand why you don't like asking users to specify rayStartParams directly? It is much safer and better UX.
Would you mind helping me understand why you don't like asking users to specify rayStartParams directly? It is much safer and better UX.
We are still documenting how to use rayStartParams directly to update labels. This change is only simplifying behavior for well-known default labels in Ray. It's not mutually exclusive.
We discussed the autoscaling behavior as well. We don't intend to change autoscaler behavior for labels just yet and this PR will not change anything for autoscaling. But Ryan has a doc for future changes. @ryanaoleary can you share the doc with Kai-Hsun as well please?
Hi, @ryanaoleary can you also share the doc to @rueian and me, too?
I agree with what @kevin85421 said. In the future, if we add a new label in KubeRay 1.6.0, we’ll also have to update Ray’s code to read that environment variable, right? That will greatly increase complexity.
also share the doc to @rueian and me, too?
@Future-Outlier Sorry for the delay, just shared.
From discussion with @Future-Outlier and @rueian:
Using rayStartParams is not a foolproof solution for autoscaling because labels from rayStartParams can be read from file --labels-file or values of the labels can be rendered from env vars --labels=var1=$var1_value. In addition, Ray already supports env vars like RAY_NODE_ZONE and RAY_NODE_REGION that users can set, which the autoscaler will have to check.
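To illustrate (a hedged sketch, not actual CR content), these are the two patterns that make the labels opaque when only reading the CR:
rayStartParams:
  labels-file: /etc/ray/labels.yaml    # hypothetical path; the values live in a file, not in the CR
# or
rayStartParams:
  labels: "region=$MY_REGION"          # the value only resolves inside the running container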
I don't object to deferring this PR, but to be clear, asking users to set labels in rayStartParams will not solve the autoscaling gaps. We should view label-based autoscaling and default labelling as two separate problems to solve.
because labels from rayStartParams can be read from file --labels-file or values of the labels can be rendered from env vars --labels=var1=$var1_value. In addition, Ray already supports env vars like RAY_NODE_ZONE and RAY_NODE_REGION that users can set
This means that labels-file and env vars are incorrect APIs (especially labels-file). We should make sure the Ray Autoscaler can easily determine which Pod will have which Ray labels from the RayCluster CR.
cc @edoakes @jjyao please revisit the APIs so that we can avoid adding too much complexity in Ray Autoscaler.
cc @edoakes @jjyao please revisit the APIs so that we can avoid adding too much complexity in Ray Autoscaler.
This is the core reason why we need to make some form of change in the KubeRay CR and can't do this entirely transparently based on rayStartParams. The proposal above to auto-populate the builtin set of ray.io/ labels from the pod spec labels is the most basic form of it that doesn't require adding a new API/field to the CR. The autoscaler will just scan the pod spec for those whitelisted labels.
This won't solve the problem for custom/user-specified labels though. For that, we'd need to introduce a more explicit API such as a labels field in the head/worker group spec itself. The idea is to start with the minimal thing listed above without requiring a CR change, and taking it from there.
The amount of complexity being added for any of the options discussed in this thread is very minor, and the feature will be alpha in both Ray and KubeRay before stabilizing, so there is no "one way door" and I see no harm in starting with the proposed option. If we want to jump directly to a top-level labels field (or comparable alternative), that's fine with me too. It has the benefit of being very explicit.
Agreed with everything @edoakes said, the level of complexity is pretty low and it's not a one way door.
If it helps address some concerns from others, we can also guard the default labeling with a feature gate so we can remove it / break it in the future, but I don't see it being likely / necessary since the feature is already opt-in via pod labels.
@edoakes thank you for the reply!
This won't solve the problem for custom/user-specified labels though.
What's "this" referring to?
For that, we'd need to introduce a more explicit API such as a labels field in the head/worker group spec itself.
Why can't we reuse rayStartParams? This doesn't require a CRD change.
The idea is to start with the minimal thing listed above without requiring a CR change, and taking it from there.
Why does it require a CRD change? Can we reuse rayStartParams?
The amount of complexity being added for any of the options discussed in this thread is very minor
Have you read https://github.com/ray-project/kuberay/pull/3699#issuecomment-3348651942? The contract inconsistency between KubeRay / Ray Autoscaler was a real pain before. That's why I am very uncomfortable with the potential frequent contract changes with this PR if we incrementally auto-populate more and more labels.
Some solutions which provide stable contract between KubeRay and Ray Autoscaler I can accept:
- Go with https://github.com/ray-project/kuberay/pull/3699#issuecomment-3347461719, and we promise not to incrementally add new default labels in the future.
- Populate all Pod labels starting with ray.io into rayStartParams directly.
- Don't change any code in KubeRay and ask users to set rayStartParams by themselves.