
Question: Autoscaler v1 vs v2 configuration and performance

Open testinfected opened this issue 6 months ago • 39 comments

Hey guys,

I'm evaluating Ray on GKE with KubeRay for our company and have run a few benchmarks; so far it's very promising.

I've been running Ray 2.46 with Kuberay 1.3.2 on a GKE cluster with GKE autoscaling disabled, so all my nodes are up.

I'd like to benchmark autoscaler v1 and v2. The first batch of tests I ran used the following config:

autoscalerOptions:
  version: 2

I thought that would be enough to trigger the use of autoscaler v2, but looking at the autoscaler logs it's apparently using the Ray StandardAutoscaler and not the new autoscaler. Is it expected that you also have to set

headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.X.Y
          # Include the environment variable.
          env:
            - name: RAY_enable_autoscaler_v2
              value: "1"

as indicated in https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html to enable the autoscaler v2?

I must confess I'm a bit confused by the documentation.

Also, from my initial tests with the v2, it scales much slower than the v1, and reports infeasible resource requests. Am I doing something wrong here?

Thanks in advance

testinfected avatar Jun 17 '25 15:06 testinfected

Hi @testinfected, thanks for reporting the issue!

Regarding this config:

autoscalerOptions:
  version: 2

I believe this will only be available starting in v1.4. Setting RAY_enable_autoscaler_v2 should work; you may also need to set restartPolicy to Never.

@rueian @davidxia @ryanaoleary any ideas?

andrewsykim avatar Jun 17 '25 16:06 andrewsykim

Thanks @andrewsykim for the quick reply, very much appreciated.

I will run a batch of tests with v1 vs v2 and share the results. I was expecting on-par performance from v2, but maybe it's too early? I see it's still considered beta.

testinfected avatar Jun 17 '25 16:06 testinfected

I believe this will only be available starting in v1.4. Setting RAY_enable_autoscaler_v2 should work; you may also need to set restartPolicy to Never.

That's correct.

If you’re using KubeRay < 1.4.0, enable V2 by setting the RAY_enable_autoscaler_v2 environment variable in the head and using restartPolicy: Never on head and all worker groups.

— https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#:~:text=If%20you%E2%80%99re%20using%20KubeRay%20%3C%201.4.0%2C%20enable%20V2%20by%20setting%20the%20RAY_enable_autoscaler_v2%20environment%20variable%20in%20the%20head%20and%20using%20restartPolicy%3A%20Never%20on%20head%20and%20all%20worker%20groups

Also make sure to set

spec:
  enableInTreeAutoscaling: true

Lmk if that last part isn't clear from the docs, and we can emphasize.
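
Putting it together, a minimal sketch of what the pre-1.4.0 setup looks like (trimmed down; keep your own images, resources, and rayStartParams):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: autoscaling-cluster
spec:
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        restartPolicy: Never          # required for autoscaler v2 on KubeRay < 1.4.0
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          env:
          - name: RAY_enable_autoscaler_v2   # enables the v2 autoscaler on the head
            value: "1"
  workerGroupSpecs:
  - groupName: workers
    replicas: 0
    minReplicas: 0
    maxReplicas: 10
    rayStartParams: {}
    template:
      spec:
        restartPolicy: Never          # also set Never on every worker group
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0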

davidxia avatar Jun 17 '25 17:06 davidxia

If you’re using KubeRay < 1.4.0, enable V2 by setting the RAY_enable_autoscaler_v2 environment variable in the head and using restartPolicy: Never on head and all worker groups.

— https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#:~:text=If%20you%E2%80%99re%20using%20KubeRay%20%3C%201.4.0%2C%20enable%20V2%20by%20setting%20the%20RAY_enable_autoscaler_v2%20environment%20variable%20in%20the%20head%20and%20using%20restartPolicy%3A%20Never%20on%20head%20and%20all%20worker%20groups

Also make sure to set

spec:
  enableInTreeAutoscaling: true

Lmk if that last part isn't clear from the docs, and we can emphasize.

My bad, misread that part. All clear indeed.

Any idea how the v1 vs v2 autoscalers are expected to perform in terms of scale-up speed?

testinfected avatar Jun 17 '25 17:06 testinfected

@testinfected are you able to share some logs from both the v1 and v2 autoscaler for us to look at? I'm assuming everything else in the cluster is the same and you mentioned both clusters have the nodes pre-provisioned. Is that correct?

andrewsykim avatar Jun 17 '25 17:06 andrewsykim

@testinfected are you able to share some logs from both the v1 and v2 autoscaler for us to look at? I'm assuming everything else in the cluster is the same and you mentioned both clusters have the nodes pre-provisioned. Is that correct?

Yes correct. I will share my test results.

testinfected avatar Jun 17 '25 17:06 testinfected

Any idea how the v1 vs v2 autoscalers are expected to perform in terms of scale-up speed?

I don't think there should be big differences. I just tested the two YAMLs: https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler.yaml https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml

with a small workload that launches 5 actors simultaneously, and in both cases, the workers are ready in 0.3 seconds.

rueian avatar Jun 17 '25 17:06 rueian

Any idea how the v1 vs v2 autoscalers are expected to perform in terms of scale-up speed?

I don't think there should be big differences. I just tested the two YAMLs: https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler.yaml https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml

with a small workload that launches 5 actors simultaneously, and in both cases, the workers are ready in 0.3 seconds.

Are you sure you tested with autoscaler v2? Using version: 2 only works with KubeRay 1.4.0, which I have confirmed.

testinfected avatar Jun 27 '25 16:06 testinfected

The autoscaler v2 looks promising, although I'm still trying to understand its behavior as I've seen inconsistencies when running the same test several times.

The requested instances frequently stay in the QUEUED state for several minutes! For example, look at this monitor log:

autoscaler.log

you will see that after the 200 instances are QUEUED, nothing happens for more than 4 whole minutes?!?

I'm using KubeRay 1.4.0. Here's the cluster config (the job is a simple sleep workload submitted via the Job API):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: autoscaling-cluster
spec:
  rayVersion: '2.46.0'
  enableInTreeAutoscaling: true
  # Configuration overrides for the autoscaler.
  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    securityContext: {}
    env: 
    - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
      value: "1000"
    resources:
      limits:
        cpu: 1
        memory: 5Gi
      requests:
        cpu: 1
        memory: 5Gi
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
      # No actor or task should run on the head pod
      num-cpus: "0"
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          # Setting requests=limits is recommended with Ray. K8s limits are used for Ray-internal
          # resource accounting. K8s requests are not used by Ray.
          resources:
            limits:
              cpu: 25 
              memory: 20Gi
            requests:
              cpu: 25 
              memory: 20Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  - replicas: 0
    minReplicas: 0
    maxReplicas: 350
    groupName: large-group
    rayStartParams: {}
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 30 
              memory: 25Gi
            requests:
              cpu: 30 
              memory: 25Gi
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}

Any idea of what's happening during those 4 minutes?

Seems like if I run the test a second time, on the same Ray Cluster (only head pod running), the scaling happens almost immediately, but I'll do some more runs to be sure.

I have seen the same behavior with KubeRay 1.3.2 btw.

I wasn't able to find where in the Ray code the transition from QUEUED -> REQUESTED happens, or what triggers it. Could you point me in the right direction?

Thanks

testinfected avatar Jun 27 '25 16:06 testinfected

Thanks for sharing @testinfected

@chiayi @ryanaoleary could you help take a look please?

andrewsykim avatar Jun 27 '25 17:06 andrewsykim

I wasn't able to find where in the Ray code the transition from QUEUED -> REQUESTED happens, or what triggers it. Could you point me in the right direction?

I guess I found it finally: https://github.com/ray-project/ray/blob/7bead4f705c2d88dfb9b3ba0b5fd69db6fe56a6b/python/ray/autoscaler/v2/instance_manager/reconciler.py#L738 right?

Seems like the reconciliation loop is stuck for some reason during those 4 minutes, since we don't get any status updates in the log (they should come every AUTOSCALER_UPDATE_INTERVAL_S, i.e. 5s).

testinfected avatar Jun 27 '25 17:06 testinfected

Seems like if I run the test a second time, on the same Ray Cluster (only head pod running), the scaling happens almost immediately, but I'll do some more runs to be sure.

I can confirm that whenever I rerun the test with the head pod still up, I don't see that 4 min delay. Pod scaling starts right away.

One thing that may be relevant: it starts with a scale request of 1 node, is stuck for 4 minutes, then does the scale request of 200. With a warm head node, I see the 200-node scale request right away.

testinfected avatar Jun 27 '25 20:06 testinfected

Hey guys, here's some additional info.

I reran the same test on the same cluster, this time limiting the max concurrent launches to 50. You can see from the ramp-up and autoscaler log files that there's a wait of several minutes before two of the scaling updates. This is consistent with the initial test, where there was a 4 min delay between the initial request of 1 worker node and the scale-up to 200 worker nodes.

autoscaler.log

rampup.txt

I've also attached the driver log file; you'll see there are some failures. One I see quite often is an assertion failure on the queue items.

job-driver.log

Thanks for looking into this.

testinfected avatar Jun 30 '25 18:06 testinfected

@testinfected Looking into this now, adding some logging around the _handle_instances_launch and update_instance_manager_state functions to see exactly where it might be hanging. Our initial thought is that this function could be failing due to the _instance_storage not updating for the large number of instances. I'll try to verify if this is where the issue is occurring through some additional logging.

Also, do you still have access to the KubeRay logs from when this issue was occurring? There's this line in the reconciler about instances being stuck in a QUEUED state, so it'd be good to verify that the controller isn't waiting on GKE for suitable nodes (although I think you mentioned that eventually all Ray nodes scale up, so this probably isn't the case).

ryanaoleary avatar Jul 01 '25 00:07 ryanaoleary

@testinfected Looking into this now, adding some logging around the _handle_instances_launch and update_instance_manager_state functions to see exactly where it might be hanging. Our initial thought is that this function could be failing due to the _instance_storage not updating for the large number of instances. I'll try to verify if this is where the issue is occurring through some additional logging.

Great, looking forward to what you can find

Also, do you still have access to the KubeRay logs from when this issue was occurring? There's this line in the reconciler about instances being stuck in a QUEUED state, so it'd be good to verify that the controller isn't waiting on GKE for suitable nodes (although I think you mentioned that eventually all Ray nodes scale up, so this probably isn't the case).

I've seen this error when I use an AUTOSCALER_MAX_CONCURRENT_LAUNCHES value small enough to require several scaling steps. That error doesn't show up if I set it to a value greater than the required number of nodes (see the attached log file).

kuberay.log

testinfected avatar Jul 01 '25 19:07 testinfected

That error doesn't show up if I set it to a value greater than the required number of nodes (see the attached log file).

Of course, if the reconciliation loop is stuck (which seems to be the case), we won't see the error in the autoscaler logs.

testinfected avatar Jul 02 '25 13:07 testinfected

Got some additional info.

I'm still using the same test, which essentially looks like this:

import time
import ray

ray.init()

sleep_sec = 300
num_tasks = 3000

test_start = time.time()

@ray.remote(num_cpus=1)
def sleep_task(sleep_sec: int):
    time.sleep(sleep_sec)

res_refs = [sleep_task.remote(sleep_sec) for _ in range(num_tasks - 1)]
res = ray.get(res_refs)

test_end = time.time()
run_time = test_end - test_start

print(
    f"Scale test with {num_tasks} tasks ran in"
    + f" {run_time:.1f} secs.\n"
)

I'm running a 350-node pool with 15 CPUs per node available to Ray workers (so ~5k CPUs). If I run the test with num_tasks = 3000, I don't see a delay when scaling up. If I use 4000 tasks, then I start seeing the problem. In both cases I had set AUTOSCALER_MAX_CONCURRENT_LAUNCHES to 100 this time.
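
For reference, the rough sizing math behind those two runs (assuming 1 CPU per task, as in the script above):

import math

cpus_per_worker = 15          # CPUs available to Ray on each worker node
for num_tasks in (3000, 4000):
    workers_needed = math.ceil(num_tasks / cpus_per_worker)
    print(f"{num_tasks} tasks -> {workers_needed} workers (maxReplicas is 350)")

# 3000 tasks -> 200 workers, 4000 tasks -> 267 workers; both fit well under
# maxReplicas, so the delay doesn't come from hitting the group's upper bound.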

testinfected avatar Jul 07 '25 18:07 testinfected

We used py-spy on the process to see where it's stuck, and here's what we got:

Python v3.9.22 (/home/ray/anaconda3/bin/python3.9)                                                                                                                     
                                                                                                                                                                       
Thread 1 (active+gil): "MainThread"                                                                                                                                    
    _try_schedule_one (ray/autoscaler/v2/scheduler.py:547)                                                                                                             
    try_schedule (ray/autoscaler/v2/scheduler.py:419)                                                                                                                  
    _sched_best_node (ray/autoscaler/v2/scheduler.py:1484)                                                                                                             
    _try_schedule (ray/autoscaler/v2/scheduler.py:1378)                                                                                                                
    _sched_resource_requests (ray/autoscaler/v2/scheduler.py:1238)                                                                                                     
    schedule (ray/autoscaler/v2/scheduler.py:871)                                                                                                                      
    _scale_cluster (ray/autoscaler/v2/instance_manager/reconciler.py:1114)                                                                                             
    _step_next (ray/autoscaler/v2/instance_manager/reconciler.py:274)                                                                                                  
    reconcile (ray/autoscaler/v2/instance_manager/reconciler.py:119)                                                                                                   
    update_autoscaling_state (ray/autoscaler/v2/autoscaler.py:184)                                                                                                     
    _run (ray/autoscaler/v2/monitor.py:169)                                                                                                                            
    run (ray/autoscaler/v2/monitor.py:182)                                                                                                                             
    run_kuberay_autoscaler (ray/autoscaler/_private/kuberay/run_autoscaler.py:69)                                                                                      
    kuberay_autoscaler (ray/scripts/scripts.py:2526)                                                                                                                   
    invoke (click/core.py:783)                                                                                                                                         
    invoke (click/core.py:1434)                                                                                                                                        
    invoke (click/core.py:1688)                                                                                                                                        
    main (click/core.py:1078)                                                                                                                                          
    __call__ (click/core.py:1157)                                                                                                                                      
    main (ray/scripts/scripts.py:2800)                                                                                                                                 
    <module> (ray:8)                                                                                                                                                   
Thread 188 (idle): "Thread-1"                                                                                                                                          
    select (selectors.py:416)                                                                                                                                          
    serve_forever (socketserver.py:232)                                                                                                                                
    run (threading.py:917)                                                                                                                                             
    _bootstrap_inner (threading.py:980)                                                                                                                                
    _bootstrap (threading.py:937)                                                                                                                                      

Running py-spy top --pid 1 on the head pod gave us this:

Total Samples 400
GIL: 100.00%, Active: 100.00%, Threads: 3

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                              
 51.00%  96.00%    2.06s     3.77s   _try_schedule_one (ray/autoscaler/v2/scheduler.py)
 15.00%  15.00%   0.720s    0.720s   __iter__ (_collections_abc.py)
 16.00%  16.00%   0.510s    0.510s   _fits (ray/autoscaler/_private/resource_demand_scheduler.py)
  4.00% 100.00%   0.200s     3.97s   try_schedule (ray/autoscaler/v2/scheduler.py)
  4.00%   6.00%   0.190s    0.240s   get_available_resources (ray/autoscaler/v2/scheduler.py)
  8.00%   8.00%   0.180s    0.230s   keys (_collections_abc.py)
  2.00%   2.00%   0.050s    0.050s   __hash__ (enum.py)
  0.00%   0.00%   0.050s    0.050s   __init__ (_collections_abc.py)
  0.00%   0.00%   0.020s    0.030s   message_to_dict (ray/_private/protobuf_compat.py)
  0.00%   0.00%   0.010s    0.010s   _RegularMessageToJsonObject (protobuf/json_format.py)
  0.00%   0.00%   0.010s    0.010s   _worker (concurrent/futures/thread.py)
  0.00%   0.00%   0.010s    0.010s   add_sched_request (ray/autoscaler/v2/scheduler.py)
  0.00%   0.00%   0.000s    0.010s   _bootstrap (threading.py)
  0.00% 100.00%   0.000s     4.00s   run_kuberay_autoscaler (ray/autoscaler/_private/kuberay/run_autoscaler.py)
  0.00%   0.00%   0.000s    0.010s   _MessageToJsonObject (protobuf/json_format.py)
  0.00% 100.00%   0.000s     4.00s   kuberay_autoscaler (ray/scripts/scripts.py)
  0.00% 100.00%   0.000s     4.00s   _sched_resource_requests (ray/autoscaler/v2/scheduler.py)
  0.00% 100.00%   0.000s     4.00s   _sched_best_node (ray/autoscaler/v2/scheduler.py)
  0.00% 100.00%   0.000s     4.00s   _scale_cluster (ray/autoscaler/v2/instance_manager/reconciler.py)
  0.00% 100.00%   0.000s     4.00s   _run (ray/autoscaler/v2/monitor.py)
  0.00%   0.00%   0.000s    0.010s   run (threading.py)
  0.00% 100.00%   0.000s     4.00s   invoke (click/core.py)
  0.00% 100.00%   0.000s     4.00s   <module> (ray)
  0.00% 100.00%   0.000s     4.00s   run (ray/autoscaler/v2/monitor.py)
  0.00% 100.00%   0.000s     4.00s   __call__ (click/core.py)
  0.00%   0.00%   0.000s    0.010s   MessageToDict (protobuf/json_format.py)
  0.00% 100.00%   0.000s     4.00s   schedule (ray/autoscaler/v2/scheduler.py)
  0.00%   0.00%   0.000s    0.030s   <listcomp> (ray/autoscaler/v2/utils.py)
  0.00% 100.00%   0.000s     4.00s   _step_next (ray/autoscaler/v2/instance_manager/reconciler.py)
  0.00% 100.00%   0.000s     4.00s   _try_schedule (ray/autoscaler/v2/scheduler.py)
  0.00% 100.00%   0.000s     4.00s   main (click/core.py)
  0.00%   0.00%   0.000s    0.030s   to_dict_list (ray/autoscaler/v2/utils.py)
  0.00%   0.00%   0.000s    0.030s   to_dict (ray/autoscaler/v2/utils.py)
  0.00% 100.00%   0.000s     4.00s   main (ray/scripts/scripts.py)
  0.00% 100.00%   0.000s     4.00s   reconcile (ray/autoscaler/v2/instance_manager/reconciler.py)
  0.00%   0.00%   0.000s    0.010s   _bootstrap_inner (threading.py)
  0.00% 100.00%   0.000s     4.00s   update_autoscaling_state (ray/autoscaler/v2/autoscaler.py)



testinfected avatar Jul 14 '25 16:07 testinfected

Any idea of what's happening?

testinfected avatar Jul 14 '25 16:07 testinfected

@chiayi @ryanaoleary can you take a look at the stack trace in https://github.com/ray-project/kuberay/issues/3794#issuecomment-3070125687? cc @rueian as well

andrewsykim avatar Jul 15 '25 16:07 andrewsykim

Just did a run and it took a couple of minutes for the cluster to actually start scaling. I used that time to record a profile of the autoscaler process in speedscope format.

profile.speedscope.json

testinfected avatar Jul 15 '25 19:07 testinfected

Yeah, like you said, additional logging around _try_schedule just showed that the reconciler was stuck after "launching" (QUEUED->REQUESTED) one node and before transitioning each instance to the next state. I replicated the same issue in my cluster following these steps:

  1. Set up a GKE Autopilot cluster and install KubeRay v1.4
  2. Create a RayCluster with the following spec:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: autoscaling-cluster
spec:
  # rayVersion: '2.46.0'
  enableInTreeAutoscaling: true
  # Configuration overrides for the autoscaler.
  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    securityContext: {}
    env: 
    - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
      value: "100"
    resources:
      limits:
        cpu: 1
        memory: 5Gi
      requests:
        cpu: 1
        memory: 5Gi
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
      # No actor or task should run on the head pod
      num-cpus: "0"
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray
          # Setting requests=limits is recommended with Ray. K8s limits are used for Ray-internal
          # resource accounting. K8s requests are not used by Ray.
          resources:
            limits:
              cpu: 25 
              memory: 20Gi
            requests:
              cpu: 25 
              memory: 20Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  - replicas: 0
    minReplicas: 0
    maxReplicas: 350
    groupName: large-group
    rayStartParams: {}
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray
          resources:
            limits:
              cpu: 15
              memory: 25Gi
            requests:
              cpu: 15
              memory: 25Gi
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
kubectl apply -f ray-test-upscale.yaml

RayCluster created:

kubectl get rayclusters
NAME                  DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
autoscaling-cluster                                         25     20Gi     0               5m41s
  3. Create the following test Python script and submit it as a Ray job:
import time
import ray

ray.init()

sleep_sec = 300
num_tasks = 4000

test_start = time.time()

@ray.remote(num_cpus=1)
def sleep_task(sleep_sec: int):
    time.sleep(sleep_sec)

res_refs = [sleep_task.remote(sleep_sec) for _ in range(num_tasks - 1)]
res = ray.get(res_refs)

test_end = time.time()
run_time = test_end - test_start

print(
    f"Scale test with {num_tasks} tasks ran in"
    + f" {run_time:.1f} secs.\n"
)
kubectl port-forward svc/autoscaling-cluster-head-svc 8265:8265 2>&1 >/dev/null &

ray job submit --submission-id upscale-test-job-1 --working-dir . --runtime-env-json='{"excludes": ["myenv"]}' --address http://localhost:8265 -- python upscale-test.py

Job submitted and running:


Logs show a hang for ~5 minutes after scaling up 1 Pod (REQUESTED->ALLOCATED state for cloud instance):

2025-07-16 22:47:49,395 - INFO - New instance QUEUED (id=900f8ce7-ba70-4c87-8ffb-17ab36eb2c67, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-16 22:47:49,395	INFO instance_manager.py:246 -- New instance QUEUED (id=900f8ce7-ba70-4c87-8ffb-17ab36eb2c67, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-16 22:47:49,400 - INFO - Update instance QUEUED->REQUESTED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id ea0eef16-7c2a-4330-b4f5-815291c071d7
2025-07-16 22:47:49,400	INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id ea0eef16-7c2a-4330-b4f5-815291c071d7
2025-07-16 22:47:49,401 - INFO - Refreshing K8s API client token and certs.
2025-07-16 22:47:49,401	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-07-16 22:47:49,448 - INFO - Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:49,448	INFO cloud_provider.py:463 -- Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:49,458 - INFO - Fetched pod data at resource version 1752731268950348000.
2025-07-16 22:47:49,458	INFO cloud_provider.py:481 -- Fetched pod data at resource version 1752731268950348000.
2025-07-16 22:47:49,459 - INFO - Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'large-group': 1}), workers_to_delete=defaultdict(<class 'list'>, {}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())
2025-07-16 22:47:49,459	INFO cloud_provider.py:331 -- Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'large-group': 1}), workers_to_delete=defaultdict(<class 'list'>, {}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())
2025-07-16 22:47:54,519 - INFO - Refreshing K8s API client token and certs.
2025-07-16 22:47:54,519	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-07-16 22:47:54,561 - INFO - Calculating hashes for file mounts and ray commands.
2025-07-16 22:47:54,561	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-07-16 22:47:54,602 - INFO - Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:54,602	INFO cloud_provider.py:463 -- Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:54,611 - INFO - Fetched pod data at resource version 1752731271151951001.
2025-07-16 22:47:54,611	INFO cloud_provider.py:481 -- Fetched pod data at resource version 1752731271151951001.
2025-07-16 22:47:54,614 - INFO - Update instance REQUESTED->ALLOCATED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): allocated unassigned cloud instance autoscaling-cluster-large-group-worker-5dcdl
2025-07-16 22:47:54,614	INFO instance_manager.py:262 -- Update instance REQUESTED->ALLOCATED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): allocated unassigned cloud instance autoscaling-cluster-large-group-worker-5dcdl
2025-07-16 22:52:45,597 - INFO - Update instance QUEUED->REQUESTED (id=4e21d714-dd14-4e85-aef6-850c588878d5, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
2025-07-16 22:52:45,597	INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=4e21d714-dd14-4e85-aef6-850c588878d5, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
2025-07-16 22:52:45,597 - INFO - Update instance QUEUED->REQUESTED (id=ad9a3a4c-c8b7-4c17-bc11-8cdcd2ee323b, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
2025-07-16 22:52:45,597	INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=ad9a3a4c-c8b7-4c17-bc11-8cdcd2ee323b, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6

I think the issue is this block of code, which causes a single initial node to launch and allocate a Pod, waiting another iteration before scaling up to AUTOSCALER_MAX_CONCURRENT_LAUNCHES and hanging here as the large number of requests each try to schedule on the one existing node (which is supported by the py-spy top profile showing the large amount of time spent here):

num_desired_to_upscale = max(
  1,
  math.ceil(
    upscaling_speed
    * (
       len(requested_instances_for_type)
       + len(allocated_instances_for_type)
      )
  ),
)

Since on a cold start the autoscaler will always compute a launch of one node, we end up spending a lot of time trying to fit each of the other 3999 requests on this one node (I believe with O(N² * M) complexity). This could explain why the long hang occurs for 4000 tasks but not for 3000, with the autoscaler's 1 CPU getting overwhelmed by the polynomial computation.
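
To make the cold-start behavior concrete, here's a minimal sketch (not the actual Ray code, just the formula above evaluated in isolation):

import math

# On a cold start there are no requested or allocated instances yet, so the
# formula always evaluates to 1, regardless of upscaling_speed or demand.
upscaling_speed = 1000
requested_instances_for_type = []
allocated_instances_for_type = []

num_desired_to_upscale = max(
    1,
    math.ceil(
        upscaling_speed
        * (len(requested_instances_for_type) + len(allocated_instances_for_type))
    ),
)
print(num_desired_to_upscale)  # prints 1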

An easy mitigation for this issue would just be to increase the amount of CPU for the autoscaler container:

  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: Always
    securityContext: {}
    env: 
    - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
      value: "100"
    resources:
      limits:
        cpu: 12
        memory: 5Gi
      requests:
        cpu: 12
        memory: 5Gi

With the above spec and the same manual testing steps, I did not encounter the ~5-minute hang. Besides the above mitigation, I think the way num_desired_to_upscale is calculated is incorrect, since limiting the initial scaling event to 1 instance doesn't make sense for the Default or Aggressive upscalingMode. I just created a PR with a fix: https://github.com/ray-project/ray/pull/54699

I did not encounter the hang on cold-start when using this image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray:fix-upscaling

# some logging I added
2025-07-17 01:12:19,116 - INFO - New instance QUEUED (id=b0326768-c385-48da-9c41-031332c70c28, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-17 01:12:19,116	INFO instance_manager.py:247 -- New instance QUEUED (id=b0326768-c385-48da-9c41-031332c70c28, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-17 01:12:19,117 - INFO - Computing instances to launch. State: 100 queued, 0 running, upscaling_speed: 1000.num_desired_to_upscale: 1000.Max concurrent launches: 100.
2025-07-17 01:12:19,117	INFO reconciler.py:860 -- Computing instances to launch. State: 100 queued, 0 running, upscaling_speed: 1000.num_desired_to_upscale: 1000.Max concurrent launches: 100.
2025-07-17 01:12:19,118 - INFO - Update instance QUEUED->REQUESTED (id=d2c48cd3-7171-46f1-89e9-9a2113552700, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
2025-07-17 01:12:19,118	INFO instance_manager.py:263 -- Update instance QUEUED->REQUESTED (id=d2c48cd3-7171-46f1-89e9-9a2113552700, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
2025-07-17 01:12:19,118 - INFO - Update instance QUEUED->REQUESTED (id=9cb496af-85f4-4424-84d7-cec769940b8d, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
2025-07-17 01:12:19,118	INFO instance_manager.py:263 -- Update instance QUEUED->REQUESTED (id=9cb496af-85f4-4424-84d7-cec769940b8d, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554

ryanaoleary avatar Jul 17 '25 08:07 ryanaoleary

I think another part of the issue, besides the initial fix I included above, is that the resource requests passed to try_schedule and iterated over here should be grouped when identical. This would reduce the time complexity of trying to fit resource requests onto in-flight nodes, since some logging I added currently shows:

2025-07-17 01:12:14,935 - INFO - Fetched pod data at resource version 1752739930501757000.
2025-07-17 01:12:14,935	INFO cloud_provider.py:484 -- Fetched pod data at resource version 1752739930501757000.
2025-07-17 01:12:14,960 - INFO - Trying to schedule # resource requests: 3999
2025-07-17 01:12:14,960	INFO scheduler.py:425 -- Trying to schedule # resource requests: 3999
2025-07-17 01:12:14,960 - INFO - Starting try schedule

for each node. I think we can add a check here to first group requests by shape and then check whether that shape fits on a given node (rather than checking each individual request). This should reduce the number of calls to _try_schedule_one, which are causing the autoscaler to hang. What do you think @rueian?
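
Roughly what I have in mind, as a hypothetical sketch of the grouping idea (not the actual scheduler code; the request shapes here are made up):

from collections import Counter

# Illustrative request list: 3999 identical 1-CPU requests plus one other shape.
resource_requests = [{"CPU": 1}] * 3999 + [{"CPU": 2, "GPU": 1}]

def shape_key(request):
    # Turn the resource dict into a hashable key so identical shapes collapse.
    return tuple(sorted(request.items()))

grouped = Counter(shape_key(r) for r in resource_requests)
for shape, count in grouped.items():
    # The scheduler would check once whether this shape fits on a node, then
    # account for `count` copies, instead of 3999 separate fitting attempts.
    print(dict(shape), "x", count)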

ryanaoleary avatar Jul 17 '25 09:07 ryanaoleary

Great findings @ryanaoleary. Your explanation makes perfect sense, thanks.

As I mentioned in my first message, we're currently evaluating Ray for running large-scale AI workloads, so having a resolution to this issue is really good news.

FYI a critical blocker for us right now is https://github.com/ray-project/ray/issues/54321. If you have any idea of what could be causing that other issue, that would be great!

testinfected avatar Jul 17 '25 13:07 testinfected

Sounds good, I'll start taking a look at that issue as well. Another comment on the autoscaler issue: I believe there's a known issue where setting the Kubernetes Pod resources.limits leads to degraded performance. I haven't tested it yet, but removing the limits block and just leaving requests under autoscalerOptions.resources could help improve the performance you're seeing.
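
For example, something like this (an untested sketch based on your spec above, with only requests set):

autoscalerOptions:
  version: v2
  upscalingMode: Default
  idleTimeoutSeconds: 60
  imagePullPolicy: IfNotPresent
  securityContext: {}
  env:
  - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
    value: "100"
  resources:
    requests:        # no limits block, so the autoscaler container isn't CPU-throttled
      cpu: 1
      memory: 5Gi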

ryanaoleary avatar Jul 17 '25 17:07 ryanaoleary

I suggested adding a way to set debug mode on the asyncio event loop to help understand what's happening. See https://github.com/ray-project/ray/issues/54321#issuecomment-3062707228

testinfected avatar Jul 17 '25 20:07 testinfected

Another question: do you have more comprehensive developer documentation than what's available here: https://docs.ray.io/en/latest/ray-contribute/development.html

I'd like to be able to build a Ray Docker image locally to help with troubleshooting. It's not clear to me, after reading the docs and looking at the code, what's required to build an image after making some changes to the Python code.

Thanks!

testinfected avatar Jul 17 '25 20:07 testinfected

Another question: do you have more comprehensive developer documentation than what's available here: https://docs.ray.io/en/latest/ray-contribute/development.html

I'd like to be able to build a Ray Docker image locally to help with troubleshooting. It's not clear to me, after reading the docs and looking at the code, what's required to build an image after making some changes to the Python code.

Thanks!

@testinfected I don't see a guide currently in the Ray docs for this, but these are my steps for building a dev image to test with:

  1. I follow the steps here to set up a virtual environment. I also usually install these dependencies and a Ray dev image following these steps: https://docs.ray.io/en/latest/ray-contribute/development.html#installing-additional-dependencies-for-development so that I can run tests.
  2. Create a Dockerfile like this (one currently exists here):
# Use the latest Ray master as base.
FROM rayproject/ray:nightly-py310
# Invalidate the cache so that fresh code is pulled in the next step.
ARG BUILD_DATE
# Retrieve your development code.
ADD . ray
# Install symlinks to your modified Python code.
RUN python ray/python/ray/setup-dev.py -y
  3. Run a script with the following to build/push the image. Make sure your dev branch is up to date with the latest changes from master before building the image (since it uses your local Python code):
set -x
RAY_IMAGE=<docker repo you want to push to and image tag>
docker image rm $RAY_IMAGE

pushd ../../../..
docker build --progress=plain --build-arg BUILD_DATE="$(date +%Y-%m-%d:%H:%M:%S)" -t $RAY_IMAGE -f ./<YOUR_FOLDER>/Dockerfile . || exit
popd || exit

docker image push $RAY_IMAGE

I then replace the image used in my RayCluster with the one I built. I've run into issues with the dev image crash-looping the head Pod after being built; when I hit it before, it was due to the Python dependencies getting messed up (not sure how), and I believe I fixed it by changing the Python version to FROM rayproject/ray:nightly-py311, although py310 is currently working for me.
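
That swap just means pointing the existing image fields at whatever you pushed, e.g. (placeholder image name):

spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: <your-repo>/<your-image>:<tag>   # the dev image you built and pushed
  workerGroupSpecs:
  - groupName: large-group
    template:
      spec:
        containers:
        - name: ray-worker
          image: <your-repo>/<your-image>:<tag>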

ryanaoleary avatar Jul 17 '25 21:07 ryanaoleary

Ok will try that thanks

testinfected avatar Jul 17 '25 21:07 testinfected

Thanks for the investigation, and thanks @testinfected for pointing me here. When will the fix above be available? It was mentioned that this is already in the nightly build, but I was thinking of a release like v2.49 or something?

SheldonTsen avatar Aug 29 '25 08:08 SheldonTsen