Question: Autoscaler v1 vs v2 configuration and performance
Hey guys,
I'm evaluating Ray on GKE with KubeRay for our company and have run a few benchmarks; so far it's very promising.
I've been running Ray 2.46 with KubeRay 1.3.2 on a GKE cluster with GKE autoscaling disabled, so all my nodes are up.
I'd like to benchmark autoscaler v1 and v2. The first batch of tests I ran used the following config:
autoscalerOptions:
  version: 2
I thought that would be enough to trigger usage of autoscaler v2, but looking at the autoscaler logs it's apparently using the Ray StandardAutoscaler and not the new autoscaler. Is it expected that you also have to set
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          image: rayproject/ray:2.X.Y
          # Include the environment variable.
          env:
            - name: RAY_enable_autoscaler_v2
              value: "1"
as indicated in https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html to enable the autoscaler v2?
I must confess I'm a bit confused by the documentation.
Also, from my initial tests with the v2, it scales much slower than the v1, and reports infeasible resource requests. Am I doing something wrong here?
Thanks in advance
Hi @testinfected, thanks for reporting the issue!
Regarding this config:
autoscalerOptions:
  version: 2
I believe this will only be available starting in v1.4. Setting RAY_enable_autoscaler_v2 should work, you may also need to set restartPolicy to Never.
@rueian @davidxia @ryanaoleary any ideas?
Thanks @andrewsykim for the quick reply, very much appreciated.
I will run a batch of tests with v1 vs v2 and share the results. I was expecting on-par performance with v2, but maybe it's too early? I see it's still considered beta.
I believe this will only be available starting in v1.4. Setting RAY_enable_autoscaler_v2 should work, you may also need to set restartPolicy to Never.
That's correct.
If you’re using KubeRay < 1.4.0, enable V2 by setting the RAY_enable_autoscaler_v2 environment variable in the head and using restartPolicy: Never on head and all worker groups.
— https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#:~:text=If%20you%E2%80%99re%20using%20KubeRay%20%3C%201.4.0%2C%20enable%20V2%20by%20setting%20the%20RAY_enable_autoscaler_v2%20environment%20variable%20in%20the%20head%20and%20using%20restartPolicy%3A%20Never%20on%20head%20and%20all%20worker%20groups
Also make sure to set
spec:
  enableInTreeAutoscaling: true
Lmk if that last part isn't clear from the docs, and we can emphasize.
My bad, misread that part. All clear indeed.
Any idea how v1 vs v2 autoscaler expected to perform in terms of speed to scale up?
@testinfected are you able to share some logs from both the v1 and v2 autoscaler for us to look at? I'm assuming everything else in the cluster is the same and you mentioned both clusters have the nodes pre-provisioned. Is that correct?
Yes correct. I will share my test results.
Any idea how v1 vs v2 autoscaler expected to perform in terms of speed to scale up?
I think they should not have big differences. I just tested the two yamls: https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler.yaml https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
with a small workload that launches 5 actors simultaneously, and in both cases, the workers are ready in 0.3 seconds.
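For reference, a minimal sketch of that kind of workload (the actor definition, CPU request, and readiness check here are illustrative assumptions, not the exact script used):

import time

import ray

ray.init()


@ray.remote(num_cpus=1)
class Worker:
    def ping(self) -> str:
        return "ready"


start = time.time()
# Launch 5 actors simultaneously; with no idle workers this forces the
# autoscaler to request new worker pods to satisfy the CPU demand.
actors = [Worker.remote() for _ in range(5)]
# Block until every actor is scheduled and responds, i.e. the workers are up.
ray.get([a.ping.remote() for a in actors])
print(f"Workers ready in {time.time() - start:.1f}s")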
Are you sure you've tested with Autoscaler v2? Using version: 2 only works with KubeRay 1.4.0, which I have confirmed.
The autoscaler v2 looks promising, although I'm still trying to understand its behavior as I've seen inconsistencies when running the same test several times.
It happens frequently that the requested instances stay in the QUEUED state for several minutes! For example, look at this monitor log:
You will see that after the 200 instances are QUEUED, nothing happens for more than 4 whole minutes.
I'm using KubeRay 1.4.0, here's the cluster config (the job is a simple wait submit using the Job API):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: autoscaling-cluster
spec:
  rayVersion: '2.46.0'
  enableInTreeAutoscaling: true
  # Configuration overrides for the autoscaler.
  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    securityContext: {}
    env:
      - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
        value: "1000"
    resources:
      limits:
        cpu: 1
        memory: 5Gi
      requests:
        cpu: 1
        memory: 5Gi
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
      # No actor or task should run on the head pod
      num-cpus: "0"
    template:
      metadata:
        labels: {}
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.46.0
            # Setting requests=limits is recommended with Ray. K8s limits are used for Ray-internal
            # resource accounting. K8s requests are not used by Ray.
            resources:
              limits:
                cpu: 25
                memory: 20Gi
              requests:
                cpu: 25
                memory: 20Gi
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
    - replicas: 0
      minReplicas: 0
      maxReplicas: 350
      groupName: large-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.46.0
              resources:
                limits:
                  cpu: 30
                  memory: 25Gi
                requests:
                  cpu: 30
                  memory: 25Gi
              # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumes:
            - name: ray-logs
              emptyDir: {}
Any idea of what's happening during those 4 minutes?
It seems like if I run the test a second time on the same Ray cluster (only the head pod running), the scaling happens almost immediately, but I'll do some more runs to be sure.
I have seen the same behavior with KubeRay 1.3.2 btw.
I wasn't able to find in the Ray code where the transition from QUEUED -> REQUESTED happens, and what is the trigger. Could you point me in the right direction?
Thanks
Thanks for sharing @testinfected
@chiayi @ryanaoleary could you help take a look please?
I wasn't able to find in the Ray code where the transition from QUEUED -> REQUESTED happens, and what is the trigger. Could you point me in the right direction?
I think I finally found it: https://github.com/ray-project/ray/blob/7bead4f705c2d88dfb9b3ba0b5fd69db6fe56a6b/python/ray/autoscaler/v2/instance_manager/reconciler.py#L738 right?
Seems like the reconciliation loop is for some reason stuck during those 4 minutes, since we don't get any status update in the log (should be every AUTOSCALER_UPDATE_INTERVAL_S i.e. 5s).
It seems like if I run the test a second time on the same Ray cluster (only the head pod running), the scaling happens almost immediately, but I'll do some more runs to be sure.
I can confirm that whenever I restart the test run with the head pod still up, I don't see that 4 min delay. Pod scaling starts right away.
One thing that may be relevant: on a cold start, it begins with a scale request of 1 node, gets stuck for 4 minutes, then issues the scale request for 200. With a warm head node, I see the 200-node scale request right away.
Hey guys, here's some additional info.
I reran the same test on the same cluster, this time limiting the max concurrent launches to 50. You can see from the ramp-up and autoscaler log files that there's a wait of several minutes before two of the scaling updates. This is consistent with the initial test, where there was a 4 min delay between the initial request of 1 worker node and the scaling to 200 worker nodes.
I've also attached the driver log file; you'll see there are some failures. One I see quite often is a failed assertion on the queue items.
Thanks for looking into this.
@testinfected Looking into this now, adding some logging around the _handle_instances_launch and update_instance_manager_state functions to see exactly where it might be hanging. Our initial thought is that this function could be failing due to the _instance_storage not updating for the large number of instances. I'll try to verify if this is where the issue is occurring through some additional logging.
Also do you still have access to the KubeRay logs when this issue was occurring? There's this line in the reconciler about instances being stuck in a QUEUED state and so it'd be good to verify that the controller isn't waiting on GKE for there to be suitable nodes (although I think you mentioned that eventually all Ray nodes scale up so this probably isn't the case).
@testinfected Looking into this now, adding some logging around the _handle_instances_launch and update_instance_manager_state functions to see exactly where it might be hanging. Our initial thought is that this function could be failing due to the _instance_storage not updating for the large number of instances. I'll try to verify if this is where the issue is occurring through some additional logging.
Great, looking forward to what you can find
Also do you still have access to the KubeRay logs when this issue was occurring? There's this line in the reconciler about instances being stuck in a QUEUED state and so it'd be good to verify that the controller isn't waiting on GKE for there to be suitable nodes (although I think you mentioned that eventually all Ray nodes scale up so this probably isn't the case).
I've seen this error when I use an AUTOSCALER_MAX_CONCURRENT_LAUNCHES value small enough to have several scaling steps. That error doesn't show if I set it to a value larger than the required number of nodes (see log file attached).
That error doesn't show if I set it to a value larger than the required number of nodes (see log file attached).
Of course, if the reconciliation loop is stuck (which seems to be the case), we won't see the error in the autoscaler logs.
Got some additional info.
I'm still using the same test, which essentially looks like this:
import time
import ray

ray.init()

sleep_sec = 300
num_tasks = 3000
test_start = time.time()

@ray.remote(num_cpus=1)
def sleep_task(sleep_sec: int):
    time.sleep(sleep_sec)

res_refs = [sleep_task.remote(sleep_sec) for _ in range(num_tasks - 1)]
res = ray.get(res_refs)

test_end = time.time()
run_time = test_end - test_start
print(
    f"Scale test with {num_tasks} tasks ran in"
    + f" {run_time:.1f} secs.\n"
)
I'm running a 350-node pool with 15 CPUs per node available to Ray workers (so ~5k CPUs). If I run the test with num_tasks = 3000, I don't see a delay when scaling up. If I use 4000 tasks, then I start seeing the problem. In both cases I had set AUTOSCALER_MAX_CONCURRENT_LAUNCHES to 100 this time.
We used py_spy on the process to see where it's stuck and here's what we got:
Python v3.9.22 (/home/ray/anaconda3/bin/python3.9)
Thread 1 (active+gil): "MainThread"
_try_schedule_one (ray/autoscaler/v2/scheduler.py:547)
try_schedule (ray/autoscaler/v2/scheduler.py:419)
_sched_best_node (ray/autoscaler/v2/scheduler.py:1484)
_try_schedule (ray/autoscaler/v2/scheduler.py:1378)
_sched_resource_requests (ray/autoscaler/v2/scheduler.py:1238)
schedule (ray/autoscaler/v2/scheduler.py:871)
_scale_cluster (ray/autoscaler/v2/instance_manager/reconciler.py:1114)
_step_next (ray/autoscaler/v2/instance_manager/reconciler.py:274)
reconcile (ray/autoscaler/v2/instance_manager/reconciler.py:119)
update_autoscaling_state (ray/autoscaler/v2/autoscaler.py:184)
_run (ray/autoscaler/v2/monitor.py:169)
run (ray/autoscaler/v2/monitor.py:182)
run_kuberay_autoscaler (ray/autoscaler/_private/kuberay/run_autoscaler.py:69)
kuberay_autoscaler (ray/scripts/scripts.py:2526)
invoke (click/core.py:783)
invoke (click/core.py:1434)
invoke (click/core.py:1688)
main (click/core.py:1078)
__call__ (click/core.py:1157)
main (ray/scripts/scripts.py:2800)
<module> (ray:8)
Thread 188 (idle): "Thread-1"
select (selectors.py:416)
serve_forever (socketserver.py:232)
run (threading.py:917)
_bootstrap_inner (threading.py:980)
_bootstrap (threading.py:937)
py-spy top gave us this:
Total Samples 400
GIL: 100.00%, Active: 100.00%, Threads: 3
%Own %Total OwnTime TotalTime Function (filename)
51.00% 96.00% 2.06s 3.77s _try_schedule_one (ray/autoscaler/v2/scheduler.py)
15.00% 15.00% 0.720s 0.720s __iter__ (_collections_abc.py)
16.00% 16.00% 0.510s 0.510s _fits (ray/autoscaler/_private/resource_demand_scheduler.py)
4.00% 100.00% 0.200s 3.97s try_schedule (ray/autoscaler/v2/scheduler.py)
4.00% 6.00% 0.190s 0.240s get_available_resources (ray/autoscaler/v2/scheduler.py)
8.00% 8.00% 0.180s 0.230s keys (_collections_abc.py)
2.00% 2.00% 0.050s 0.050s __hash__ (enum.py)
0.00% 0.00% 0.050s 0.050s __init__ (_collections_abc.py)
0.00% 0.00% 0.020s 0.030s message_to_dict (ray/_private/protobuf_compat.py)
0.00% 0.00% 0.010s 0.010s _RegularMessageToJsonObject (protobuf/json_format.py)
0.00% 0.00% 0.010s 0.010s _worker (concurrent/futures/thread.py)
0.00% 0.00% 0.010s 0.010s add_sched_request (ray/autoscaler/v2/scheduler.py)
0.00% 0.00% 0.000s 0.010s _bootstrap (threading.py)
0.00% 100.00% 0.000s 4.00s run_kuberay_autoscaler (ray/autoscaler/_private/kuberay/run_autoscaler.py)
0.00% 0.00% 0.000s 0.010s _MessageToJsonObject (protobuf/json_format.py)
0.00% 100.00% 0.000s 4.00s kuberay_autoscaler (ray/scripts/scripts.py)
0.00% 100.00% 0.000s 4.00s _sched_resource_requests (ray/autoscaler/v2/scheduler.py)
0.00% 100.00% 0.000s 4.00s _sched_best_node (ray/autoscaler/v2/scheduler.py)
0.00% 100.00% 0.000s 4.00s _scale_cluster (ray/autoscaler/v2/instance_manager/reconciler.py)
0.00% 100.00% 0.000s 4.00s _run (ray/autoscaler/v2/monitor.py)
0.00% 0.00% 0.000s 0.010s run (threading.py)
0.00% 100.00% 0.000s 4.00s invoke (click/core.py)
0.00% 100.00% 0.000s 4.00s <module> (ray)
0.00% 100.00% 0.000s 4.00s run (ray/autoscaler/v2/monitor.py)
0.00% 100.00% 0.000s 4.00s __call__ (click/core.py)
0.00% 0.00% 0.000s 0.010s MessageToDict (protobuf/json_format.py)
0.00% 100.00% 0.000s 4.00s schedule (ray/autoscaler/v2/scheduler.py)
0.00% 0.00% 0.000s 0.030s <listcomp> (ray/autoscaler/v2/utils.py)
0.00% 100.00% 0.000s 4.00s _step_next (ray/autoscaler/v2/instance_manager/reconciler.py)
0.00% 100.00% 0.000s 4.00s _try_schedule (ray/autoscaler/v2/scheduler.py)
0.00% 100.00% 0.000s 4.00s main (click/core.py)
0.00% 0.00% 0.000s 0.030s to_dict_list (ray/autoscaler/v2/utils.py)
0.00% 0.00% 0.000s 0.030s to_dict (ray/autoscaler/v2/utils.py)
0.00% 100.00% 0.000s 4.00s main (ray/scripts/scripts.py)
0.00% 100.00% 0.000s 4.00s reconcile (ray/autoscaler/v2/instance_manager/reconciler.py)
0.00% 0.00% 0.000s 0.010s _bootstrap_inner (threading.py)
0.00% 100.00% 0.000s 4.00s update_autoscaling_state (ray/autoscaler/v2/autoscaler.py)
Press Control-C to quit, or ? for help.
(base) ray@autoscaling-cluster-head:~$ py-spy top --pid 1
Any idea of what's happening?
@chiayi @ryanaoleary can you take a look at the stack trace in https://github.com/ray-project/kuberay/issues/3794#issuecomment-3070125687? cc @rueian as well
Just did a run and it took a couple of minutes for the cluster to actually start scaling. I used that time to record a profile of the autoscaler process in speedscope format.
Yeah, like you said, additional logging around _try_schedule just showed that the reconciler was stuck after "launching" (QUEUED->REQUESTED) one node and before transitioning each instance to the next state. I replicated the same issue in my cluster following these steps:
- Setup a GKE autopilot cluster and install KubeRay v1.4
- Create a RayCluster with the following spec:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: autoscaling-cluster
spec:
  # rayVersion: '2.46.0'
  enableInTreeAutoscaling: true
  # Configuration overrides for the autoscaler.
  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    securityContext: {}
    env:
      - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
        value: "100"
    resources:
      limits:
        cpu: 1
        memory: 5Gi
      requests:
        cpu: 1
        memory: 5Gi
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
      # No actor or task should run on the head pod
      num-cpus: "0"
    template:
      metadata:
        labels: {}
      spec:
        containers:
          - name: ray-head
            image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray
            # Setting requests=limits is recommended with Ray. K8s limits are used for Ray-internal
            # resource accounting. K8s requests are not used by Ray.
            resources:
              limits:
                cpu: 25
                memory: 20Gi
              requests:
                cpu: 25
                memory: 20Gi
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
    - replicas: 0
      minReplicas: 0
      maxReplicas: 350
      groupName: large-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray
              resources:
                limits:
                  cpu: 15
                  memory: 25Gi
                requests:
                  cpu: 15
                  memory: 25Gi
              # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumes:
            - name: ray-logs
              emptyDir: {}
kubectl apply -f ray-test-upscale.yaml
RayCluster created:
kubectl get rayclusters
NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
autoscaling-cluster 25 20Gi 0 5m41s
- Create the following test python script and submit as a Ray job:
import time
import ray

ray.init()

sleep_sec = 300
num_tasks = 4000
test_start = time.time()

@ray.remote(num_cpus=1)
def sleep_task(sleep_sec: int):
    time.sleep(sleep_sec)

res_refs = [sleep_task.remote(sleep_sec) for _ in range(num_tasks - 1)]
res = ray.get(res_refs)

test_end = time.time()
run_time = test_end - test_start
print(
    f"Scale test with {num_tasks} tasks ran in"
    + f" {run_time:.1f} secs.\n"
)
kubectl port-forward svc/autoscaling-cluster-head-svc 8265:8265 2>&1 >/dev/null &
ray job submit --submission-id upscale-test-job-1 --working-dir . --runtime-env-json='{"excludes": ["myenv"]}' --address http://localhost:8265 -- python upscale-test.py
Job submitted and running:
Logs show a hang for ~5 minutes after scaling up 1 Pod (REQUESTED->ALLOCATED state for cloud instance):
2025-07-16 22:47:49,395 - INFO - New instance QUEUED (id=900f8ce7-ba70-4c87-8ffb-17ab36eb2c67, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-16 22:47:49,395 INFO instance_manager.py:246 -- New instance QUEUED (id=900f8ce7-ba70-4c87-8ffb-17ab36eb2c67, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-16 22:47:49,400 - INFO - Update instance QUEUED->REQUESTED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id ea0eef16-7c2a-4330-b4f5-815291c071d7
2025-07-16 22:47:49,400 INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id ea0eef16-7c2a-4330-b4f5-815291c071d7
2025-07-16 22:47:49,401 - INFO - Refreshing K8s API client token and certs.
2025-07-16 22:47:49,401 INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-07-16 22:47:49,448 - INFO - Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:49,448 INFO cloud_provider.py:463 -- Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:49,458 - INFO - Fetched pod data at resource version 1752731268950348000.
2025-07-16 22:47:49,458 INFO cloud_provider.py:481 -- Fetched pod data at resource version 1752731268950348000.
2025-07-16 22:47:49,459 - INFO - Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'large-group': 1}), workers_to_delete=defaultdict(<class 'list'>, {}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())
2025-07-16 22:47:49,459 INFO cloud_provider.py:331 -- Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'large-group': 1}), workers_to_delete=defaultdict(<class 'list'>, {}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())
2025-07-16 22:47:54,519 - INFO - Refreshing K8s API client token and certs.
2025-07-16 22:47:54,519 INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-07-16 22:47:54,561 - INFO - Calculating hashes for file mounts and ray commands.
2025-07-16 22:47:54,561 INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-07-16 22:47:54,602 - INFO - Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:54,602 INFO cloud_provider.py:463 -- Listing pods for RayCluster autoscaling-cluster in namespace default at pods resource version >= 1752716816740607021.
2025-07-16 22:47:54,611 - INFO - Fetched pod data at resource version 1752731271151951001.
2025-07-16 22:47:54,611 INFO cloud_provider.py:481 -- Fetched pod data at resource version 1752731271151951001.
2025-07-16 22:47:54,614 - INFO - Update instance REQUESTED->ALLOCATED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): allocated unassigned cloud instance autoscaling-cluster-large-group-worker-5dcdl
2025-07-16 22:47:54,614 INFO instance_manager.py:262 -- Update instance REQUESTED->ALLOCATED (id=22278c2e-23de-4ccd-9205-fe5f53013886, type=large-group, cloud_instance_id=, ray_id=): allocated unassigned cloud instance autoscaling-cluster-large-group-worker-5dcdl
2025-07-16 22:52:45,597 - INFO - Update instance QUEUED->REQUESTED (id=4e21d714-dd14-4e85-aef6-850c588878d5, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
2025-07-16 22:52:45,597 INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=4e21d714-dd14-4e85-aef6-850c588878d5, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
2025-07-16 22:52:45,597 - INFO - Update instance QUEUED->REQUESTED (id=ad9a3a4c-c8b7-4c17-bc11-8cdcd2ee323b, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
2025-07-16 22:52:45,597 INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=ad9a3a4c-c8b7-4c17-bc11-8cdcd2ee323b, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id 80ea18b4-e8a5-42b1-80ad-3bab0b35c2b6
I think the issue is this block of code: it causes the single initial node to launch and allocate a Pod, waits another iteration before scaling up to AUTOSCALER_MAX_CONCURRENT_LAUNCHES, and hangs here as the large number of requests each try to schedule on the one existing node (which is supported by the py-spy top profile showing the large amount of time spent here):
num_desired_to_upscale = max(
    1,
    math.ceil(
        upscaling_speed
        * (
            len(requested_instances_for_type)
            + len(allocated_instances_for_type)
        )
    ),
)
Since on a cold start the autoscaler will always compute to launch one node, we end up spending a lot of time trying to fit each of the other 3999 requests on this one node (I believe with O(N² * M) complexity). This could explain why the long hang occurs for 4000 tasks but not for 3000, with the autoscaler's 1 CPU getting overwhelmed by the polynomial computation.
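As a rough illustration of that cold-start math (the values below are just examples for this scenario, and the fit-check count is a simplification of the actual scheduler loop):

import math

# Illustrative cold-start state: nothing REQUESTED or ALLOCATED yet for this
# node type, with thousands of 1-CPU requests pending from the test job.
upscaling_speed = 1000  # example value; irrelevant here since the sum below is 0
requested_instances_for_type = []
allocated_instances_for_type = []

num_desired_to_upscale = max(
    1,
    math.ceil(
        upscaling_speed
        * (len(requested_instances_for_type) + len(allocated_instances_for_type))
    ),
)
print(num_desired_to_upscale)  # -> 1: only a single node is launched on cold start

# With only that one candidate node in flight, the scheduler still tries to
# place every pending request against it on each pass, which is where the
# py-spy samples show the time going.
num_pending_requests = 3999
candidate_nodes = 1
print(num_pending_requests * candidate_nodes)  # fit checks per pass, at minimum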
An easy mitigation for this issue would just be to increase the amount of CPU for the autoscaler container:
autoscalerOptions:
  version: v2
  upscalingMode: Default
  idleTimeoutSeconds: 60
  imagePullPolicy: Always
  securityContext: {}
  env:
    - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
      value: "100"
  resources:
    limits:
      cpu: 12
      memory: 5Gi
    requests:
      cpu: 12
      memory: 5Gi
With the above spec and the same manual testing steps, I did not encounter the ~5 minute hang. Besides the above mitigation, I think the way num_desired_to_upscale is calculated is incorrect, since limiting the initial scaling event to 1 instance doesn't make sense for the Default or Aggressive upscalingMode. I just created a PR with a fix: https://github.com/ray-project/ray/pull/54699
I did not encounter the hang on cold-start when using this image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray:fix-upscaling
# some logging I added
2025-07-17 01:12:19,116 - INFO - New instance QUEUED (id=b0326768-c385-48da-9c41-031332c70c28, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-17 01:12:19,116 INFO instance_manager.py:247 -- New instance QUEUED (id=b0326768-c385-48da-9c41-031332c70c28, type=large-group, cloud_instance_id=, ray_id=): queuing new instance of large-group from scheduler
2025-07-17 01:12:19,117 - INFO - Computing instances to launch. State: 100 queued, 0 running, upscaling_speed: 1000.num_desired_to_upscale: 1000.Max concurrent launches: 100.
2025-07-17 01:12:19,117 INFO reconciler.py:860 -- Computing instances to launch. State: 100 queued, 0 running, upscaling_speed: 1000.num_desired_to_upscale: 1000.Max concurrent launches: 100.
2025-07-17 01:12:19,118 - INFO - Update instance QUEUED->REQUESTED (id=d2c48cd3-7171-46f1-89e9-9a2113552700, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
2025-07-17 01:12:19,118 INFO instance_manager.py:263 -- Update instance QUEUED->REQUESTED (id=d2c48cd3-7171-46f1-89e9-9a2113552700, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
2025-07-17 01:12:19,118 - INFO - Update instance QUEUED->REQUESTED (id=9cb496af-85f4-4424-84d7-cec769940b8d, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
2025-07-17 01:12:19,118 INFO instance_manager.py:263 -- Update instance QUEUED->REQUESTED (id=9cb496af-85f4-4424-84d7-cec769940b8d, type=large-group, cloud_instance_id=, ray_id=): requested to launch large-group with request id c351c604-219a-496c-8f85-22f7856dc554
I think another part of the issue, besides the initial fix I included above, is that the resource requests passed to try_schedule and iterated over here should be grouped when identical. This would reduce the time complexity of trying to fit resource requests on in-flight nodes, since some logging I added currently shows:
2025-07-17 01:12:14,935 - INFO - Fetched pod data at resource version 1752739930501757000.
2025-07-17 01:12:14,935 INFO cloud_provider.py:484 -- Fetched pod data at resource version 1752739930501757000.
2025-07-17 01:12:14,960 - INFO - Trying to schedule # resource requests: 3999
2025-07-17 01:12:14,960 INFO scheduler.py:425 -- Trying to schedule # resource requests: 3999
2025-07-17 01:12:14,960 - INFO - Starting try schedule
for each node. I think we can add a check here to first group requests by shape and then check whether that shape fits (rather than each individual request) on a given node. This should reduce the number of calls to _try_schedule_one that are causing the autoscaler to hang. What do you think @rueian?
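A rough sketch of that grouping idea (the request/node representations and helper names below are illustrative, not the scheduler's actual types):

from collections import Counter
from typing import Dict, List, Tuple

# Illustrative stand-ins for the scheduler's resource requests and node state.
ResourceShape = Tuple[Tuple[str, float], ...]  # e.g. (("CPU", 1.0),)


def group_by_shape(requests: List[Dict[str, float]]) -> Counter:
    """Collapse identical resource requests into (shape, count) pairs."""
    return Counter(tuple(sorted(r.items())) for r in requests)


def max_fit_count(shape: ResourceShape, available: Dict[str, float]) -> int:
    """How many requests of this shape fit on a node with `available` resources."""
    counts = []
    for resource, amount in shape:
        if amount <= 0:
            continue
        counts.append(int(available.get(resource, 0.0) // amount))
    return min(counts) if counts else 0


# 3999 identical 1-CPU requests collapse into a single shape, so fitting them
# onto a 15-CPU node is one check instead of 3999 per-request checks.
requests = [{"CPU": 1.0} for _ in range(3999)]
node_available = {"CPU": 15.0, "memory": 25 * 2**30}

for shape, count in group_by_shape(requests).items():
    fits = min(count, max_fit_count(shape, node_available))
    print(f"shape={dict(shape)}: {fits} of {count} requests fit on this node")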
Great findings @ryanaoleary. Your explanation makes perfect sense, thanks.
As I mentioned in my first message we're currently evaluating Ray for running large scale AI workloads. So having a resolution to this issue is really good news.
FYI a critical blocker for us right now is https://github.com/ray-project/ray/issues/54321. If you have any idea of what could be causing that other issue, that would be great!
Sounds good, I'll start taking a look at that issue as well. Another comment on the autoscaler issue: I believe there's a known issue where setting the Kubernetes Pod resources.limits leads to degraded performance. I haven't tested it yet, but removing the limits block and just leaving requests under autoscalerOptions.resources could help improve the performance you're seeing.
I suggested adding a way to set debug mode on the asyncio event loop to help understand what's happening. See https://github.com/ray-project/ray/issues/54321#issuecomment-3062707228
Another question: do you have a more comprehensive developer documentation than what's available here: https://docs.ray.io/en/latest/ray-contribute/development.html
I'd like to be able to build a Ray Docker image locally to help with troubleshooting. It's not clear to me after reading the docs and looking at the code what's required to build an image after making some changes to the python code.
Thanks!
@testinfected I don't see a guide currently in the Ray docs for this, but these are my steps for building a dev image to test with:
I follow the steps here to set up a virtual environment. I also usually install these dependencies and a Ray dev image following these steps: https://docs.ray.io/en/latest/ray-contribute/development.html#installing-additional-dependencies-for-development so that I can run tests.
- Create a Dockerfile like this (one currently exists here):
# Use the latest Ray master as base.
FROM rayproject/ray:nightly-py310
# Invalidate the cache so that fresh code is pulled in the next step.
ARG BUILD_DATE
# Retrieve your development code.
ADD . ray
# Install symlinks to your modified Python code.
RUN python ray/python/ray/setup-dev.py -y
- Run a script with the following to build/push the image. Make sure your dev branch is up-to-date with the latest changes from master before building the image (since it uses your local python code):
set -x
RAY_IMAGE=<docker repo you want to push to and image tag>
docker image rm $RAY_IMAGE
pushd ../../../..
docker build --progress=plain --build-arg BUILD_DATE="$(date +%Y-%m-%d:%H:%M:%S)" -t $RAY_IMAGE -f ./<YOUR_FOLDER>/Dockerfile . || exit
popd || exit
docker image push $RAY_IMAGE
I then replace the image used in my RayCluster with the one I built. I've run into issues with the dev image crash-looping the head Pod after being built; when I ran into it before, it was due to the Python dependencies getting messed up (not sure how), and I believe I fixed it by changing the Python version to FROM rayproject/ray:nightly-py311, although py310 is currently working for me.
Ok will try that thanks
Thanks for the investigation, and thanks @testinfected for pointing me here. When will the fix above be available? It was mentioned that this is already in the nightly build, but I was thinking something like v2.49?