
[Core] Actor/Task cannot be scheduled on worker node.

Open MissiontoMars opened this issue 1 year ago • 2 comments

What happened + What you expected to happen

The GCS leased an actor to a node, but that node is behaving abnormally, constantly printing the same repeated logs into its raylet.out file. It appears that the node never received the pubsub event for job 58000000 and therefore refused to start the actor.

[2024-04-22 15:18:14,486 D 81 81] (raylet) cluster_resource_manager.cc:63: Update node info, node_id: 7838018267404916362, node_resources: {"resources":{cpuonly: 9990000/9990000, node:[fdbd:dc03:f:b30::31]: 10000/10000, CPU: 129000/160000, object_store_memory: 103079215100000/103079215100000, memory: 150323855360000/343597383680000}, "labels":{}
[2024-04-22 15:18:14,486 D 81 81] (raylet) local_task_manager.cc:492: This node has available resources, but no worker processes to grant the lease fd41d8cd55a9d34d25880431b8bdca5075bfecb64f000000
[2024-04-22 15:18:14,486 D 81 81] (raylet) worker_pool.cc:439: Job config of job 58000000 are not local yet.
[2024-04-22 15:18:14,486 D 81 81] (raylet) worker_pool.cc:1619: DeleteRuntimeEnvIfPossible {"_ray_commit": "{{RAY_COMMIT_SHA}}", "env_vars":.............
[2024-04-22 15:18:14,486 D 81 81] (raylet) cluster_resource_manager.cc:63: Update node info, node_id: 7838018267404916362, node_resources: {"resources":{CPU: 139000/160000, cpuonly: 9990000/9990000, memory: 214748364800000/343597383680000, object_store_memory: 103079215100000/103079215100000, node:[fdbd:dc03:f:b30::31]: 10000/10000}, "labels":{}
[2024-04-22 15:18:14,486 D 81 81] (raylet) local_task_manager.cc:492: This node has available resources, but no worker processes to grant the lease b089440550c0362da7984f4a1dd2463d251f4bf058000000
[2024-04-22 15:18:14,486 D 81 81] (raylet) worker_pool.cc:439: Job config of job 5e000000 are not local yet.
[2024-04-22 15:18:14,486 D 81 81] (raylet) worker_pool.cc:1619: DeleteRuntimeEnvIfPossible {"_ray_commit": "{{RAY_COMMIT_SHA}}", "env_va.....
[2024-04-22 15:18:14,486 D 81 81] (raylet) cluster_resource_manager.cc:63: Update node info, node_id: 7838018267404916362, node_resources: {"resources":{cpuonly: 9990000/9990000, node:[fdbd:dc03:f:b30::31]: 10000/10000, CPU: 140000/160000, object_store_memory: 103079215100000/103079215100000, memory: 214748364800000/343597383680000}, "labels":{}
[2024-04-22 15:18:14,486 D 81 81] (raylet) local_task_manager.cc:492: This node has available resources, but no worker processes to grant the lease ffffffffffffffff3e8271d29fef9310651a4e395e000000
[2024-04-22 15:18:14,487 D 81 81] (raylet) worker_pool.cc:439: Job config of job 58000000 are not local yet.
[2024-04-22 15:18:14,487 D 81 81] (raylet) worker_pool.cc:1619: DeleteRuntimeEnvIfPossible {"_ray_commit": "{{RAY_COMMIT_SHA}}", "env_vars"...
[2024-04-22 15:18:14,487 D 81 81] (raylet) cluster_resource_manager.cc:63: Update node info, node_id: 7838018267404916362, node_resources: {"resources":{CPU: 150000/160000, cpuonly: 9990000/9990000, memory: 279172874240000/343597383680000, object_store_memory: 103079215100000/103079215100000, node:[fdbd:dc03:f:b30::31]: 10000/10000}, "labels":{}
[2024-04-22 15:18:14,487 D 81 81] (raylet) local_task_manager.cc:492: This node has available resources, but no worker processes to grant the lease 9ba01d5f242fe520d607ac8bc513e9f7e3423f2458000000
[2024-04-22 15:18:14,487 D 81 81] (raylet) worker_pool.cc:439: Job config of job 58000000 are not local yet.
[2024-04-22 15:18:14,487 D 81 81] (raylet) worker_pool.cc:1619: DeleteRuntimeEnvIfPossible {"_ray_commit": "{{RAY_COMMIT_SHA}}", "env_vars"

Checking the raylet.out state dumps, this node received 153 GCS_JOB_CHANNEL messages while the other nodes received 183:

logs# tail -f raylet.out |grep GCS_JOB_CHANNEL

[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 153 total (0 active), CPU time: mean = 153.039 us, total = 23.415 ms
[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 153 total (0 active), CPU time: mean = 153.039 us, total = 23.415 ms
[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 153 total (0 active), CPU time: mean = 153.039 us, total = 23.415 ms

Other nodes:

logs# tail -f raylet.out |grep GCS_JOB_CHANNEL


[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 183 total (0 active), CPU time: mean = 245.441 us, total = 44.916 ms
[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 183 total (0 active), CPU time: mean = 245.441 us, total = 44.916 ms
[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 183 total (0 active), CPU time: mean = 245.441 us, total = 44.916 ms

Does this mean that the abnormal node is missing GCS_JOB_CHANNEL-related messages?
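To make the comparison concrete, here is a minimal sketch (names and regex are my own, not part of Ray) that extracts the message count from the state-dump lines above and computes how many messages the abnormal node never received:

```python
import re

# Hypothetical helper: extract the message count from a raylet state-dump
# line like the ones quoted above. The regex targets the "<N> total" field.
PATTERN = re.compile(r"HandlePublishedMessage_GCS_JOB_CHANNEL - (\d+) total")

def job_channel_count(line):
    """Return the GCS_JOB_CHANNEL message count from a state-dump line, or None."""
    m = PATTERN.search(line)
    return int(m.group(1)) if m else None

# Example lines taken from the two nodes' raylet.out dumps:
bad = ("[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL"
       " - 153 total (0 active), CPU time: mean = 153.039 us, total = 23.415 ms")
good = ("[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL"
        " - 183 total (0 active), CPU time: mean = 245.441 us, total = 44.916 ms")

missing = job_channel_count(good) - job_channel_count(bad)
print(missing)  # 30 messages never reached the abnormal node
```

This only quantifies the gap; it does not by itself prove the messages were dropped on the pubsub channel rather than never published.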

Versions / Dependencies

Ray 2.6.1

Reproduction script

This occurs in a production environment, so I cannot provide a reproduction script.

Issue Severity

None

MissiontoMars avatar Apr 22 '24 07:04 MissiontoMars

We have essentially confirmed that the root cause is the abnormal node failing to receive messages on GCS_JOB_CHANNEL.

MissiontoMars avatar Apr 22 '24 13:04 MissiontoMars

@MissiontoMars are you able to reproduce it with the latest Ray?

jjyao avatar Apr 29 '24 21:04 jjyao

There might be environment-specific factors involved in such failures (e.g. network issues). Without a repro script, we cannot dig into the root cause or do anything about it.

hongchaodeng avatar May 01 '24 23:05 hongchaodeng

There might be environment-specific factors involved in such failures (e.g. network issues). Without a repro script, we cannot dig into the root cause or do anything about it.

This problem was found in production environments running large Ray Data jobs, so it is also difficult to provide a simple reproduction script. For Python tasks it does not seem necessary to rely on the job channel pubsub, because this channel is only used to propagate the job config, which Python tasks do not depend on. As a temporary fix, we patched our build so that scheduling tasks and actors no longer depends on the job channel pubsub.
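For anyone hitting the same symptom, a rough diagnostic (my own sketch, based only on the log format quoted earlier) is to count how often each job's config is reported missing in raylet.out; a count that grows without bound for one job id matches this bug:

```python
import re
from collections import Counter

# Matches raylet lines like:
#   worker_pool.cc:439: Job config of job 58000000 are not local yet.
NOT_LOCAL = re.compile(r"Job config of job (\w+) are not local yet")

def missing_job_configs(lines):
    """Count 'job config not local' occurrences per job id."""
    counts = Counter()
    for line in lines:
        m = NOT_LOCAL.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Sample lines from the excerpt above; in practice, read raylet.out.
sample = [
    "(raylet) worker_pool.cc:439: Job config of job 58000000 are not local yet.",
    "(raylet) worker_pool.cc:439: Job config of job 5e000000 are not local yet.",
    "(raylet) worker_pool.cc:439: Job config of job 58000000 are not local yet.",
]
print(missing_job_configs(sample))  # Counter({'58000000': 2, '5e000000': 1})
```

On a healthy node these lines should stop once the job config arrives; persistent repeats for the same job id suggest the node missed the GCS_JOB_CHANNEL publish.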

MissiontoMars avatar May 02 '24 00:05 MissiontoMars