
Multi-Node Inference: "Waiting for creating a placement group of specs for 310 seconds"

Open ying2025 opened this issue 8 months ago • 3 comments

🚀 Feature Description and Motivation

WARNING 04-01 02:45:50 ray_utils.py:320] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 04-01 02:46:00 ray_utils.py:214] Waiting for creating a placement group of specs for 10 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:46:20 ray_utils.py:214] Waiting for creating a placement group of specs for 30 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:47:00 ray_utils.py:214] Waiting for creating a placement group of specs for 70 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:48:20 ray_utils.py:214] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:51:00 ray_utils.py:214] Waiting for creating a placement group of specs for 310 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:56:20 ray_utils.py:214] Waiting for creating a placement group of specs for 630 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
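For context, the elapsed times in these repeated log lines (10, 30, 70, 150, 310, 630 seconds) grow geometrically rather than linearly. A minimal sketch of that schedule, assuming each new elapsed value is double the previous one plus the 10-second base interval (an inference from the log pattern above, not from reading the vLLM source):

```python
def wait_schedule(base=10, n=6):
    """Reproduce the elapsed-seconds values at which vLLM's
    ray_utils re-logs while waiting for the placement group.
    Assumed recurrence (inferred from the logs): t_next = 2*t + base."""
    times, t = [], base
    for _ in range(n):
        times.append(t)
        t = 2 * t + base
    return times

print(wait_schedule())  # [10, 30, 70, 150, 310, 630]
```

This is why the issue title mentions 310 seconds: it is simply one step in the doubling retry loop, and the loop will keep waiting as long as the placement group cannot be satisfied.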

Use Case

I used https://github.com/vllm-project/aibrix/blob/main/samples/distributed/fleet-two-node.yaml to run the Ray example.

Proposed Solution

No response

ying2025 avatar Apr 01 '25 09:04 ying2025

@ying2025 Did you check whether all your pods are ready? Seems there's only 1 visible GPU.

Jeffwan avatar Apr 01 '25 10:04 Jeffwan

Did you check whether all your pods are ready? Seems there's only 1 visible GPU.

The GPU resources are sufficient, and the head and worker pods are scheduled to the same node. The head pod is waiting for the worker pod to start, but the worker pod remains in the PodInitializing state.

qwen-coder-7b-instruct-v3-5988f954f5-kf24g-head-kpv2f             0/1     Running            2 (119s ago)       58m    
qwen-coder-7b-instruct-v3-5988f954f5-kf24g-small-g-worker-4wvfn   0/1     Init:0/1           0                  58m

ying2025 avatar Apr 02 '25 02:04 ying2025

@Jeffwan Now I have enough resources.

# the head `mifs-nodetest166-hg-ray-67ff5218c47c150-head-wk5dx`:
Waiting for creating a placement group of specs for 150 seconds. specs=[{'GPU': 1.0, 'node:xxx': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
mifs-nodetest166-hg-ray-67ff5218c47c150-head-wk5dx     0/1     Running    0          6m23s   (l20)
mifs-nodetest166-hg-ray-67ff5218c47c150-worker-bp7pn   0/1     Init:0/1   0          6m23s   (l20)
mifs-nodetest166-hg-ray-67ff5218c47c150-worker-pn655   1/1     Running    0          6m23s   (4090)
mifs-nodetest166-hg-ray-67ff5218c47c150-worker-rqn5n   1/1     Running    0          6m22s   (4090)
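The specs list in these logs has a fixed shape: one bundle per GPU, with the first bundle also pinned to the driver's node via a tiny `node:<IP>` custom resource. A hypothetical helper reconstructing that layout (the function name and structure are assumptions based purely on the logged output above):

```python
def placement_group_specs(driver_ip, world_size):
    """Build bundles matching the logged format: the first bundle
    is pinned to the driver's node with a tiny 'node:<IP>' resource,
    and every bundle requests one GPU. (Hypothetical reconstruction
    of the spec layout seen in the logs, not vLLM's actual code.)"""
    specs = [{"node:" + driver_ip: 0.001, "GPU": 1.0}]
    specs += [{"GPU": 1.0} for _ in range(world_size - 1)]
    return specs

print(placement_group_specs("10.163.194.101", 2))
# [{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]
```

Since Ray can only schedule the group once every bundle fits on a node that has joined the cluster, the one worker pod stuck in Init:0/1 above is enough to block the whole 4-bundle placement group indefinitely.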

ying2025 avatar Apr 16 '25 07:04 ying2025