Multi-Node Inference: "Waiting for creating a placement group of specs for 310 seconds"
🚀 Feature Description and Motivation
WARNING 04-01 02:45:50 ray_utils.py:320] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 04-01 02:46:00 ray_utils.py:214] Waiting for creating a placement group of specs for 10 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:46:20 ray_utils.py:214] Waiting for creating a placement group of specs for 30 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:47:00 ray_utils.py:214] Waiting for creating a placement group of specs for 70 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:48:20 ray_utils.py:214] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:51:00 ray_utils.py:214] Waiting for creating a placement group of specs for 310 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 04-01 02:56:20 ray_utils.py:214] Waiting for creating a placement group of specs for 630 seconds. specs=[{'node:10.163.194.101': 0.001, 'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
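The message suggests two checks: that `ray status` reports enough GPUs, and that the IP each node advertises matches its `VLLM_HOST_IP`. A minimal Python sketch of the first check, assuming it is run inside the head pod (for example via `kubectl exec`), might look like this:

```python
# Minimal sketch (run inside the head pod) to check whether Ray actually
# sees enough GPUs for the requested placement group. Assumes the Ray
# cluster started by the head pod is reachable at the default address.
import ray

ray.init(address="auto")  # attach to the existing cluster

total = ray.cluster_resources()
available = ray.available_resources()
print("total GPUs:", total.get("GPU", 0))
print("available GPUs:", available.get("GPU", 0))

# The spec above needs 2 GPU bundles, one of them pinned to the head node
# (the 'node:10.163.194.101' resource), so list what each node contributes.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive:", node["Alive"],
          "GPUs:", node["Resources"].get("GPU", 0))
```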
Use Case
I used https://github.com/vllm-project/aibrix/blob/main/samples/distributed/fleet-two-node.yaml to run the Ray example.
Proposed Solution
No response
@ying2025 Did you check whether all your pods are ready? Seems there's only 1 visible GPU.
The GPU resources are sufficient, and the head and worker pods are scheduled to the same node. The head pod is waiting for the worker pod to start, but the worker pod remains in the PodInitializing state:
NAME                                                              READY   STATUS     RESTARTS       AGE
qwen-coder-7b-instruct-v3-5988f954f5-kf24g-head-kpv2f             0/1     Running    2 (119s ago)   58m
qwen-coder-7b-instruct-v3-5988f954f5-kf24g-small-g-worker-4wvfn   0/1     Init:0/1   0              58m
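To see why the worker pod stays in Init:0/1, the init container's status and the pod's events are the first things to inspect (the same information `kubectl describe pod` shows). A rough sketch with the Kubernetes Python client, using the worker pod name from the listing above and assuming the default namespace:

```python
# Sketch for inspecting a pod stuck in Init:0/1. The pod name comes from
# the listing above; the namespace is an assumption.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

namespace = "default"  # adjust to your namespace
pod_name = "qwen-coder-7b-instruct-v3-5988f954f5-kf24g-small-g-worker-4wvfn"

pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
for st in pod.status.init_container_statuses or []:
    waiting = st.state.waiting
    print(st.name, "ready:", st.ready,
          "waiting reason:", waiting.reason if waiting else None)

# Recent events usually show what the init container is blocked on.
events = v1.list_namespaced_event(
    namespace, field_selector=f"involvedObject.name={pod_name}")
for ev in events.items:
    print(ev.reason, "-", ev.message)
```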
@Jeffwan Now I have enough resources.
# Log from the head pod `mifs-nodetest166-hg-ray-67ff5218c47c150-head-wk5dx`:
Waiting for creating a placement group of specs for 150 seconds. specs=[{'GPU': 1.0, 'node:xxx': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
NAME                                                   READY   STATUS     RESTARTS   AGE
mifs-nodetest166-hg-ray-67ff5218c47c150-head-wk5dx     0/1     Running    0          6m23s   (L20)
mifs-nodetest166-hg-ray-67ff5218c47c150-worker-bp7pn   0/1     Init:0/1   0          6m23s   (L20)
mifs-nodetest166-hg-ray-67ff5218c47c150-worker-pn655   1/1     Running    0          6m23s   (4090)
mifs-nodetest166-hg-ray-67ff5218c47c150-worker-rqn5n   1/1     Running    0          6m22s   (4090)
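The other hint in the vLLM message is the `VLLM_HOST_IP` check: the first bundle in the spec is pinned to a specific node IP (for example `node:10.163.194.101` above), so if the IP vLLM uses does not match any IP that Ray has registered, that bundle can never be placed. A small sketch of that comparison, again assumed to run inside the head pod:

```python
# Sketch comparing the VLLM_HOST_IP set in this pod against the node IPs
# Ray has registered, per the hint in the vLLM log message.
import os
import ray

ray.init(address="auto")

host_ip = os.environ.get("VLLM_HOST_IP")
print("VLLM_HOST_IP in this pod:", host_ip)

registered = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
print("alive Ray nodes:", registered)

if host_ip and host_ip not in registered:
    print("VLLM_HOST_IP does not match any registered Ray node; "
          "a placement-group bundle pinned to this IP cannot be scheduled.")
```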