StormService deploy using multiple nodes
🚀 Feature Description and Motivation
I deployed a StormService with podGroupSize: 2, but it seems that the pods belonging to the same pod set do not jointly serve inference for the model. Is there a demo for this case?
INFO 09-25 00:16:49 [ray_utils.py:334] No current placement group found. Creating a new placement group.
WARNING 09-25 00:16:49 [ray_utils.py:341] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 09-25 00:16:59 [ray_utils.py:232] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:xx': 0.001}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 09-25 00:17:19 [ray_utils.py:232] Waiting for creating a placement group of specs for 30 seconds. specs=[{'GPU': 1.0, 'node:xx': 0.001}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 09-25 00:17:59 [ray_utils.py:232] Waiting for creating a placement group of specs for 70 seconds. specs=[{'GPU': 1.0, 'node:xxx': 0.001}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
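The log itself points at the first checks to run. A minimal sketch of those diagnostics, run against one of the prefill pods (the pod name is a placeholder based on the naming pattern seen later in this thread):

# Run from/against the engine container to see why the placement group stalls.
kubectl exec -it prefill-b7d7f85cc-0-0 -- ray status       # total vs. available GPUs Ray can see
kubectl exec -it prefill-b7d7f85cc-0-0 -- ray list nodes   # one entry per node that joined the Ray cluster
# If only a single node/GPU shows up, the two pods of the pod group never formed one Ray
# cluster, so a placement group that needs 2 GPUs can never be satisfied.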
Use Case
spec:
  roles:
    - name: prefill
      replicas: 2
      podGroupSize: 2
      stateful: true
      template:
        metadata:
          annotations:
            k8s.volcengine.com/pod-networks: |
              [
                {
                  "cniConf":{
                      "name":"rdma"
                  }
                }
              ]
          labels:
            model.aibrix.ai/name: qwen3-8B-podset
            model.aibrix.ai/port: "8000"
            model.aibrix.ai/engine: vllm
        spec:
          containers:
            - name: prefill
              image: kvcache-container-image-hb2-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1-lmcache-0.3.1.post1
              command: ["sh", "-c"]
              args:
                - |
                  python3 -m vllm.entrypoints.openai.api_server \
                    --host "0.0.0.0" \
                    --port "8000" \
                    --uvicorn-log-level warning \
                    --model /models/qwen/Qwen3-32B \
                    --gpu-memory-utilization 0.9 \
                    --tensor-parallel-size 1 \
                    --pipeline-parallel-size 2 \
                    --served-model-name qwen3-8B-podset \
                    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
              env:
                - name: PYTHONHASHSEED
                  value: "1047"
                - name: VLLM_SERVER_DEV_MODE
                  value: "1"
                - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                  value: "0.0.0.0"
                - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                  value: "5558"
                - name: VLLM_WORKER_MULTIPROC_METHOD
                  value: spawn
                - name: VLLM_ENABLE_V1_MULTIPROCESSING
                  value: "0"
                - name: GLOO_SOCKET_IFNAME
                  value: eth0
                - name: NCCL_SOCKET_IFNAME
                  value: eth0
                - name: NCCL_IB_DISABLE
                  value: "0"
                - name: NCCL_IB_GID_INDEX
                  value: "7"
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: UCX_TLS
                  value: ^gga
              volumeMounts:
                - mountPath: /dev/shm
                  name: shared-mem
              resources:
                limits:
                  nvidia.com/gpu: 1
              securityContext:
                capabilities:
                  add:
                    - IPC_LOCK
          schedulerName: volcano
    - name: decode
      replicas: 1
      podGroupSize: 2
      stateful: true
      template:
        metadata:
          annotations:
            k8s.volcengine.com/pod-networks: |
              [
                {
                  "cniConf":{
                      "name":"rdma"
                  }
                }
              ]
          labels:
            model.aibrix.ai/name: qwen3-8B-podset
            model.aibrix.ai/port: "8000"
            model.aibrix.ai/engine: vllm
        spec:
          containers:
            - name: decode
              image: kvcache-container-image-hb2-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1-lmcache-0.3.1.post1
              command: ["sh", "-c"]
              args:
                - |
                  python3 -m vllm.entrypoints.openai.api_server \
                    --host "0.0.0.0" \
                    --port "8000" \
                    --uvicorn-log-level warning \
                    --model models/qwen/Qwen3-32B \
                    --gpu-memory-utilization 0.9 \
                    --tensor-parallel-size 1 \
                    --pipeline-parallel-size 2 \
                    --served-model-name qwen3-8B-podset \
                    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
              env:
                - name: PYTHONHASHSEED
                  value: "1047"
                - name: VLLM_SERVER_DEV_MODE
                  value: "1"
                - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                  value: "0.0.0.0"
                - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                  value: "5558"
                - name: VLLM_WORKER_MULTIPROC_METHOD
                  value: spawn
                - name: VLLM_ENABLE_V1_MULTIPROCESSING
                  value: "0"
                - name: GLOO_SOCKET_IFNAME
                  value: eth0
                - name: NCCL_SOCKET_IFNAME
                  value: eth0
                - name: NCCL_IB_DISABLE
                  value: "0"
                - name: NCCL_IB_GID_INDEX
                  value: "7"
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: UCX_TLS
                  value: ^gga
Otherwise, prefill-b7d7f85cc-0-0 and prefill-b7d7f85cc-0-1 each start the model separately, as independent pods.
Proposed Solution
No response

@Jeffwan
vLLM needs a Ray cluster to go cross-node; we would need to launch Ray inside the vLLM containers to use PodGroupSize. Are you able to use sglang at this moment?
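To make the comment above concrete, here is a minimal, untested sketch of what launching Ray inside each vLLM container could look like: the pod-group member with index 0 starts the Ray head and runs the API server with the Ray executor backend, while the other member only joins the Ray cluster. The in-group index is derived from the pod hostname suffix seen earlier (prefill-...-0-0 vs prefill-...-0-1), and RAY_HEAD_ADDR is a hypothetical placeholder for a resolvable address of the group's index-0 pod; neither is a documented StormService guarantee.

# Sketch only. Assumes the last hostname token is the in-group index (…-0-0 / …-0-1)
# and that RAY_HEAD_ADDR resolves to the group's index-0 pod; both are assumptions.
GROUP_INDEX="${HOSTNAME##*-}"
if [ "$GROUP_INDEX" = "0" ]; then
  ray start --head --port=6379
  # Only the head pod exposes the API server; vLLM then places its workers via Ray,
  # which is what the placement-group log at the top of this issue is waiting for.
  python3 -m vllm.entrypoints.openai.api_server \
    --model /models/qwen/Qwen3-32B \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray
else
  # The second pod in the group only contributes its GPU to the shared Ray cluster.
  ray start --address="${RAY_HEAD_ADDR}:6379" --block
fi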
yes, I can try to use sglang.
@Jeffwan For sglang, I started from https://github.com/vllm-project/aibrix/blob/main/samples/disaggregation/sglang/tp-1p1d.yaml, added podGroupSize, and changed the command:
replicas: 2
podGroupSize: 2
python3 -m sglang.launch_server \
  --model-path /models/qwen/Qwen3-8B \
  --served-model-name qwen3-8B-tp \
  --host 0.0.0.0 \
  --port 30000 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend=nixl \
  --trust-remote-code \
  --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.aihc-pom.svc.cluster.local:5000" \
  --nnodes 4 \
  --node-rank $ROLE_REPLICA_INDEX \
  --tp-size 4 \
  --mem-fraction-static 0.8 \
  --log-level debug
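Since the failures below are DNS resolution errors on the --dist-init-addr host, a small guard that waits for that hostname to resolve before launching sglang can help separate an ordering problem (the pod not yet registered in DNS) from a wrong name. A rough sketch; the hostname is copied from the errors below and the 60x5s retry budget is arbitrary:

# Sketch: block until the rendezvous host resolves, then fail fast with a clear message.
DIST_HOST="tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local"
for i in $(seq 1 60); do
  getent hosts "$DIST_HOST" >/dev/null && { echo "resolved $DIST_HOST"; break; }
  echo "waiting for DNS ($i/60)"; sleep 5
done
getent hosts "$DIST_HOST" >/dev/null || { echo "DNS never resolved; check the headless Service"; exit 1; }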
Prefill Pod 1-0
[W928 05:18:04.608896712 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.xx-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
Prefill Pod 0-0
[W928 05:22:26.839086453 TCPStore.cpp:343] [c10d] TCP client failed to connect/validate to host tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local:5000 - retrying (try=0, timeout=600000ms, delay=72872ms): The client socket has timed out after 600000ms while trying to connect to (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f57c73785e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8bfe (0x7f57b06d5bfe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x136920d (0x7f57abe9620d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bf5791 (0x7f57b0722791 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5bf5949 (0x7f57b0722949 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5bf5d01 (0x7f57b0722d01 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x5ba3eeb (0x7f57b06d0eeb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x4b5 (0x7f57b06d37f5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xc1f385 (0x7f57bf575385 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xc53c74 (0x7f57bf5a9c74 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3896be (0x7f57becdf6be in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x18ae52 (0x557fa5c40e52 in sglang::scheduler_TP0)
frame #12: _PyObject_MakeTpCall + 0x25b (0x557fa5c3777b in sglang::scheduler_TP0)
frame #13: <unknown function> + 0x198bfb (0x557fa5c4ebfb in sglang::scheduler_TP0)
frame #14: _PyObject_Call + 0x118 (0x557fa5c4f768 in sglang::scheduler_TP0)
frame #15: <unknown function> + 0x19526b (0x557fa5c4b26b in sglang::scheduler_TP0)
frame #16: <unknown function> + 0x181b1b (0x557fa5c37b1b in sglang::scheduler_TP0)
frame #17: <unknown function> + 0x38824b (0x7f57becde24b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #18: _PyObject_MakeTpCall + 0x25b (0x557fa5c3777b in sglang::scheduler_TP0)
frame #19: _PyEval_EvalFrameDefault + 0x6907 (0x557fa5c31b77 in sglang::scheduler_TP0)
frame #20: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #21: _PyEval_EvalFrameDefault + 0x6c0 (0x557fa5c2b930 in sglang::scheduler_TP0)
frame #22: <unknown function> + 0x229a25 (0x557fa5cdfa25 in sglang::scheduler_TP0)
frame #23: <unknown function> + 0x18b909 (0x557fa5c41909 in sglang::scheduler_TP0)
frame #24: _PyEval_EvalFrameDefault + 0x6c0 (0x557fa5c2b930 in sglang::scheduler_TP0)
frame #25: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #26: PyObject_Call + 0x122 (0x557fa5c4f5c2 in sglang::scheduler_TP0)
frame #27: _PyEval_EvalFrameDefault + 0x2a7b (0x557fa5c2dceb in sglang::scheduler_TP0)
frame #28: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #29: PyObject_Call + 0x122 (0x557fa5c4f5c2 in sglang::scheduler_TP0)
frame #30: _PyEval_EvalFrameDefault + 0x2a7b (0x557fa5c2dceb in sglang::scheduler_TP0)
frame #31: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #32: _PyEval_EvalFrameDefault + 0x1987 (0x557fa5c2cbf7 in sglang::scheduler_TP0)
frame #33: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #34: _PyEval_EvalFrameDefault + 0x1987 (0x557fa5c2cbf7 in sglang::scheduler_TP0)
frame #35: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #36: _PyEval_EvalFrameDefault + 0x807 (0x557fa5c2ba77 in sglang::scheduler_TP0)
frame #37: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #38: _PyObject_FastCallDictTstate + 0x16d (0x557fa5c369fd in sglang::scheduler_TP0)
frame #39: <unknown function> + 0x194c44 (0x557fa5c4ac44 in sglang::scheduler_TP0)
frame #40: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #41: _PyEval_EvalFrameDefault + 0x6907 (0x557fa5c31b77 in sglang::scheduler_TP0)
frame #42: _PyObject_FastCallDictTstate + 0xc4 (0x557fa5c36954 in sglang::scheduler_TP0)
frame #43: <unknown function> + 0x194ce5 (0x557fa5c4ace5 in sglang::scheduler_TP0)
frame #44: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #45: _PyEval_EvalFrameDefault + 0x59c7 (0x557fa5c30c37 in sglang::scheduler_TP0)
frame #46: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #47: _PyObject_FastCallDictTstate + 0x16d (0x557fa5c369fd in sglang::scheduler_TP0)
frame #48: <unknown function> + 0x194c44 (0x557fa5c4ac44 in sglang::scheduler_TP0)
frame #49: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #50: _PyEval_EvalFrameDefault + 0x6907 (0x557fa5c31b77 in sglang::scheduler_TP0)
frame #51: _PyObject_FastCallDictTstate + 0xc4 (0x557fa5c36954 in sglang::scheduler_TP0)
frame #52: <unknown function> + 0x194ce5 (0x557fa5c4ace5 in sglang::scheduler_TP0)
frame #53: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #54: _PyEval_EvalFrameDefault + 0x59c7 (0x557fa5c30c37 in sglang::scheduler_TP0)
frame #55: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #56: _PyEval_EvalFrameDefault + 0x2a7b (0x557fa5c2dceb in sglang::scheduler_TP0)
frame #57: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #58: _PyEval_EvalFrameDefault + 0x807 (0x557fa5c2ba77 in sglang::scheduler_TP0)
frame #59: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #60: _PyEval_EvalFrameDefault + 0x807 (0x557fa5c2ba77 in sglang::scheduler_TP0)
frame #61: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #62: _PyEval_EvalFrameDefault + 0x6c0 (0x557fa5c2b930 in sglang::scheduler_TP0)
[W928 05:23:39.722606763 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
[W928 05:24:55.695737020 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
[W928 05:25:50.285622671 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
[W928 05:26:23.456642694 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
ping tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local
ping: unknown host
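Since ping cannot resolve the name either, it is worth confirming that the headless Service the per-pod DNS names hang off actually exists and has the prefill pods as endpoints. Assuming the second DNS label is the Service name (tp-1p1d) and the namespace is aibrix-test, as in the log above, a quick check could look like:

kubectl get svc tp-1p1d -n aibrix-test -o wide    # should exist and be headless (ClusterIP: None)
kubectl get endpoints tp-1p1d -n aibrix-test      # prefill pods should be listed as endpoints
# Resolve the name from another pod in the same namespace (busybox is just an example image):
kubectl run -n aibrix-test dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local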
@ying2025 Is the issue still relevant?