
StormService deployment across multiple nodes

ying2025 opened this issue 3 months ago • 4 comments

🚀 Feature Description and Motivation

I deployed a StormService with podGroupSize: 2, but it seems the pods belonging to the same PodSet do not jointly serve inference for the model. Is there a demo for this case?

INFO 09-25 00:16:49 [ray_utils.py:334] No current placement group found. Creating a new placement group.
WARNING 09-25 00:16:49 [ray_utils.py:341] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 09-25 00:16:59 [ray_utils.py:232] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:xx': 0.001}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 09-25 00:17:19 [ray_utils.py:232] Waiting for creating a placement group of specs for 30 seconds. specs=[{'GPU': 1.0, 'node:xx': 0.001}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
INFO 09-25 00:17:59 [ray_utils.py:232] Waiting for creating a placement group of specs for 70 seconds. specs=[{'GPU': 1.0, 'node:xxx': 0.001}, {'GPU': 1.0}]. Check `ray status` and `ray list nodes` to see if you have enough resources, and make sure the IP addresses used by ray cluster are the same as VLLM_HOST_IP environment variable specified in each node if you are running on a multi-node.
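The placement-group failure above is consistent with the resource math: vLLM asks Ray for tensor_parallel_size × pipeline_parallel_size GPUs, but each pod starts its own standalone Ray runtime and only sees its own GPU. A sketch of the arithmetic, with values taken from the config and logs in this issue:

```python
# Each pod starts its own single-node Ray runtime, so the placement group
# can only draw on that one pod's GPUs.
tensor_parallel_size = 1      # --tensor-parallel-size in the vLLM args below
pipeline_parallel_size = 2    # --pipeline-parallel-size in the vLLM args below
gpus_per_pod = 1              # nvidia.com/gpu: 1 in the resources block

required_gpus = tensor_parallel_size * pipeline_parallel_size
print(required_gpus > gpus_per_pod)  # True: the placement group can never be satisfied
```

This matches the warning "The number of required GPUs exceeds the total number of available GPUs in the placement group": the request of 2 GPUs can never be met by a single-pod Ray cluster with 1 GPU.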

Use Case

    spec:
      roles:
        - name: prefill
          replicas: 2
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              annotations:
                k8s.volcengine.com/pod-networks: |
                  [
                    {
                      "cniConf":{
                          "name":"rdma"
                      }
                    }
                  ]
              labels:
                model.aibrix.ai/name: qwen3-8B-podset
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: prefill
                  image: kvcache-container-image-hb2-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1-lmcache-0.3.1.post1
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m vllm.entrypoints.openai.api_server \
                      --host "0.0.0.0" \
                      --port "8000" \
                      --uvicorn-log-level warning \
                      --model /models/qwen/Qwen3-32B \
                      --gpu-memory-utilization 0.9 \
                      --tensor-parallel-size 1 \
                      --pipeline-parallel-size 2 \
                      --served-model-name qwen3-8B-podset \
                      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'  
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  volumeMounts:                
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
              schedulerName: volcano
        - name: decode
          replicas: 1
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              annotations:
                k8s.volcengine.com/pod-networks: |
                  [
                    {
                      "cniConf":{
                          "name":"rdma"
                      }
                    }
                  ]
              labels:
                model.aibrix.ai/name: qwen3-8B-podset
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: decode
                  image: kvcache-container-image-hb2-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1-lmcache-0.3.1.post1
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m vllm.entrypoints.openai.api_server \
                      --host "0.0.0.0" \
                      --port "8000" \
                      --uvicorn-log-level warning \
--model /models/qwen/Qwen3-32B \
                      --gpu-memory-utilization 0.9 \
                      --tensor-parallel-size 1 \
                      --pipeline-parallel-size 2 \
                      --served-model-name qwen3-8B-podset \
                      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'   
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga

Instead, prefill-b7d7f85cc-0-0 and prefill-b7d7f85cc-0-1 each start the model separately, as independent pods.

Proposed Solution

@Jeffwan No response

ying2025 avatar Sep 25 '25 07:09 ying2025

vLLM needs a Ray cluster for cross-node serving; to make PodGroupSize work, we would need to launch Ray inside the vLLM pods. Are you able to use SGLang at this moment?
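The workaround this comment implies can be sketched as follows (a hedged sketch, not a tested recipe): each pod in the group starts Ray first, with replica 0 acting as the head, so vLLM's placement group can span both pods. `launch_cmd` here only echoes the command it would run; the replica index corresponds to the `ROLE_REPLICA_INDEX` variable seen later in this thread, and the head address would be the rank-0 pod's DNS name.

```shell
# Sketch: choose the Ray launch command per replica before starting vLLM.
# In a real container command, replica 0 would run `ray start --head` and
# then exec the vLLM API server; other replicas would join the head and block.
launch_cmd() {
  # $1 = replica index within the pod group, $2 = rank-0 pod's DNS name
  if [ "$1" -eq 0 ]; then
    echo "ray start --head --port=6379"
  else
    echo "ray start --address=$2:6379 --block"
  fi
}

launch_cmd 0 prefill-group-0   # replica 0: start the Ray head
launch_cmd 1 prefill-group-0   # replica 1: join the head and block
```

With Ray spanning both pods, the placement group would see 2 GPUs and the `--pipeline-parallel-size 2` request could be satisfied; without such a wrapper, each pod's Ray runtime is isolated.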

Jeffwan avatar Sep 26 '25 07:09 Jeffwan

vLLM needs a Ray cluster for cross-node serving; to make PodGroupSize work, we would need to launch Ray inside the vLLM pods. Are you able to use SGLang at this moment?

Yes, I can try SGLang.

ying2025 avatar Sep 26 '25 08:09 ying2025

@Jeffwan For SGLang I started from https://github.com/vllm-project/aibrix/blob/main/samples/disaggregation/sglang/tp-1p1d.yaml, added podGroupSize, and changed the command:

        replicas: 2
        podGroupSize: 2
                    python3 -m sglang.launch_server \
                        --model-path /models/qwen/Qwen3-8B \
                        --served-model-name qwen3-8B-tp \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --disaggregation-mode decode \
                        --disaggregation-transfer-backend=nixl \
                        --trust-remote-code \
                        --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.aihc-pom.svc.cluster.local:5000" \
                        --nnodes 4 \
                        --node-rank $ROLE_REPLICA_INDEX \
                        --tp-size 4 \
                        --mem-fraction-static 0.8 \
                        --log-level debug

Prefill Pod 1-0

[W928 05:18:04.608896712 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.xx-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).

Prefill Pod 0-0

[W928 05:22:26.839086453 TCPStore.cpp:343] [c10d] TCP client failed to connect/validate to host tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local:5000 - retrying (try=0, timeout=600000ms, delay=72872ms): The client socket has timed out after 600000ms while trying to connect to (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f57c73785e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8bfe (0x7f57b06d5bfe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x136920d (0x7f57abe9620d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bf5791 (0x7f57b0722791 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5bf5949 (0x7f57b0722949 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5bf5d01 (0x7f57b0722d01 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x5ba3eeb (0x7f57b06d0eeb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x4b5 (0x7f57b06d37f5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xc1f385 (0x7f57bf575385 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xc53c74 (0x7f57bf5a9c74 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3896be (0x7f57becdf6be in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x18ae52 (0x557fa5c40e52 in sglang::scheduler_TP0)
frame #12: _PyObject_MakeTpCall + 0x25b (0x557fa5c3777b in sglang::scheduler_TP0)
frame #13: <unknown function> + 0x198bfb (0x557fa5c4ebfb in sglang::scheduler_TP0)
frame #14: _PyObject_Call + 0x118 (0x557fa5c4f768 in sglang::scheduler_TP0)
frame #15: <unknown function> + 0x19526b (0x557fa5c4b26b in sglang::scheduler_TP0)
frame #16: <unknown function> + 0x181b1b (0x557fa5c37b1b in sglang::scheduler_TP0)
frame #17: <unknown function> + 0x38824b (0x7f57becde24b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #18: _PyObject_MakeTpCall + 0x25b (0x557fa5c3777b in sglang::scheduler_TP0)
frame #19: _PyEval_EvalFrameDefault + 0x6907 (0x557fa5c31b77 in sglang::scheduler_TP0)
frame #20: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #21: _PyEval_EvalFrameDefault + 0x6c0 (0x557fa5c2b930 in sglang::scheduler_TP0)
frame #22: <unknown function> + 0x229a25 (0x557fa5cdfa25 in sglang::scheduler_TP0)
frame #23: <unknown function> + 0x18b909 (0x557fa5c41909 in sglang::scheduler_TP0)
frame #24: _PyEval_EvalFrameDefault + 0x6c0 (0x557fa5c2b930 in sglang::scheduler_TP0)
frame #25: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #26: PyObject_Call + 0x122 (0x557fa5c4f5c2 in sglang::scheduler_TP0)
frame #27: _PyEval_EvalFrameDefault + 0x2a7b (0x557fa5c2dceb in sglang::scheduler_TP0)
frame #28: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #29: PyObject_Call + 0x122 (0x557fa5c4f5c2 in sglang::scheduler_TP0)
frame #30: _PyEval_EvalFrameDefault + 0x2a7b (0x557fa5c2dceb in sglang::scheduler_TP0)
frame #31: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #32: _PyEval_EvalFrameDefault + 0x1987 (0x557fa5c2cbf7 in sglang::scheduler_TP0)
frame #33: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #34: _PyEval_EvalFrameDefault + 0x1987 (0x557fa5c2cbf7 in sglang::scheduler_TP0)
frame #35: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #36: _PyEval_EvalFrameDefault + 0x807 (0x557fa5c2ba77 in sglang::scheduler_TP0)
frame #37: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #38: _PyObject_FastCallDictTstate + 0x16d (0x557fa5c369fd in sglang::scheduler_TP0)
frame #39: <unknown function> + 0x194c44 (0x557fa5c4ac44 in sglang::scheduler_TP0)
frame #40: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #41: _PyEval_EvalFrameDefault + 0x6907 (0x557fa5c31b77 in sglang::scheduler_TP0)
frame #42: _PyObject_FastCallDictTstate + 0xc4 (0x557fa5c36954 in sglang::scheduler_TP0)
frame #43: <unknown function> + 0x194ce5 (0x557fa5c4ace5 in sglang::scheduler_TP0)
frame #44: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #45: _PyEval_EvalFrameDefault + 0x59c7 (0x557fa5c30c37 in sglang::scheduler_TP0)
frame #46: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #47: _PyObject_FastCallDictTstate + 0x16d (0x557fa5c369fd in sglang::scheduler_TP0)
frame #48: <unknown function> + 0x194c44 (0x557fa5c4ac44 in sglang::scheduler_TP0)
frame #49: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #50: _PyEval_EvalFrameDefault + 0x6907 (0x557fa5c31b77 in sglang::scheduler_TP0)
frame #51: _PyObject_FastCallDictTstate + 0xc4 (0x557fa5c36954 in sglang::scheduler_TP0)
frame #52: <unknown function> + 0x194ce5 (0x557fa5c4ace5 in sglang::scheduler_TP0)
frame #53: _PyObject_MakeTpCall + 0x1fc (0x557fa5c3771c in sglang::scheduler_TP0)
frame #54: _PyEval_EvalFrameDefault + 0x59c7 (0x557fa5c30c37 in sglang::scheduler_TP0)
frame #55: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #56: _PyEval_EvalFrameDefault + 0x2a7b (0x557fa5c2dceb in sglang::scheduler_TP0)
frame #57: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #58: _PyEval_EvalFrameDefault + 0x807 (0x557fa5c2ba77 in sglang::scheduler_TP0)
frame #59: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #60: _PyEval_EvalFrameDefault + 0x807 (0x557fa5c2ba77 in sglang::scheduler_TP0)
frame #61: _PyFunction_Vectorcall + 0x7c (0x557fa5c416ac in sglang::scheduler_TP0)
frame #62: _PyEval_EvalFrameDefault + 0x6c0 (0x557fa5c2b930 in sglang::scheduler_TP0)

[W928 05:23:39.722606763 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
[W928 05:24:55.695737020 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
[W928 05:25:50.285622671 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).
[W928 05:26:23.456642694 socket.cpp:755] [c10d] The IPv6 network addresses of (tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local, 5000) cannot be retrieved (gai error: -2 - Name or service not known).

$ ping tp-1p1d-roleset-s2ngt-prefill-7bc5f679cd-0.tp-1p1d.aibrix-test.svc.cluster.local
ping: unknown host
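The "unknown host" result suggests a DNS problem rather than an SGLang problem: a per-pod FQDN of the form `<pod-name>.<service>.<namespace>.svc.cluster.local` only resolves when a headless Service exists whose name matches the pod's `subdomain` and whose selector matches those pods. A minimal sketch of what would need to exist (the Service name `tp-1p1d` and namespace `aibrix-test` are taken from the error messages above; the selector label is an assumption based on the configs in this thread, and whether the StormService controller already creates this Service is worth checking with `kubectl get svc -n aibrix-test`):

```
apiVersion: v1
kind: Service
metadata:
  name: tp-1p1d                  # must match the subdomain part of the FQDN
  namespace: aibrix-test
spec:
  clusterIP: None                # headless: creates per-pod DNS records
  selector:
    model.aibrix.ai/name: qwen3-8B-tp   # assumption: must match the pod labels
  ports:
    - name: dist-init
      port: 5000
```

If the Service exists but the name still does not resolve, comparing the hostname SGLang builds from `${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0` against the actual pod name (`kubectl get pods -o wide`) would show whether the template variables expand to the real pod hostname.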

ying2025 avatar Sep 28 '25 12:09 ying2025

@ying2025 Is the issue still relevant?

varungup90 avatar Nov 17 '25 23:11 varungup90