
aibrix can't find the correct router

Open · XiaobinZhao opened this issue 2 months ago • 15 comments

🐛 Describe the bug

Brief description of the issue:

1P(TP=16)1D(TP=16) and 2P(TP=16)1D(TP=16) work normally, but 4P(TP=8)1D(TP=16) fails when I run this curl:

LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

curl -v http://${ENDPOINT}/v1/models/

curl -v http://${ENDPOINT}/v1/chat/completions -H "routing-strategy: pd" -H "Content-Type: application/json" -d '{
    "model": "DeepSeek-R1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "help me write a random generator in python"}
    ],
    "temperature": 0.7,
     "max_tokens": 10
}'

The output is:

{"error":{"code":null,"message":"httproutes.gateway.networking.k8s.io \"DeepSeek-R1-router\" not found","param":null,"type":"api_error"}}

My YAML (sglang-4P1D.yaml):

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: sglang-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: sglang-1p1d
  template:
    metadata:
      labels:
        app: sglang-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 4
          podGroupSize: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: DeepSeek-R1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/hostname
                            operator: In
                            values:
                              - pod1-gpu-027
                              - pod1-gpu-028
                              - pod1-gpu-029
                              - pod1-gpu-030
                              - pod1-gpu-031
                              - pod1-gpu-032
              containers:
                - name: prefill
                  image: 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m sglang.launch_server \
                        --model-path /llm/deepseek/DeepSeek-R1-0528-full \
                        --served-model-name DeepSeek-R1 \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --mem-fraction-static 0.9 \
                        --tp-size 8
                  env:
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "0"
                    - name: NCCL_DEBUG
                      value: "WARN"
                    - name: TORCH_CUDA_ARCH_LIST
                      value: "9.0"
                  volumeMounts:
                    - name: model-vol
                      mountPath: /llm
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    limits:
                      nvidia.com/gpu: 8
                  securityContext:
                    allowPrivilegeEscalation: true
                    readOnlyRootFilesystem: false
                    runAsNonRoot: false
                    privileged: true
                    capabilities:
                      add:
                        - IPC_LOCK
              volumes:
                - name: model-vol
                  hostPath:
                    path: /llm
                    type: Directory
                - emptyDir:
                    medium: Memory
                  name: shared-mem
        - name: decode
          replicas: 1
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: DeepSeek-R1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/hostname
                            operator: In
                            values:
                              - pod1-gpu-027
                              - pod1-gpu-028
                              - pod1-gpu-029
                              - pod1-gpu-030
                              - pod1-gpu-031
                              - pod1-gpu-032
              containers:
                - name: decode
                  image: 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m sglang.launch_server \
                        --model-path /llm/deepseek/DeepSeek-R1-0528-full \
                        --served-model-name DeepSeek-R1 \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --disaggregation-mode decode \
                        --disaggregation-transfer-backend=mooncake \
                        --trust-remote-code \
                        --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 \
                        --dist-init-addr "${PODSET_NAME}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
                        --nnodes 2 \
                        --node-rank $POD_GROUP_INDEX \
                        --tp-size 16 \
                        --mem-fraction-static 0.8
                  env:
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "0"
                    - name: NCCL_DEBUG
                      value: "WARN"
                    - name: TORCH_CUDA_ARCH_LIST
                      value: "9.0"
                  volumeMounts:
                    - name: model-vol
                      mountPath: /llm
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    limits:
                      nvidia.com/gpu: 8
                  securityContext:
                    allowPrivilegeEscalation: true
                    readOnlyRootFilesystem: false
                    runAsNonRoot: false
                    privileged: true
                    capabilities:
                      add:
                        - IPC_LOCK
              volumes:
                - name: model-vol
                  hostPath:
                    path: /llm
                    type: Directory
                - emptyDir:
                    medium: Memory
                  name: shared-mem

Steps to Reproduce

  1. kubectl apply -f sglang-4P1D.yaml
  2. Run the curl commands:
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

curl -v http://${ENDPOINT}/v1/models/
curl -v http://10.24.8.71:80/v1/models/

curl -v http://${ENDPOINT}/v1/chat/completions -H "routing-strategy: pd" -H "Content-Type: application/json" -d '{
    "model": "DeepSeek-R1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "help me write a random generator in python"}
    ],
    "temperature": 0.7,
     "max_tokens": 10
}'

Expected behavior

A normal chat completion response is returned.

Environment

  1. aibrix version: 2025/10/21 nightly
  2. sglang version: 0.4.10

@Jeffwan

XiaobinZhao avatar Oct 22 '25 08:10 XiaobinZhao

Can you also check the aibrix-gateway-plugins pod logs in aibrix-system?
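For example (assuming the plugins run as a deployment named aibrix-gateway-plugins; adjust if yours differs):

# Tail recent gateway-plugins logs to see the routing lookup that failed.
kubectl logs -n aibrix-system deployment/aibrix-gateway-plugins --tail=200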

googs1025 avatar Oct 22 '25 09:10 googs1025

@googs1025

[screenshots: aibrix-gateway-plugins pod logs]

XiaobinZhao avatar Oct 22 '25 09:10 XiaobinZhao

If I understand correctly, the pd routing-strategy no longer uses the HTTPRoute CRD for routing. 🤔 Is there more info? cc @Jeffwan

googs1025 avatar Oct 23 '25 08:10 googs1025

@googs1025 I don't know whether the pd routing-strategy still uses the HTTPRoute CRD for routing or not; I'm new at this. What other information do you need?

XiaobinZhao avatar Oct 23 '25 08:10 XiaobinZhao

If I understand correctly, the pd routing-strategy no longer uses the HTTPRoute CRD for routing. 🤔 Is there more info? cc @Jeffwan

Yes, that's true: if we use routing-strategy, the router connects directly to the pod instead of going through an HTTPRoute (via a Kubernetes Service).

I think it's a service discovery problem; we use different orchestration approaches for podGroupSize = 1 and podGroupSize != 1. I will run a test to double-check this case.
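To see which variables the controller actually injected into a given pod (pod name is illustrative):

# Dump the injected env vars discussed in this thread; substitute a real pod name.
kubectl exec <prefill-or-decode-pod> -- env | grep -E 'PODSET_NAME|POD_GROUP_INDEX|ROLESET_NAME|ROLE_NAME|ROLE_TEMPLATE_HASH|STORM_SERVICE_NAME'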

Jeffwan avatar Oct 23 '25 17:10 Jeffwan

Without going through the gateway plugins, can you confirm that a direct sglang request works? 🤔
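For example, bypassing the gateway with a port-forward (pod name is illustrative; pick a real pod from kubectl get pods):

# Forward the model port locally and query the engine directly.
kubectl port-forward pod/<decode-pod> 30000:30000 &
curl http://localhost:30000/v1/models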

googs1025 avatar Oct 24 '25 01:10 googs1025

https://github.com/vllm-project/aibrix/blob/dfb5b35c97c236d2ee9322df08d6d747f6aff3ad/pkg/plugins/gateway/gateway.go#L237-L259

The code path goes through here, but there seems to be a problem: the pd strategy does not actually require an HTTPRoute, which is confusing.

googs1025 avatar Oct 24 '25 01:10 googs1025

If I understand correctly, the pd routing-strategy no longer uses the HTTPRoute CRD for routing. 🤔 Is there more info? cc @Jeffwan

Yes, that's true: if we use routing-strategy, the router connects directly to the pod instead of going through an HTTPRoute (via a Kubernetes Service).

This PR fixes that part of the issue: https://github.com/vllm-project/aibrix/pull/1693

It doesn't solve the whole issue, though.

googs1025 avatar Oct 24 '25 03:10 googs1025

@googs1025

I found that the main issue is the handling when podGroupSize=1: in that case the environment variables PODSET_NAME and POD_GROUP_INDEX are not set, so --dist-init-addr should use the syntax from the previous version: --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000"

Of course, I also made a mistake here: I did not specify --disaggregation-mode prefill in the prefill command within the YAML.

So, does your fix [#1693](https://github.com/vllm-project/aibrix/pull/1693) consider this situation?
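In the meantime, a sketch of how the launch command could handle both cases (env var names are taken from this thread; verify them against your pods before relying on this):

# Sketch: pick --dist-init-addr based on which env vars the StormService
# controller injected. Check with `kubectl exec <pod> -- env` first.
if [ -n "${PODSET_NAME:-}" ]; then
  # podGroupSize != 1: pod-set based DNS naming
  ADDR="${PODSET_NAME}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000"
  RANK="${POD_GROUP_INDEX}"
else
  # podGroupSize == 1: role-set naming from the previous version
  ADDR="${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-${ROLE_REPLICA_INDEX}.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000"
  RANK=0
fi
echo "--dist-init-addr $ADDR --node-rank $RANK"  # splice into the sglang launch args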

XiaobinZhao avatar Oct 24 '25 03:10 XiaobinZhao

@googs1025

I found that the main issue is the handling when podGroupSize=1: in that case the environment variables PODSET_NAME and POD_GROUP_INDEX are not set, so --dist-init-addr should use the syntax from the previous version: --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000"

Of course, I also made a mistake here: I did not specify --disaggregation-mode prefill in the prefill command within the YAML.

So, does your fix [#1693](https://github.com/vllm-project/aibrix/pull/1693) consider this situation?

No. If you would be willing to provide this fix, that would be great. 😄

googs1025 avatar Oct 24 '25 03:10 googs1025

@Jeffwan @googs1025 After I modified --dist-init-addr in the YAML, 4P(TP=8)1D(TP=16) runs successfully. Here is the YAML:

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: sglang-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: sglang-1p1d
  template:
    metadata:
      labels:
        app: sglang-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 4
          podGroupSize: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: DeepSeek-R1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/hostname
                            operator: In
                            values:
                              - pod1-gpu-027
                              - pod1-gpu-028
                              - pod1-gpu-029
                              - pod1-gpu-030
                              - pod1-gpu-031
                              - pod1-gpu-032
              containers:
                - name: prefill
                  image: 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m sglang.launch_server \
                        --model-path /llm/deepseek/DeepSeek-R1-0528-full \
                        --served-model-name DeepSeek-R1 \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --disaggregation-mode prefill \
                        --disaggregation-transfer-backend=mooncake \
                        --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 \
                        --trust-remote-code \
                        --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-${ROLE_REPLICA_INDEX}.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
                        --nnodes 1 \
                        --tp-size 8 \
                        --node-rank 0 \
                        --mem-fraction-static 0.8
                  env:
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "0"
                    - name: NCCL_DEBUG
                      value: "WARN"
                    - name: TORCH_CUDA_ARCH_LIST
                      value: "9.0"
                  volumeMounts:
                    - name: model-vol
                      mountPath: /llm
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    limits:
                      nvidia.com/gpu: 8
                  securityContext:
                    allowPrivilegeEscalation: true
                    readOnlyRootFilesystem: false
                    runAsNonRoot: false
                    privileged: true
                    capabilities:
                      add:
                        - IPC_LOCK
              volumes:
                - name: model-vol
                  hostPath:
                    path: /llm
                    type: Directory
                - emptyDir:
                    medium: Memory
                  name: shared-mem
        - name: decode
          replicas: 1
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: DeepSeek-R1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/hostname
                            operator: In
                            values:
                              - pod1-gpu-027
                              - pod1-gpu-028
                              - pod1-gpu-029
                              - pod1-gpu-030
                              - pod1-gpu-031
                              - pod1-gpu-032
              containers:
                - name: decode
                  image: 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m sglang.launch_server \
                        --model-path /llm/deepseek/DeepSeek-R1-0528-full \
                        --served-model-name DeepSeek-R1 \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --disaggregation-mode decode \
                        --disaggregation-transfer-backend=mooncake \
                        --trust-remote-code \
                        --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 \
                        --dist-init-addr "${PODSET_NAME}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
                        --nnodes 2 \
                        --node-rank $POD_GROUP_INDEX \
                        --tp-size 16 \
                        --mem-fraction-static 0.8
                  env:
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "0"
                    - name: NCCL_DEBUG
                      value: "WARN"
                    - name: TORCH_CUDA_ARCH_LIST
                      value: "9.0"
                  volumeMounts:
                    - name: model-vol
                      mountPath: /llm
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    limits:
                      nvidia.com/gpu: 8
                  securityContext:
                    allowPrivilegeEscalation: true
                    readOnlyRootFilesystem: false
                    runAsNonRoot: false
                    privileged: true
                    capabilities:
                      add:
                        - IPC_LOCK
              volumes:
                - name: model-vol
                  hostPath:
                    path: /llm
                    type: Directory
                - emptyDir:
                    medium: Memory
                  name: shared-mem

It would be better if aibrix handled both podGroupSize=1 and podGroupSize!=1. Looking forward to your updates~

I'll close the issue.

XiaobinZhao avatar Oct 24 '25 03:10 XiaobinZhao

@googs1025

I found that the main issue is the handling when podGroupSize=1: in that case the environment variables PODSET_NAME and POD_GROUP_INDEX are not set, so --dist-init-addr should use the syntax from the previous version: --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000"

cc @Jeffwan Can you also help check this? Is this a bug or is it expected?

googs1025 avatar Oct 24 '25 05:10 googs1025

@googs1025 We can append PODSET_NAME in the podGroupSize != 1 case. This is a compatibility issue; I suggest we fix it.

Jeffwan avatar Oct 24 '25 05:10 Jeffwan

Not only is POD_GROUP_INDEX missing, but also the stormservice.orchestration.aibrix.ai/pod-group-index pod label.
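A quick way to check (label selector taken from the YAML in this thread):

# Show labels for the model's pods; when podGroupSize != 1 we'd expect
# stormservice.orchestration.aibrix.ai/pod-group-index to appear.
kubectl get pods -l model.aibrix.ai/name=DeepSeek-R1 --show-labels | grep -i pod-group-index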

fungaren avatar Nov 21 '25 07:11 fungaren

@fungaren what's your stormservice version?

Jeffwan avatar Nov 21 '25 08:11 Jeffwan

@Jeffwan v0.4.1

fungaren avatar Dec 09 '25 06:12 fungaren