
GetUtilization retrieves metrics information for pods, but the algorithm for calculating utilization is inaccurate.

Open · wangchuanfang opened this issue 2 months ago · 1 comment

🚀 Feature Description and Motivation

GetUtilization retrieves metrics from pods, but the algorithm it uses to calculate utilization is inaccurate.

Use Case

Step 1. Create a P2D2 deepseek-r1 deployment by applying the following YAML with `kubectl apply`:


apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-read
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - watch
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-read-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-read
subjects:
- kind: ServiceAccount
  name: default
  namespace: default   
---
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: pool-xpyd
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: pool-xpyd
  template:
    metadata:
      labels:
        app: pool-xpyd
    spec:
      roles:
        - name: routing
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                app: pool-xpyd
                role: routing
                app.kubernetes.io/name: deepseek-r1-slo
                model.aibrix.ai/name: deepseek-r1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang                
            spec:
              containers:
                - name: mini-lb
                  # image: docker.1ms.run/aibrix/sglang-router:v0.1.6
                  image: docker.1ms.run/aibrix/sglang-router:v0.1.7-patch.1-20250731
                  # image: docker.1ms.run/aibrix/sglang-router:v0.1.9
                  # image: 172.16.106.102/sglang:v0.1.9-sgl-router-v0.3.3
                  command: [ "sh", "-c" ]
                  args:
                    - |
                      python3 -m sglang_router.launch_router \
                        --pd-disaggregation \
                        --policy round_robin \
                        --host 0.0.0.0 \
                        --service-discovery \
                        --service-discovery-port 30000 \
                        --prefill-selector storm-service-name=$STORM_SERVICE_NAME role-name=prefill stormservice.orchestration.aibrix.ai/role-replica-index=0 \
                        --decode-selector storm-service-name=$STORM_SERVICE_NAME role-name=decode stormservice.orchestration.aibrix.ai/role-replica-index=0 \
                        --service-discovery-namespace default
        - name: prefill
          replicas: 2
          stateful: true
          template:
            metadata:
              annotations:
                k8s.volcengine.com/pod-networks: |
                  [
                    {
                      "cniConf":{
                          "name":"rdma"
                      }
                    }
                  ]
              labels:
                app.kubernetes.io/name: deepseek-r1-slo
                model.aibrix.ai/name: deepseek-r1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang
                # model.aibrix.ai/deployment: deepseek-r1-slo
            spec:
              # nodeSelector:
              #   type: H800
              containers:
                - name: prefill
                  # image: 172.16.106.153/sglang:v0.4.9.post2-8-g10c00166-deepep.9eb2f84
                  image: 172.16.106.102/sglang:v0.5.1.post3-cu126
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m sglang.launch_server \
                        --model-path /data/deepseek-ai/DeepSeek-R1 \
                        --served-model-name deepseek-r1 \
                        --disaggregation-ib-device mlx5_4 \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --disaggregation-mode prefill \
                        --disaggregation-transfer-backend=mooncake \
                        --trust-remote-code \
                        --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
                        --nnodes 2 \
                        --node-rank $ROLE_REPLICA_INDEX \
                        --tp-size 16 \
                        --page-size 1 \
                        --watchdog-timeout 1000000 \
                        --dist-timeout 250 \
                        --mem-fraction-static 0.84 \
                        --max-running-requests 512 \
                        --max-prefill-tokens 32768 \
                        --log-level debug
                  env:
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_HCA
                      value: mlx5_0,mlx5_2,mlx5_3,mlx5_5
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: MC_LOG_LEVEL
                      value: INFO
                  volumeMounts:
                    - name: model-vol
                      mountPath: /data/deepseek-ai
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    requests:
                      nvidia.com/gpu: "8"
                      rdma/rdma_shared_devices: "6"
                    limits:
                      nvidia.com/gpu: "8"
                      rdma/rdma_shared_devices: "6"
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
              volumes:
                - name: model-vol
                  hostPath:
                    path: /data/deepseek-ai/
                    type: Directory
                - emptyDir:
                    medium: Memory
                  name: shared-mem
        - name: decode
          replicas: 2
          stateful: true
          template:
            metadata:
              annotations:
                k8s.volcengine.com/pod-networks: |
                  [
                    {
                      "cniConf":{
                          "name":"rdma"
                      }
                    }
                  ]
              labels:
                app.kubernetes.io/name: deepseek-r1-slo
                model.aibrix.ai/name: deepseek-r1
                model.aibrix.ai/port: "30000"
                model.aibrix.ai/engine: sglang
                # model.aibrix.ai/deployment: deepseek-r1-slo
            spec:
              # nodeSelector:
              #   type: H20
              containers:
                - name: decode
                  # image: 172.16.106.153/sglang:v0.4.9.post2-8-g10c00166-deepep.9eb2f84
                  image: 172.16.106.102/sglang:v0.5.1.post3-cu126
                  command: ["sh", "-c"]
                  args:
                    - |
                      python3 -m sglang.launch_server \
                        --model-path /data/deepseek-ai/DeepSeek-R1 \
                        --served-model-name deepseek-r1 \
                        --disaggregation-ib-device mlx5_4 \
                        --host 0.0.0.0 \
                        --port 30000 \
                        --disaggregation-mode decode \
                        --disaggregation-transfer-backend=mooncake \
                        --trust-remote-code \
                        --dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
                        --nnodes 2 \
                        --node-rank $ROLE_REPLICA_INDEX \
                        --tp-size 16 \
                        --page-size 1 \
                        --watchdog-timeout 1000000 \
                        --dist-timeout 600 \
                        --mem-fraction-static 0.84 \
                        --max-running-requests 2048 \
                        --context-length 4096 \
                        --log-level debug
                  env:
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_HCA
                      value: mlx5_0,mlx5_2,mlx5_3,mlx5_5
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: MC_LOG_LEVEL
                      value: INFO
                  volumeMounts:
                    - name: model-vol
                      mountPath: /data/deepseek-ai
                    - mountPath: /dev/shm
                      name: shared-mem
                  resources:
                    requests:
                      nvidia.com/gpu: "8"
                      rdma/rdma_shared_devices: "6"
                    limits:
                      nvidia.com/gpu: "8"
                      rdma/rdma_shared_devices: "6"
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
              volumes:
                - name: model-vol
                  hostPath:
                    path: /data/deepseek-ai/
                    type: Directory
                - emptyDir:
                    medium: Memory
                  name: shared-mem

Step 2. Set the capacity to 3.0 in `func (p *PendingLoadProvider) Cap() float64`.
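For context, the change above can be sketched as follows. This is a minimal illustration, assuming a simplified `PendingLoadProvider` stand-in; the real gateway type has more fields and methods than shown here.

```go
package main

import "fmt"

// PendingLoadProvider is a simplified stand-in for the gateway's
// pending-load provider; only the Cap method is sketched.
type PendingLoadProvider struct{}

// Cap returns the per-pod pending-load capacity. Setting it to 3.0
// reproduces the threshold seen in the SLO debug logs below.
func (p *PendingLoadProvider) Cap() float64 {
	return 3.0
}

func main() {
	p := &PendingLoadProvider{}
	// Pods whose computed load exceeds this value are reported as
	// "exceeds capacity" by the least_load router.
	fmt.Println(p.Cap())
}
```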

Step 3. Run a benchmark against the inference service using sglang's bench_serving.py.

Step 4. Tail the gateway plugin logs: kubectl -n aibrix-system logs deployments/aibrix-gateway-plugins -f --tail 100

I0930 09:05:42.215619       1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-decode-7fbcc7ccfb-0 exceeds capacity (3.118504 > 3.000000) ===
I0930 09:05:42.215623       1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-decode-7fbcc7ccfb-1 exceeds capacity (3.118504 > 3.000000) ===
I0930 09:05:42.215674       1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-prefill-595c87c74-0 exceeds capacity (3.118504 > 3.000000) ===
I0930 09:05:42.215678       1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-prefill-595c87c74-1 exceeds capacity (3.118504 > 3.000000) ===

Step 5. Retrieve metrics from the inference service's prefill (P) and decode (D) pods.

root@mooncake-master-544698dddf-nkwl2:/sgl-workspace/sglang# curl http://10.233.117.79:30000/metrics
# HELP sglang:num_retracted_reqs The number of retracted requests.
# TYPE sglang:num_retracted_reqs gauge
sglang:num_retracted_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pid="912",pp_rank="0",tp_rank="2"} 0.0
sglang:num_retracted_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pid="910",pp_rank="0",tp_rank="0"} 0.0
sglang:num_retracted_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pid="913",pp_rank="0",tp_rank="3"} 0.0
sglang:num_retracted_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pid="911",pp_rank="0",tp_rank="1"} 0.0
sglang:num_retracted_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pid="915",pp_rank="0",tp_rank="5"} 0.0
sglang:num_retracted_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pid="914",pp_rank="0",tp_rank="4"} 0.0
sglang:num_retracted_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pid="916",pp_rank="0",tp_rank="6"} 0.0
sglang:num_retracted_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pid="917",pp_rank="0",tp_rank="7"} 0.0
# HELP sglang:num_paused_reqs The number of paused requests by async weight sync.
# TYPE sglang:num_paused_reqs gauge
sglang:num_paused_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pid="912",pp_rank="0",tp_rank="2"} 0.0
sglang:num_paused_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pid="910",pp_rank="0",tp_rank="0"} 0.0
sglang:num_paused_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pid="913",pp_rank="0",tp_rank="3"} 0.0
sglang:num_paused_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pid="911",pp_rank="0",tp_rank="1"} 0.0
sglang:num_paused_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pid="915",pp_rank="0",tp_rank="5"} 0.0
sglang:num_paused_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pid="914",pp_rank="0",tp_rank="4"} 0.0
sglang:num_paused_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pid="916",pp_rank="0",tp_rank="6"} 0.0
sglang:num_paused_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pid="917",pp_rank="0",tp_rank="7"} 0.0
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 1.0
sglang:num_running_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 4.0
sglang:num_running_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 1.0
sglang:num_running_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 1.0
sglang:num_running_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 4.0
sglang:num_running_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 1.0
sglang:num_running_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 1.0
sglang:num_running_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 4.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_used_tokens{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 9927.0
sglang:num_used_tokens{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 4321.0
sglang:num_used_tokens{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 2920.0
sglang:num_used_tokens{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 9480.0
sglang:num_used_tokens{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 1765.0
sglang:num_used_tokens{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 5754.0
sglang:num_used_tokens{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 6452.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:token_usage{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.03
sglang:token_usage{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.01
sglang:token_usage{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.01
sglang:token_usage{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.03
sglang:token_usage{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.01
sglang:token_usage{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.02
sglang:token_usage{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.02
# HELP sglang:swa_token_usage The token usage for SWA layers.
# TYPE sglang:swa_token_usage gauge
sglang:swa_token_usage{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:swa_token_usage{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:swa_token_usage{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:swa_token_usage{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:swa_token_usage{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:swa_token_usage{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:swa_token_usage{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:swa_token_usage{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 35.64174458344855
sglang:gen_throughput{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 141.7546862896204
sglang:gen_throughput{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 35.423840583853774
sglang:gen_throughput{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 35.41880810293355
sglang:gen_throughput{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 141.74408727960696
sglang:gen_throughput{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 35.43700884115304
sglang:gen_throughput{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 35.442541417324875
sglang:gen_throughput{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 140.76470568188876
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang:num_grammar_queue_reqs gauge
sglang:num_grammar_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang:num_running_reqs_offline_batch gauge
sglang:num_running_reqs_offline_batch{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:cache_hit_rate{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:cache_hit_rate{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:cache_hit_rate{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:cache_hit_rate{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:cache_hit_rate{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:cache_hit_rate{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:cache_hit_rate{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:spec_accept_length{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:spec_accept_length{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:spec_accept_length{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:spec_accept_length{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:spec_accept_length{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:spec_accept_length{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:spec_accept_length{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang:num_prefill_prealloc_queue_reqs gauge
sglang:num_prefill_prealloc_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_prefill_inflight_queue_reqs The number of requests in the prefill inflight queue.
# TYPE sglang:num_prefill_inflight_queue_reqs gauge
sglang:num_prefill_inflight_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang:num_decode_prealloc_queue_reqs gauge
sglang:num_decode_prealloc_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang:num_decode_transfer_queue_reqs gauge
sglang:num_decode_transfer_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 1.0
sglang:num_decode_transfer_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 2.0
sglang:num_decode_transfer_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:kv_transfer_speed_gb_s The transfer speed of the KV cache in GB/s.
# TYPE sglang:kv_transfer_speed_gb_s gauge
sglang:kv_transfer_speed_gb_s{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:kv_transfer_latency_ms The transfer latency of the KV cache in ms.
# TYPE sglang:kv_transfer_latency_ms gauge
sglang:kv_transfer_latency_ms{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:total_retracted_reqs The total number of retracted requests due to kvcache full.
# TYPE sglang:total_retracted_reqs gauge
sglang:total_retracted_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:total_retracted_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:total_retracted_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:total_retracted_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:total_retracted_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:total_retracted_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:total_retracted_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:total_retracted_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:utilization The utilization.
# TYPE sglang:utilization gauge
sglang:utilization{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:utilization{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:utilization{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:utilization{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:utilization{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:utilization{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:utilization{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:utilization{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:engine_startup_time The time taken for the engine to start up.
# TYPE sglang:engine_startup_time gauge
sglang:engine_startup_time{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:engine_startup_time{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:engine_startup_time{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:engine_startup_time{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:engine_startup_time{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:engine_startup_time{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:engine_startup_time{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:engine_startup_time{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:engine_load_weights_time The time taken for the engine to load weights.
# TYPE sglang:engine_load_weights_time gauge
sglang:engine_load_weights_time{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:engine_load_weights_time{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:engine_load_weights_time{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:engine_load_weights_time{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:engine_load_weights_time{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:engine_load_weights_time{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:engine_load_weights_time{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:engine_load_weights_time{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="deepseek-r1"} 2414.416172027588
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.2",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.4",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.6",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.8",model_name="deepseek-r1"} 2.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="deepseek-r1"} 72.0
sglang:time_to_first_token_seconds_bucket{le="2.0",model_name="deepseek-r1"} 392.0
sglang:time_to_first_token_seconds_bucket{le="4.0",model_name="deepseek-r1"} 957.0
sglang:time_to_first_token_seconds_bucket{le="6.0",model_name="deepseek-r1"} 1006.0
sglang:time_to_first_token_seconds_bucket{le="8.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="40.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="60.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="80.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="100.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="200.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="400.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_count{model_name="deepseek-r1"} 1017.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="deepseek-r1"} 32178.985426187515
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.6",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="deepseek-r1"} 32.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="deepseek-r1"} 36.0
sglang:e2e_request_latency_seconds_bucket{le="4.0",model_name="deepseek-r1"} 43.0
sglang:e2e_request_latency_seconds_bucket{le="6.0",model_name="deepseek-r1"} 65.0
sglang:e2e_request_latency_seconds_bucket{le="8.0",model_name="deepseek-r1"} 74.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="deepseek-r1"} 94.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="deepseek-r1"} 194.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="deepseek-r1"} 530.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="deepseek-r1"} 928.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="400.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="600.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="1200.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="1800.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="2400.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_count{model_name="deepseek-r1"} 953.0
# HELP sglang:inter_token_latency_seconds Histogram of inter-token latency in seconds.
# TYPE sglang:inter_token_latency_seconds histogram
sglang:inter_token_latency_seconds_sum{model_name="deepseek-r1"} 31132.16234612465
sglang:inter_token_latency_seconds_bucket{le="0.002",model_name="deepseek-r1"} 70.0
sglang:inter_token_latency_seconds_bucket{le="0.004",model_name="deepseek-r1"} 539.0
sglang:inter_token_latency_seconds_bucket{le="0.006",model_name="deepseek-r1"} 1754.0
sglang:inter_token_latency_seconds_bucket{le="0.008",model_name="deepseek-r1"} 3943.0
sglang:inter_token_latency_seconds_bucket{le="0.01",model_name="deepseek-r1"} 6597.0
sglang:inter_token_latency_seconds_bucket{le="0.015",model_name="deepseek-r1"} 17300.0
sglang:inter_token_latency_seconds_bucket{le="0.02",model_name="deepseek-r1"} 56513.0
sglang:inter_token_latency_seconds_bucket{le="0.025",model_name="deepseek-r1"} 245738.0
sglang:inter_token_latency_seconds_bucket{le="0.03",model_name="deepseek-r1"} 706848.0
sglang:inter_token_latency_seconds_bucket{le="0.035",model_name="deepseek-r1"} 972145.0
sglang:inter_token_latency_seconds_bucket{le="0.04",model_name="deepseek-r1"} 1.028826e+06
sglang:inter_token_latency_seconds_bucket{le="0.06",model_name="deepseek-r1"} 1.046894e+06
sglang:inter_token_latency_seconds_bucket{le="0.08",model_name="deepseek-r1"} 1.047984e+06
sglang:inter_token_latency_seconds_bucket{le="0.1",model_name="deepseek-r1"} 1.048503e+06
sglang:inter_token_latency_seconds_bucket{le="0.2",model_name="deepseek-r1"} 1.048518e+06
sglang:inter_token_latency_seconds_bucket{le="0.4",model_name="deepseek-r1"} 1.049121e+06
sglang:inter_token_latency_seconds_bucket{le="0.6",model_name="deepseek-r1"} 1.049392e+06
sglang:inter_token_latency_seconds_bucket{le="0.8",model_name="deepseek-r1"} 1.050743e+06
sglang:inter_token_latency_seconds_bucket{le="1.0",model_name="deepseek-r1"} 1.051077e+06
sglang:inter_token_latency_seconds_bucket{le="2.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="4.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="6.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="8.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="+Inf",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_count{model_name="deepseek-r1"} 1.051093e+06
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="deepseek-r1"} 1.475814e+06
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="deepseek-r1"} 1.008213e+06
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="deepseek-r1"} 953.0
# HELP sglang:num_aborted_requests_total Number of requests aborted.
# TYPE sglang:num_aborted_requests_total counter
sglang:num_aborted_requests_total{model_name="deepseek-r1"} 269.0

As the metrics above show, the pods in this inference service are not operating under high load (the sglang:utilization gauge is 0.0 on every rank).
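As a quick sanity check, the histogram `_sum` and `_count` pairs in the dump above can be used to derive the average latencies; this is a minimal sketch with the values copied verbatim from the metrics output:

```python
# Derive mean latencies from the sglang histogram metrics above.
# The four values are copied directly from the metrics dump.
ttft_sum, ttft_count = 2414.416172027588, 1017   # time_to_first_token_seconds
e2e_sum, e2e_count = 32178.985426187515, 953     # e2e_request_latency_seconds

avg_ttft = ttft_sum / ttft_count   # mean time to first token, seconds
avg_e2e = e2e_sum / e2e_count      # mean end-to-end request latency, seconds

print(f"avg TTFT: {avg_ttft:.2f}s")  # ~2.37s
print(f"avg e2e:  {avg_e2e:.2f}s")   # ~33.77s
```

Note that while request latencies are non-trivial, the utilization gauge still reads 0.0, which is the inconsistency this issue is about.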

Proposed Solution

User requests could be rejected based on the pod's actual load, rather than the inaccurate utilization value reported by GetUtilization.

wangchuanfang avatar Sep 30 '25 09:09 wangchuanfang

/cc @zhangjyr could you help take a look at this issue?

Jeffwan avatar Oct 10 '25 22:10 Jeffwan