🚀 Feature Description and Motivation
GetUtilization retrieves metrics for pods, but the algorithm it uses to calculate utilization from those metrics is inaccurate.
Use Case
Step 1: Create a P2D2 (2 prefill, 2 decode) deepseek-r1 model by applying the following YAML with kubectl apply:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: pod-read
rules:
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- watch
- list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: pod-read-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: pod-read
subjects:
- kind: ServiceAccount
name: default
namespace: default
---
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
name: pool-xpyd
spec:
replicas: 1
updateStrategy:
type: InPlaceUpdate
stateful: true
selector:
matchLabels:
app: pool-xpyd
template:
metadata:
labels:
app: pool-xpyd
spec:
roles:
- name: routing
replicas: 1
stateful: true
template:
metadata:
labels:
app: pool-xpyd
role: routing
app.kubernetes.io/name: deepseek-r1-slo
model.aibrix.ai/name: deepseek-r1
model.aibrix.ai/port: "30000"
model.aibrix.ai/engine: sglang
spec:
containers:
- name: mini-lb
# image: docker.1ms.run/aibrix/sglang-router:v0.1.6
image: docker.1ms.run/aibrix/sglang-router:v0.1.7-patch.1-20250731
# image: docker.1ms.run/aibrix/sglang-router:v0.1.9
# image: 172.16.106.102/sglang:v0.1.9-sgl-router-v0.3.3
command: [ "sh", "-c" ]
args:
- |
python3 -m sglang_router.launch_router \
--pd-disaggregation \
--policy round_robin \
--host 0.0.0.0 \
--service-discovery \
--service-discovery-port 30000 \
--prefill-selector storm-service-name=$STORM_SERVICE_NAME role-name=prefill stormservice.orchestration.aibrix.ai/role-replica-index=0 \
--decode-selector storm-service-name=$STORM_SERVICE_NAME role-name=decode stormservice.orchestration.aibrix.ai/role-replica-index=0 \
--service-discovery-namespace default
- name: prefill
replicas: 2
stateful: true
template:
metadata:
annotations:
k8s.volcengine.com/pod-networks: |
[
{
"cniConf":{
"name":"rdma"
}
}
]
labels:
app.kubernetes.io/name: deepseek-r1-slo
model.aibrix.ai/name: deepseek-r1
model.aibrix.ai/port: "30000"
model.aibrix.ai/engine: sglang
# model.aibrix.ai/deployment: deepseek-r1-slo
spec:
# nodeSelector:
# type: H800
containers:
- name: prefill
# image: 172.16.106.153/sglang:v0.4.9.post2-8-g10c00166-deepep.9eb2f84
image: 172.16.106.102/sglang:v0.5.1.post3-cu126
command: ["sh", "-c"]
args:
- |
python3 -m sglang.launch_server \
--model-path /data/deepseek-ai/DeepSeek-R1 \
--served-model-name deepseek-r1 \
--disaggregation-ib-device mlx5_4 \
--host 0.0.0.0 \
--port 30000 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend=mooncake \
--trust-remote-code \
--dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
--nnodes 2 \
--node-rank $ROLE_REPLICA_INDEX \
--tp-size 16 \
--page-size 1 \
--watchdog-timeout 1000000 \
--dist-timeout 250 \
--mem-fraction-static 0.84 \
--max-running-requests 512 \
--max-prefill-tokens 32768 \
--log-level debug
env:
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: NCCL_SOCKET_IFNAME
value: eth0
- name: NCCL_IB_HCA
value: mlx5_0,mlx5_2,mlx5_3,mlx5_5
- name: NCCL_IB_DISABLE
value: "0"
- name: NCCL_IB_GID_INDEX
value: "7"
- name: NCCL_DEBUG
value: "INFO"
- name: MC_LOG_LEVEL
value: INFO
volumeMounts:
- name: model-vol
mountPath: /data/deepseek-ai
- mountPath: /dev/shm
name: shared-mem
resources:
requests:
nvidia.com/gpu: "8"
rdma/rdma_shared_devices: "6"
limits:
nvidia.com/gpu: "8"
rdma/rdma_shared_devices: "6"
securityContext:
capabilities:
add:
- IPC_LOCK
volumes:
- name: model-vol
hostPath:
path: /data/deepseek-ai/
type: Directory
- emptyDir:
medium: Memory
name: shared-mem
- name: decode
replicas: 2
stateful: true
template:
metadata:
annotations:
k8s.volcengine.com/pod-networks: |
[
{
"cniConf":{
"name":"rdma"
}
}
]
labels:
app.kubernetes.io/name: deepseek-r1-slo
model.aibrix.ai/name: deepseek-r1
model.aibrix.ai/port: "30000"
model.aibrix.ai/engine: sglang
# model.aibrix.ai/deployment: deepseek-r1-slo
spec:
# nodeSelector:
# type: H20
containers:
- name: decode
# image: 172.16.106.153/sglang:v0.4.9.post2-8-g10c00166-deepep.9eb2f84
image: 172.16.106.102/sglang:v0.5.1.post3-cu126
command: ["sh", "-c"]
args:
- |
python3 -m sglang.launch_server \
--model-path /data/deepseek-ai/DeepSeek-R1 \
--served-model-name deepseek-r1 \
--disaggregation-ib-device mlx5_4 \
--host 0.0.0.0 \
--port 30000 \
--disaggregation-mode decode \
--disaggregation-transfer-backend=mooncake \
--trust-remote-code \
--dist-init-addr "${ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-0.${STORM_SERVICE_NAME}.default.svc.cluster.local:5000" \
--nnodes 2 \
--node-rank $ROLE_REPLICA_INDEX \
--tp-size 16 \
--page-size 1 \
--watchdog-timeout 1000000 \
--dist-timeout 600 \
--mem-fraction-static 0.84 \
--max-running-requests 2048 \
--context-length 4096 \
--log-level debug
env:
- name: GLOO_SOCKET_IFNAME
value: eth0
- name: NCCL_SOCKET_IFNAME
value: eth0
- name: NCCL_IB_HCA
value: mlx5_0,mlx5_2,mlx5_3,mlx5_5
- name: NCCL_IB_DISABLE
value: "0"
- name: NCCL_IB_GID_INDEX
value: "7"
- name: NCCL_DEBUG
value: "INFO"
- name: MC_LOG_LEVEL
value: INFO
volumeMounts:
- name: model-vol
mountPath: /data/deepseek-ai
- mountPath: /dev/shm
name: shared-mem
resources:
requests:
nvidia.com/gpu: "8"
rdma/rdma_shared_devices: "6"
limits:
nvidia.com/gpu: "8"
rdma/rdma_shared_devices: "6"
securityContext:
capabilities:
add:
- IPC_LOCK
volumes:
- name: model-vol
hostPath:
path: /data/deepseek-ai/
type: Directory
- emptyDir:
medium: Memory
name: shared-mem
Step 2: Set the capacity to 3.0 in `func (p *PendingLoadProvider) Cap() float64`.
Step 3: Run a benchmark against the inference service with sglang's `bench_serving.py`.
Step 4: Tail the gateway plugin logs. Every prefill and decode pod reports the identical load value, and all of them exceed the configured capacity:
kubectl -n aibrix-system logs deployments/aibrix-gateway-plugins -f --tail 100
I0930 09:05:42.215619 1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-decode-7fbcc7ccfb-0 exceeds capacity (3.118504 > 3.000000) ===
I0930 09:05:42.215623 1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-decode-7fbcc7ccfb-1 exceeds capacity (3.118504 > 3.000000) ===
I0930 09:05:42.215674 1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-prefill-595c87c74-0 exceeds capacity (3.118504 > 3.000000) ===
I0930 09:05:42.215678 1 least_load.go:118] === SLO DEBUG: Pod pool-xpyd-roleset-plmb2-prefill-595c87c74-1 exceeds capacity (3.118504 > 3.000000) ===
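With every pod reporting the same pending load 3.118504, a capacity filter at 3.0 rejects all four candidates at once. A minimal sketch of that filtering step, using the values from the log above (the function name and list-based fallback are assumptions for illustration, not the actual gateway code):

```python
CAPACITY = 3.0  # value returned by PendingLoadProvider.Cap() in Step 2

# Pending-load values as reported in the Step 4 log lines.
pod_loads = {
    "pool-xpyd-roleset-plmb2-decode-7fbcc7ccfb-0": 3.118504,
    "pool-xpyd-roleset-plmb2-decode-7fbcc7ccfb-1": 3.118504,
    "pool-xpyd-roleset-plmb2-prefill-595c87c74-0": 3.118504,
    "pool-xpyd-roleset-plmb2-prefill-595c87c74-1": 3.118504,
}

def pods_within_capacity(loads, cap):
    """Keep only pods whose reported load does not exceed the capacity."""
    return [pod for pod, load in loads.items() if load <= cap]

candidates = pods_within_capacity(pod_loads, CAPACITY)
print(candidates)  # every pod exceeds 3.0, so the candidate list is empty
```

Because the computed utilization is identical across all pods, the filter degenerates into an all-or-nothing decision instead of spreading load.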
Step 5: Retrieve the metrics for the inference service's prefill and decode pods:
root@mooncake-master-544698dddf-nkwl2:/sgl-workspace/sglang# curl http://10.233.117.79:30000/metrics
# HELP sglang:num_retracted_reqs The number of retracted requests.
# TYPE sglang:num_retracted_reqs gauge
sglang:num_retracted_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pid="912",pp_rank="0",tp_rank="2"} 0.0
sglang:num_retracted_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pid="910",pp_rank="0",tp_rank="0"} 0.0
sglang:num_retracted_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pid="913",pp_rank="0",tp_rank="3"} 0.0
sglang:num_retracted_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pid="911",pp_rank="0",tp_rank="1"} 0.0
sglang:num_retracted_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pid="915",pp_rank="0",tp_rank="5"} 0.0
sglang:num_retracted_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pid="914",pp_rank="0",tp_rank="4"} 0.0
sglang:num_retracted_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pid="916",pp_rank="0",tp_rank="6"} 0.0
sglang:num_retracted_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pid="917",pp_rank="0",tp_rank="7"} 0.0
# HELP sglang:num_paused_reqs The number of paused requests by async weight sync.
# TYPE sglang:num_paused_reqs gauge
sglang:num_paused_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pid="912",pp_rank="0",tp_rank="2"} 0.0
sglang:num_paused_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pid="910",pp_rank="0",tp_rank="0"} 0.0
sglang:num_paused_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pid="913",pp_rank="0",tp_rank="3"} 0.0
sglang:num_paused_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pid="911",pp_rank="0",tp_rank="1"} 0.0
sglang:num_paused_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pid="915",pp_rank="0",tp_rank="5"} 0.0
sglang:num_paused_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pid="914",pp_rank="0",tp_rank="4"} 0.0
sglang:num_paused_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pid="916",pp_rank="0",tp_rank="6"} 0.0
sglang:num_paused_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pid="917",pp_rank="0",tp_rank="7"} 0.0
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 1.0
sglang:num_running_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 4.0
sglang:num_running_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 1.0
sglang:num_running_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 1.0
sglang:num_running_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 4.0
sglang:num_running_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 1.0
sglang:num_running_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 1.0
sglang:num_running_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 4.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_used_tokens{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 9927.0
sglang:num_used_tokens{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 4321.0
sglang:num_used_tokens{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 2920.0
sglang:num_used_tokens{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 9480.0
sglang:num_used_tokens{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 1765.0
sglang:num_used_tokens{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 5754.0
sglang:num_used_tokens{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 6452.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:token_usage{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.03
sglang:token_usage{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.01
sglang:token_usage{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.01
sglang:token_usage{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.03
sglang:token_usage{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.01
sglang:token_usage{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.02
sglang:token_usage{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.02
# HELP sglang:swa_token_usage The token usage for SWA layers.
# TYPE sglang:swa_token_usage gauge
sglang:swa_token_usage{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:swa_token_usage{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:swa_token_usage{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:swa_token_usage{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:swa_token_usage{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:swa_token_usage{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:swa_token_usage{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:swa_token_usage{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 35.64174458344855
sglang:gen_throughput{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 141.7546862896204
sglang:gen_throughput{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 35.423840583853774
sglang:gen_throughput{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 35.41880810293355
sglang:gen_throughput{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 141.74408727960696
sglang:gen_throughput{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 35.43700884115304
sglang:gen_throughput{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 35.442541417324875
sglang:gen_throughput{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 140.76470568188876
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang:num_grammar_queue_reqs gauge
sglang:num_grammar_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_grammar_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang:num_running_reqs_offline_batch gauge
sglang:num_running_reqs_offline_batch{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_running_reqs_offline_batch{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:cache_hit_rate{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:cache_hit_rate{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:cache_hit_rate{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:cache_hit_rate{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:cache_hit_rate{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:cache_hit_rate{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:cache_hit_rate{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:spec_accept_length{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:spec_accept_length{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:spec_accept_length{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:spec_accept_length{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:spec_accept_length{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:spec_accept_length{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:spec_accept_length{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang:num_prefill_prealloc_queue_reqs gauge
sglang:num_prefill_prealloc_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_prefill_prealloc_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_prefill_inflight_queue_reqs The number of requests in the prefill inflight queue.
# TYPE sglang:num_prefill_inflight_queue_reqs gauge
sglang:num_prefill_inflight_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_prefill_inflight_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang:num_decode_prealloc_queue_reqs gauge
sglang:num_decode_prealloc_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:num_decode_prealloc_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang:num_decode_transfer_queue_reqs gauge
sglang:num_decode_transfer_queue_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 1.0
sglang:num_decode_transfer_queue_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:num_decode_transfer_queue_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 2.0
sglang:num_decode_transfer_queue_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:kv_transfer_speed_gb_s The transfer speed of the KV cache in GB/s.
# TYPE sglang:kv_transfer_speed_gb_s gauge
sglang:kv_transfer_speed_gb_s{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:kv_transfer_speed_gb_s{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:kv_transfer_latency_ms The transfer latency of the KV cache in ms.
# TYPE sglang:kv_transfer_latency_ms gauge
sglang:kv_transfer_latency_ms{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:kv_transfer_latency_ms{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:total_retracted_reqs The total number of retracted requests due to kvcache full.
# TYPE sglang:total_retracted_reqs gauge
sglang:total_retracted_reqs{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:total_retracted_reqs{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:total_retracted_reqs{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:total_retracted_reqs{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:total_retracted_reqs{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:total_retracted_reqs{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:total_retracted_reqs{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:total_retracted_reqs{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:utilization The utilization.
# TYPE sglang:utilization gauge
sglang:utilization{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:utilization{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:utilization{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:utilization{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:utilization{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:utilization{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:utilization{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:utilization{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:engine_startup_time The time taken for the engine to start up.
# TYPE sglang:engine_startup_time gauge
sglang:engine_startup_time{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:engine_startup_time{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:engine_startup_time{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:engine_startup_time{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:engine_startup_time{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:engine_startup_time{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:engine_startup_time{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:engine_startup_time{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:engine_load_weights_time The time taken for the engine to load weights.
# TYPE sglang:engine_load_weights_time gauge
sglang:engine_load_weights_time{dp_rank="1",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="1"} 0.0
sglang:engine_load_weights_time{dp_rank="3",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="3"} 0.0
sglang:engine_load_weights_time{dp_rank="0",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="0"} 0.0
sglang:engine_load_weights_time{dp_rank="2",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="2"} 0.0
sglang:engine_load_weights_time{dp_rank="4",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="4"} 0.0
sglang:engine_load_weights_time{dp_rank="7",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="7"} 0.0
sglang:engine_load_weights_time{dp_rank="6",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="6"} 0.0
sglang:engine_load_weights_time{dp_rank="5",engine_type="unified",model_name="deepseek-r1",pp_rank="0",tp_rank="5"} 0.0
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="deepseek-r1"} 2414.416172027588
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.2",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.4",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.6",model_name="deepseek-r1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.8",model_name="deepseek-r1"} 2.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="deepseek-r1"} 72.0
sglang:time_to_first_token_seconds_bucket{le="2.0",model_name="deepseek-r1"} 392.0
sglang:time_to_first_token_seconds_bucket{le="4.0",model_name="deepseek-r1"} 957.0
sglang:time_to_first_token_seconds_bucket{le="6.0",model_name="deepseek-r1"} 1006.0
sglang:time_to_first_token_seconds_bucket{le="8.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="40.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="60.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="80.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="100.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="200.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="400.0",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="deepseek-r1"} 1017.0
sglang:time_to_first_token_seconds_count{model_name="deepseek-r1"} 1017.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="deepseek-r1"} 32178.985426187515
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.6",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="deepseek-r1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="deepseek-r1"} 32.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="deepseek-r1"} 36.0
sglang:e2e_request_latency_seconds_bucket{le="4.0",model_name="deepseek-r1"} 43.0
sglang:e2e_request_latency_seconds_bucket{le="6.0",model_name="deepseek-r1"} 65.0
sglang:e2e_request_latency_seconds_bucket{le="8.0",model_name="deepseek-r1"} 74.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="deepseek-r1"} 94.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="deepseek-r1"} 194.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="deepseek-r1"} 530.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="deepseek-r1"} 928.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="400.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="600.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="1200.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="1800.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="2400.0",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="deepseek-r1"} 953.0
sglang:e2e_request_latency_seconds_count{model_name="deepseek-r1"} 953.0
# HELP sglang:inter_token_latency_seconds Histogram of inter-token latency in seconds.
# TYPE sglang:inter_token_latency_seconds histogram
sglang:inter_token_latency_seconds_sum{model_name="deepseek-r1"} 31132.16234612465
sglang:inter_token_latency_seconds_bucket{le="0.002",model_name="deepseek-r1"} 70.0
sglang:inter_token_latency_seconds_bucket{le="0.004",model_name="deepseek-r1"} 539.0
sglang:inter_token_latency_seconds_bucket{le="0.006",model_name="deepseek-r1"} 1754.0
sglang:inter_token_latency_seconds_bucket{le="0.008",model_name="deepseek-r1"} 3943.0
sglang:inter_token_latency_seconds_bucket{le="0.01",model_name="deepseek-r1"} 6597.0
sglang:inter_token_latency_seconds_bucket{le="0.015",model_name="deepseek-r1"} 17300.0
sglang:inter_token_latency_seconds_bucket{le="0.02",model_name="deepseek-r1"} 56513.0
sglang:inter_token_latency_seconds_bucket{le="0.025",model_name="deepseek-r1"} 245738.0
sglang:inter_token_latency_seconds_bucket{le="0.03",model_name="deepseek-r1"} 706848.0
sglang:inter_token_latency_seconds_bucket{le="0.035",model_name="deepseek-r1"} 972145.0
sglang:inter_token_latency_seconds_bucket{le="0.04",model_name="deepseek-r1"} 1.028826e+06
sglang:inter_token_latency_seconds_bucket{le="0.06",model_name="deepseek-r1"} 1.046894e+06
sglang:inter_token_latency_seconds_bucket{le="0.08",model_name="deepseek-r1"} 1.047984e+06
sglang:inter_token_latency_seconds_bucket{le="0.1",model_name="deepseek-r1"} 1.048503e+06
sglang:inter_token_latency_seconds_bucket{le="0.2",model_name="deepseek-r1"} 1.048518e+06
sglang:inter_token_latency_seconds_bucket{le="0.4",model_name="deepseek-r1"} 1.049121e+06
sglang:inter_token_latency_seconds_bucket{le="0.6",model_name="deepseek-r1"} 1.049392e+06
sglang:inter_token_latency_seconds_bucket{le="0.8",model_name="deepseek-r1"} 1.050743e+06
sglang:inter_token_latency_seconds_bucket{le="1.0",model_name="deepseek-r1"} 1.051077e+06
sglang:inter_token_latency_seconds_bucket{le="2.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="4.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="6.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="8.0",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_bucket{le="+Inf",model_name="deepseek-r1"} 1.051093e+06
sglang:inter_token_latency_seconds_count{model_name="deepseek-r1"} 1.051093e+06
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="deepseek-r1"} 1.475814e+06
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="deepseek-r1"} 1.008213e+06
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="deepseek-r1"} 953.0
# HELP sglang:num_aborted_requests_total Number of requests aborted.
# TYPE sglang:num_aborted_requests_total counter
sglang:num_aborted_requests_total{model_name="deepseek-r1"} 269.0
Despite the utilization reported by GetUtilization, the pods serving this inference service are not actually operating under high load.
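As a sanity check, the mean latencies implied by the histogram metrics above can be computed directly from the `_sum` and `_count` samples (values copied verbatim from the dump; the computation is just sum/count):

```python
# Mean latencies implied by the sglang histograms in the /metrics dump above.
ttft_sum, ttft_count = 2414.416172027588, 1017.0  # time_to_first_token_seconds
e2e_sum, e2e_count = 32178.985426187515, 953.0    # e2e_request_latency_seconds

avg_ttft = ttft_sum / ttft_count  # ~2.37 s mean time to first token
avg_e2e = e2e_sum / e2e_count     # ~33.77 s mean end-to-end latency

print(f"avg TTFT: {avg_ttft:.2f}s, avg e2e latency: {avg_e2e:.2f}s")
```

Note these are lifetime averages since the engine started; a utilization signal would need deltas over a recent window.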
Proposed Solution
Reject user requests based on the Pod's actual load, as reflected in the engine metrics, rather than on the inaccurate utilization currently computed by GetUtilization.
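A minimal sketch of what a metrics-driven admission check could look like, assuming the gateway already scrapes each pod's `/metrics` endpoint. The metric names match the sglang dump above, but `scrape_value`, `should_reject`, and the threshold are illustrative assumptions, not the actual AIBrix implementation:

```python
# Hypothetical admission check driven by observed engine latency instead of a
# pod-level utilization estimate. Metric names follow the sglang /metrics dump;
# the function names and threshold are assumptions for illustration only.

def scrape_value(metrics_text: str, name: str) -> float:
    """Return the value of the first sample whose name starts with `name`."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    raise KeyError(f"metric not found: {name}")

def should_reject(metrics_text: str, max_avg_e2e_s: float = 60.0) -> bool:
    """Reject new requests when the mean end-to-end latency is too high."""
    total = scrape_value(metrics_text, "sglang:e2e_request_latency_seconds_sum")
    count = scrape_value(metrics_text, "sglang:e2e_request_latency_seconds_count")
    return count > 0 and total / count > max_avg_e2e_s
```

A production version would compute deltas between scrapes (a sliding window) rather than lifetime averages, so that old idle or busy periods do not mask the pod's current load.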
/cc @zhangjyr could you help take a look at this issue?