KPA does not scale down after scaling up -> took too long
🐛 Describe the bug
KPA never scales down after scaling up. Scaling up works, but scaling down never happens even under zero load, i.e. when gpu_cache_usage_perc is 0.
Expected behavior: it should scale down to minReplicas.
Current behavior: it does not scale down.
KPA PodAutoscaler:

```yaml
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.aibrix.ai/v1alpha1","kind":"PodAutoscaler","metadata":{"annotations":{},"labels":{"app.kubernetes.io/managed-by":"kustomize","app.kubernetes.io/name":"aibrix"},"name":"podautoscaler-aibrix-model-deepseek-llm-7b-chat-kpa","namespace":"default"},"spec":{"maxReplicas":10,"metricsSources":[{"metricSourceType":"pod","path":"metrics","port":"8000","protocolType":"http","targetMetric":"gpu_cache_usage_perc","targetValue":"0.5"}],"minReplicas":1,"scaleTargetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"aibrix-model-deepseek-llm-7b-chat"},"scalingStrategy":"KPA"}}
  creationTimestamp: "2025-01-23T06:25:26Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: aibrix
  name: podautoscaler-aibrix-model-deepseek-llm-7b-chat-kpa
  namespace: default
  resourceVersion: "102456604"
  uid: dc0647f3-c2af-4352-9db2-a6cbc11d3680
spec:
  maxReplicas: 10
  metricsSources:
  - metricSourceType: pod
    path: metrics
    port: "8000"
    protocolType: http
    targetMetric: gpu_cache_usage_perc
    targetValue: "0.5"
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aibrix-model-deepseek-llm-7b-chat
  scalingStrategy: KPA
status:
  conditions:
  - lastTransitionTime: "2025-01-23T06:25:26Z"
    message: the KPA controller was able to get the target's current scale
    reason: SucceededGetScale
    status: "True"
    type: AbleToScale
```
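For context, a ratio-based autoscaler computes desired replicas from the observed/target metric ratio. A minimal sketch, assuming KPA's stable-mode math follows the standard HPA-style formula (the function name and clamping here are illustrative, not AIBrix's actual code): with `targetValue: "0.5"` and zero observed gpu_cache_usage_perc, the deployment should collapse to minReplicas.

```python
import math

def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    # Standard ratio-based formula: scale current replicas by observed/target,
    # then clamp to the [minReplicas, maxReplicas] window from the spec.
    desired = math.ceil(current * observed / target)
    return max(min_replicas, min(max_replicas, desired))

# Zero load against targetValue 0.5 should drive the deployment to minReplicas:
print(desired_replicas(current=7, observed=0.0, target=0.5))  # 1
```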
Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"model.aibrix.ai/name":"deepseek-llm-7b-chat","model.aibrix.ai/port":"8000"},"name":"aibrix-model-deepseek-llm-7b-chat","namespace":"default"},"spec":{"replicas":1,"selector":{"matchLabels":{"model.aibrix.ai/name":"deepseek-llm-7b-chat"}},"strategy":{"type":"Recreate"},"template":{"metadata":{"annotations":{"prometheus.io/path":"/metrics","prometheus.io/port":"8000","prometheus.io/scrape":"true"},"labels":{"model.aibrix.ai/name":"deepseek-llm-7b-chat"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"machine.cluster.vke.volcengine.com/gpu-name","operator":"In","values":["Tesla-V100"]}]}]}}},"containers":[{"command":["python3","-m","vllm.entrypoints.openai.api_server","--host","0.0.0.0","--port","8000","--model","/models/deepseek-llm-7b-chat","--served-model-name","deepseek-llm-7b-chat","--trust-remote-code","--api-key","xxxx","--dtype","half"],"image":"aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"httpGet":{"path":"/health","port":8000,"scheme":"HTTP"},"initialDelaySeconds":90,"periodSeconds":5,"successThreshold":1,"timeoutSeconds":1},"name":"vllm-openai","ports":[{"containerPort":8000,"protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"path":"/health","port":8000,"scheme":"HTTP"},"initialDelaySeconds":90,"periodSeconds":5,"successThreshold":1,"timeoutSeconds":1},"resources":{"limits":{"nvidia.com/gpu":"1"},"requests":{"nvidia.com/gpu":"1"}},"volumeMounts":[{"mountPath":"/models","name":"model-hostpath"},{"mountPath":"/dev/shm","name":"dshm"}]},{"command":["aibrix_runtime","--port","8080"],"env":[{"name":"INFERENCE_ENGINE","value":"vllm"},{"name":"INFERENCE_ENGINE_ENDPOINT","value":"http://localhost:8000"}],"image":"aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1","livenessProbe":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":3,"periodSeconds":2},"name":"aibrix-runtime","ports":[{"containerPort":8080,"protocol":"TCP"}],"readinessProbe":{"httpGet":{"path":"/ready","port":8080},"initialDelaySeconds":5,"periodSeconds":10},"volumeMounts":[{"mountPath":"/models","name":"model-hostpath"}]}],"initContainers":[{"command":["aibrix_download","--model-uri","tos://aibrix-artifact-testing/models/deepseek-llm-7b-chat/","--local-dir","/models/"],"env":[{"name":"DOWNLOADER_MODEL_NAME","value":"deepseek-llm-7b-chat"},{"name":"DOWNLOADER_NUM_THREADS","value":"16"},{"name":"DOWNLOADER_ALLOW_FILE_SUFFIX","value":"json, safetensors, bin"},{"name":"TOS_ACCESS_KEY","valueFrom":{"secretKeyRef":{"key":"TOS_ACCESS_KEY","name":"tos-credential"}}},{"name":"TOS_SECRET_KEY","valueFrom":{"secretKeyRef":{"key":"TOS_SECRET_KEY","name":"tos-credential"}}},{"name":"TOS_ENDPOINT","value":"tos-cn-beijing.ivolces.com"},{"name":"TOS_REGION","value":"cn-beijing"}],"image":"aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1","name":"init-model","volumeMounts":[{"mountPath":"/models","name":"model-hostpath"}]}],"volumes":[{"hostPath":{"path":"/root/models","type":"DirectoryOrCreate"},"name":"model-hostpath"},{"emptyDir":{"medium":"Memory","sizeLimit":"4Gi"},"name":"dshm"}]}}}}
  creationTimestamp: "2025-01-22T22:01:21Z"
  generation: 13
  labels:
    model.aibrix.ai/name: deepseek-llm-7b-chat
    model.aibrix.ai/port: "8000"
  name: aibrix-model-deepseek-llm-7b-chat
  namespace: default
  resourceVersion: "102924827"
  uid: 83176d98-18d8-492e-a726-cd8ddc75bac3
spec:
  progressDeadlineSeconds: 600
  replicas: 7   # <-- this part: still 7 under zero load
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-llm-7b-chat
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        model.aibrix.ai/name: deepseek-llm-7b-chat
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: machine.cluster.vke.volcengine.com/gpu-name
                operator: In
                values:
                - Tesla-V100
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --model
        - /models/deepseek-llm-7b-chat
        - --served-model-name
        - deepseek-llm-7b-chat
        - --trust-remote-code
        - --api-key
        - xxxx
        - --dtype
        - half
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
        - mountPath: /dev/shm
          name: dshm
      - command:
        - aibrix_runtime
        - --port
        - "8080"
        env:
        - name: INFERENCE_ENGINE
          value: vllm
        - name: INFERENCE_ENGINE_ENDPOINT
          value: http://localhost:8000
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        name: aibrix-runtime
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - aibrix_download
        - --model-uri
        - tos://aibrix-artifact-testing/models/deepseek-llm-7b-chat/
        - --local-dir
        - /models/
        env:
        - name: DOWNLOADER_MODEL_NAME
          value: deepseek-llm-7b-chat
        - name: DOWNLOADER_NUM_THREADS
          value: "16"
        - name: DOWNLOADER_ALLOW_FILE_SUFFIX
          value: json, safetensors, bin
        - name: TOS_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_ACCESS_KEY
              name: tos-credential
        - name: TOS_SECRET_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_SECRET_KEY
              name: tos-credential
        - name: TOS_ENDPOINT
          value: tos-cn-beijing.ivolces.com
        - name: TOS_REGION
          value: cn-beijing
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1
        imagePullPolicy: IfNotPresent
        name: init-model
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /root/models
          type: DirectoryOrCreate
        name: model-hostpath
      - emptyDir:
          medium: Memory
          sizeLimit: 4Gi
        name: dshm
status:
  availableReplicas: 7
  conditions:
  - lastTransitionTime: "2025-01-22T22:23:26Z"
    lastUpdateTime: "2025-01-22T22:32:56Z"
    message: ReplicaSet "aibrix-model-deepseek-llm-7b-chat-5c54b7dd47" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2025-01-23T20:23:21Z"
    lastUpdateTime: "2025-01-23T20:23:21Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 13
  readyReplicas: 7
  replicas: 7
  updatedReplicas: 7
```
GPU cache usage in one of the pods:
GPU cache usage collected by aibrix-controller-manager:
You can see 7 pods still running even though the target metric is 0.
Steps to Reproduce
No response
Expected behavior
Pods should scale down to minReplicas.
Environment
- aibrix version: v0.2.0-rc.1
- platform: Kubernetes
- model: deepseek-llm-7b-chat
Please include the `kubectl describe podautoscaler` output next time; it gives us more details such as the conditions, events, and related controller-manager logs. For example:
kubectl describe podautoscaler llama2-70b-pa
Does this issue still exist? It seems like a blocker.
@gangmuk I remember you did some follow-up autoscaling testing. Is this issue still unresolved? We were supposed to have this working by around March, and I notice the issue was filed in late January. Can you confirm?
@Jeffwan It was resolved. I remember I was using the wrong unit (0.0–1.0 vs. 0–100).
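For future readers, here is a hedged illustration of how such a unit mismatch blocks scale-down (assuming a standard ratio-based formula; the numbers are made up, not from this cluster): if the metric is read on a 0–100 percent scale while targetValue is written as a fraction (0.5), the observed/target ratio is inflated by 100x, so the autoscaler keeps replicas pinned at maxReplicas even at modest load.

```python
import math

def desired_replicas(current, observed, target, min_r=1, max_r=10):
    # Ratio-based sketch: scale by observed/target, clamp to [min_r, max_r].
    return max(min_r, min(max_r, math.ceil(current * observed / target)))

# Metric reported as a percent (30.0) against a fractional target (0.5):
# the ratio is 60x, so replicas pin at maxReplicas and never come down.
print(desired_replicas(7, 30.0, 0.5))   # 10 (clamped)

# Consistent units (both fractions): 30% usage against a 50% target
# actually calls for fewer replicas.
print(desired_replicas(7, 0.30, 0.5))   # 5
```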