KPA does not scale down after scaling up -> took too long
🐛 Describe the bug
KPA never scales down after scaling up. Scaling up works, but scaling down never happens even under zero load, i.e. when gpu_cache_usage_perc is 0.
Expected behavior: it should scale down to minReplicas.
Current behavior: it does not scale down.
KPA PodAutoscaler:

```yaml
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.aibrix.ai/v1alpha1","kind":"PodAutoscaler","metadata":{"annotations":{},"labels":{"app.kubernetes.io/managed-by":"kustomize","app.kubernetes.io/name":"aibrix"},"name":"podautoscaler-aibrix-model-deepseek-llm-7b-chat-kpa","namespace":"default"},"spec":{"maxReplicas":10,"metricsSources":[{"metricSourceType":"pod","path":"metrics","port":"8000","protocolType":"http","targetMetric":"gpu_cache_usage_perc","targetValue":"0.5"}],"minReplicas":1,"scaleTargetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"aibrix-model-deepseek-llm-7b-chat"},"scalingStrategy":"KPA"}}
  creationTimestamp: "2025-01-23T06:25:26Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: aibrix
  name: podautoscaler-aibrix-model-deepseek-llm-7b-chat-kpa
  namespace: default
  resourceVersion: "102456604"
  uid: dc0647f3-c2af-4352-9db2-a6cbc11d3680
spec:
  maxReplicas: 10
  metricsSources:
  - metricSourceType: pod
    path: metrics
    port: "8000"
    protocolType: http
    targetMetric: gpu_cache_usage_perc
    targetValue: "0.5"
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aibrix-model-deepseek-llm-7b-chat
  scalingStrategy: KPA
status:
  conditions:
  - lastTransitionTime: "2025-01-23T06:25:26Z"
    message: the KPA controller was able to get the target's current scale
    reason: SucceededGetScale
    status: "True"
    type: AbleToScale
```
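For context, a ratio-based autoscaler computes desired replicas from the observed/target metric ratio. A minimal sketch, assuming KPA's stable-mode math follows the standard HPA-style formula (the function name and clamping here are illustrative, not AIBrix's actual code): with `targetValue: "0.5"` and zero observed gpu_cache_usage_perc, the deployment should collapse to minReplicas.

```python
import math

def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    # Standard ratio-based formula: scale current replicas by observed/target,
    # then clamp to the [minReplicas, maxReplicas] window from the spec.
    desired = math.ceil(current * observed / target)
    return max(min_replicas, min(max_replicas, desired))

# Zero load against targetValue 0.5 should drive the deployment to minReplicas:
print(desired_replicas(current=7, observed=0.0, target=0.5))  # 1
```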
Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"model.aibrix.ai/name":"deepseek-llm-7b-chat","model.aibrix.ai/port":"8000"},"name":"aibrix-model-deepseek-llm-7b-chat","namespace":"default"},"spec":{"replicas":1,"selector":{"matchLabels":{"model.aibrix.ai/name":"deepseek-llm-7b-chat"}},"strategy":{"type":"Recreate"},"template":{"metadata":{"annotations":{"prometheus.io/path":"/metrics","prometheus.io/port":"8000","prometheus.io/scrape":"true"},"labels":{"model.aibrix.ai/name":"deepseek-llm-7b-chat"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"machine.cluster.vke.volcengine.com/gpu-name","operator":"In","values":["Tesla-V100"]}]}]}}},"containers":[{"command":["python3","-m","vllm.entrypoints.openai.api_server","--host","0.0.0.0","--port","8000","--model","/models/deepseek-llm-7b-chat","--served-model-name","deepseek-llm-7b-chat","--trust-remote-code","--api-key","xxxx","--dtype","half"],"image":"aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":3,"httpGet":{"path":"/health","port":8000,"scheme":"HTTP"},"initialDelaySeconds":90,"periodSeconds":5,"successThreshold":1,"timeoutSeconds":1},"name":"vllm-openai","ports":[{"containerPort":8000,"protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"path":"/health","port":8000,"scheme":"HTTP"},"initialDelaySeconds":90,"periodSeconds":5,"successThreshold":1,"timeoutSeconds":1},"resources":{"limits":{"nvidia.com/gpu":"1"},"requests":{"nvidia.com/gpu":"1"}},"volumeMounts":[{"mountPath":"/models","name":"model-hostpath"},{"mountPath":"/dev/shm","name":"dshm"}]},{"command":["aibrix_runtime","--port","8080"],"env":[{"name":"INFERENCE_ENGINE","value":"vllm"},{"name":"INFERENCE_ENGINE_ENDPOINT","value":"http://localhost:8000"}],"image":"aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1","livenessProbe":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":3,"periodSeconds":2},"name":"aibrix-runtime","ports":[{"containerPort":8080,"protocol":"TCP"}],"readinessProbe":{"httpGet":{"path":"/ready","port":8080},"initialDelaySeconds":5,"periodSeconds":10},"volumeMounts":[{"mountPath":"/models","name":"model-hostpath"}]}],"initContainers":[{"command":["aibrix_download","--model-uri","tos://aibrix-artifact-testing/models/deepseek-llm-7b-chat/","--local-dir","/models/"],"env":[{"name":"DOWNLOADER_MODEL_NAME","value":"deepseek-llm-7b-chat"},{"name":"DOWNLOADER_NUM_THREADS","value":"16"},{"name":"DOWNLOADER_ALLOW_FILE_SUFFIX","value":"json, safetensors, bin"},{"name":"TOS_ACCESS_KEY","valueFrom":{"secretKeyRef":{"key":"TOS_ACCESS_KEY","name":"tos-credential"}}},{"name":"TOS_SECRET_KEY","valueFrom":{"secretKeyRef":{"key":"TOS_SECRET_KEY","name":"tos-credential"}}},{"name":"TOS_ENDPOINT","value":"tos-cn-beijing.ivolces.com"},{"name":"TOS_REGION","value":"cn-beijing"}],"image":"aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1","name":"init-model","volumeMounts":[{"mountPath":"/models","name":"model-hostpath"}]}],"volumes":[{"hostPath":{"path":"/root/models","type":"DirectoryOrCreate"},"name":"model-hostpath"},{"emptyDir":{"medium":"Memory","sizeLimit":"4Gi"},"name":"dshm"}]}}}}
  creationTimestamp: "2025-01-22T22:01:21Z"
  generation: 13
  labels:
    model.aibrix.ai/name: deepseek-llm-7b-chat
    model.aibrix.ai/port: "8000"
  name: aibrix-model-deepseek-llm-7b-chat
  namespace: default
  resourceVersion: "102924827"
  uid: 83176d98-18d8-492e-a726-cd8ddc75bac3
spec:
  progressDeadlineSeconds: 600
  replicas: 7   # <-- this part: still 7 under zero load
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-llm-7b-chat
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        model.aibrix.ai/name: deepseek-llm-7b-chat
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: machine.cluster.vke.volcengine.com/gpu-name
                operator: In
                values:
                - Tesla-V100
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --model
        - /models/deepseek-llm-7b-chat
        - --served-model-name
        - deepseek-llm-7b-chat
        - --trust-remote-code
        - --api-key
        - xxxx
        - --dtype
        - half
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.2-distributed
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
        - mountPath: /dev/shm
          name: dshm
      - command:
        - aibrix_runtime
        - --port
        - "8080"
        env:
        - name: INFERENCE_ENGINE
          value: vllm
        - name: INFERENCE_ENGINE_ENDPOINT
          value: http://localhost:8000
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0-rc.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        name: aibrix-runtime
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - aibrix_download
        - --model-uri
        - tos://aibrix-artifact-testing/models/deepseek-llm-7b-chat/
        - --local-dir
        - /models/
        env:
        - name: DOWNLOADER_MODEL_NAME
          value: deepseek-llm-7b-chat
        - name: DOWNLOADER_NUM_THREADS
          value: "16"
        - name: DOWNLOADER_ALLOW_FILE_SUFFIX
          value: json, safetensors, bin
        - name: TOS_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_ACCESS_KEY
              name: tos-credential
        - name: TOS_SECRET_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_SECRET_KEY
              name: tos-credential
        - name: TOS_ENDPOINT
          value: tos-cn-beijing.ivolces.com
        - name: TOS_REGION
          value: cn-beijing
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.1
        imagePullPolicy: IfNotPresent
        name: init-model
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /root/models
          type: DirectoryOrCreate
        name: model-hostpath
      - emptyDir:
          medium: Memory
          sizeLimit: 4Gi
        name: dshm
status:
  availableReplicas: 7
  conditions:
  - lastTransitionTime: "2025-01-22T22:23:26Z"
    lastUpdateTime: "2025-01-22T22:32:56Z"
    message: ReplicaSet "aibrix-model-deepseek-llm-7b-chat-5c54b7dd47" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2025-01-23T20:23:21Z"
    lastUpdateTime: "2025-01-23T20:23:21Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 13
  readyReplicas: 7
  replicas: 7
  updatedReplicas: 7
```
GPU cache usage in one of the pods:
GPU cache usage collected by aibrix-controller-manager:
You can see 7 pods still running even though the target metric is 0.
Steps to Reproduce
No response
Expected behavior
Pods should scale down to minReplicas.
Environment
- aibrix version: v0.2.0-rc.1
- platform: Kubernetes
- model: deepseek-llm-7b-chat
Please include the `kubectl describe podautoscaler` output next time; it gives us more details such as the conditions, events, and related controller-manager logs. For example:
kubectl describe podautoscaler llama2-70b-pa
Does this issue still exist? It seems like a blocker.
@gangmuk I remember you did some follow-up autoscaling testing. Is this issue still unresolved? We were supposed to have this working by around March, and I notice the issue was filed in late January. Can you confirm?
@Jeffwan It was resolved. I remember I was using the wrong unit (0.0–1.0 vs. 0–100).
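For future readers, here is a hedged illustration of how such a unit mismatch blocks scale-down (assuming a standard ratio-based formula; the numbers are made up, not from this cluster): if the metric is read on a 0–100 percent scale while targetValue is written as a fraction (0.5), the observed/target ratio is inflated by 100x, so the autoscaler keeps replicas pinned at maxReplicas even at modest load.

```python
import math

def desired_replicas(current, observed, target, min_r=1, max_r=10):
    # Ratio-based sketch: scale by observed/target, clamp to [min_r, max_r].
    return max(min_r, min(max_r, math.ceil(current * observed / target)))

# Metric reported as a percent (30.0) against a fractional target (0.5):
# the ratio is 60x, so replicas pin at maxReplicas and never come down.
print(desired_replicas(7, 30.0, 0.5))   # 10 (clamped)

# Consistent units (both fractions): 30% usage against a 50% target
# actually calls for fewer replicas.
print(desired_replicas(7, 0.30, 0.5))   # 5
```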