
engine prefix should be provided by the user and we should not append `vllm:`

Open Jeffwan opened this issue 2 months ago • 2 comments

🐛 Describe the bug

W0930 18:13:41.031286       1 fetcher.go:99] Failed to fetch metric vllm:gpu_cache_usage_perc from pod default/mock-llama2-7b-7cc98b7f5f-764t4: metric vllm:gpu_cache_usage_perc not found in central registry. Returning zero value.

Steps to Reproduce

  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: "8000"
      path: metrics
      targetMetric: "avg_prompt_throughput_toks_per_s" # change it to `vllm:avg_prompt_throughput_toks_per_s`
      targetValue: "60"
  scalingStrategy: "KPA"
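The error above is consistent with the controller implicitly prepending the engine prefix to the user-supplied `targetMetric` before looking it up. The following is a minimal Go sketch of that failure mode; the `registry`, `normalize`, and `fetch` names are hypothetical and do not reflect the actual AIBrix code:

```go
package main

import "fmt"

// Hypothetical central registry, keyed by the exact metric name it was
// registered under (here: without the engine prefix).
var registry = map[string]float64{
	"avg_prompt_throughput_toks_per_s": 60.0,
}

// normalize mimics the suspected behavior: implicitly prepending the
// engine prefix to whatever the user put in targetMetric.
func normalize(target string) string {
	return "vllm:" + target
}

// fetch looks up the normalized name and, like the fetcher in the log
// above, returns a zero value with an error when the key is missing.
func fetch(target string) (float64, error) {
	key := normalize(target)
	v, ok := registry[key]
	if !ok {
		return 0, fmt.Errorf("metric %s not found in central registry", key)
	}
	return v, nil
}

func main() {
	if _, err := fetch("avg_prompt_throughput_toks_per_s"); err != nil {
		// The registry key lacks the prefix, so the prefixed lookup fails.
		fmt.Println(err)
	}
}
```

If the user instead supplies the already-prefixed name, normalization double-prefixes it, so neither spelling can win; requiring the full exposed name and dropping the implicit prefix avoids the ambiguity.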

Expected behavior

It should work.

Environment

nightly

Jeffwan avatar Sep 30 '25 18:09 Jeffwan

We should not implicitly append `vllm:`; this is really confusing. We should ask the user to type the full metric name to avoid the conversion.

Jeffwan avatar Sep 30 '25 21:09 Jeffwan

I used the YAML below and it is working. I'm using the mock vLLM app, so I don't need to set the `vllm:` prefix, and it works as expected. Could you please explain in more detail? 🤔

➜  ~ curl http://127.0.0.1:8000/metrics | grep gpu
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama2-7b"} 0.0
vllm:gpu_cache_usage_perc{model_name="text2sql-lora-2"} 0.0

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: mock-llama2-7b-hpa
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
spec:
  scalingStrategy: APA
  minReplicas: 1
  maxReplicas: 2
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: '8000'
      path: /metrics
      targetMetric: gpu_cache_usage_perc
      targetValue: '50'
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mock-llama2-7b

googs1025 avatar Oct 10 '25 01:10 googs1025
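If the resolution is to require the full exposed name, users can check exactly what their pod serves before filling in `targetMetric`. A small sketch using the Prometheus metric-name grammar; the sample file below just reuses the mock app's output from this thread:

```shell
# Sample scrape output, copied from the mock vLLM app above.
cat > /tmp/metrics.txt <<'EOF'
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama2-7b"} 0.0
vllm:gpu_cache_usage_perc{model_name="text2sql-lora-2"} 0.0
EOF

# Extract the exact metric names (prefix included, if present).
# HELP/TYPE comment lines start with '#' and are skipped by the anchor.
grep -oE '^[a-zA-Z_:][a-zA-Z0-9_:]*' /tmp/metrics.txt | sort -u
```

Whatever string this prints is what `targetMetric` would have to match verbatim once no implicit prefixing is done; against a live pod one would pipe `curl -s http://<pod>:8000/metrics` into the same `grep`.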