GPU optimizer writes a deployment replica suggestion, and the autoscaler goes through the calculation again
🚀 Feature Description and Motivation
```yaml
metricsSources:
  - endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080
    path: /metrics/aibrix-system/simulator-llama2-7b-a100
    metric: "vllm:deployment_replicas"
    targetValue: "1"
```
In the heterogeneous story, gpu_optimizer exposes an endpoint /metrics/${namespace}/${scale_target_name}. There seem to be some issues here: we used to fetch metrics from pods and run them through some calculation to derive a desired replica count. Here, due to the current component design, gpu_optimizer returns a value that the autoscaler should adopt directly, but that is a different workflow compared to traditional metrics, where the autoscaler aggregates the values and compares them with targetValue to come up with a new replica count.
Let's double-check the logic here; see the sketch below for the contrast.
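To make the contrast concrete, here is a minimal Go sketch of the two workflows. The function names and per-pod metric values are illustrative, not the actual podautoscaler code: the traditional path aggregates pod metrics and divides by targetValue, while the gpu_optimizer path with targetValue = 1 degenerates to adopting the suggestion as-is.

```go
package main

import (
	"fmt"
	"math"
)

// Traditional workflow (illustrative): scrape a raw metric from each
// pod, aggregate, and divide by targetValue to get desired replicas.
func desiredFromPodMetrics(podMetrics []float64, targetValue float64) int32 {
	var sum float64
	for _, m := range podMetrics {
		sum += m
	}
	return int32(math.Ceil(sum / targetValue))
}

// gpu_optimizer workflow (illustrative): the external endpoint already
// returns a replica count as vllm:deployment_replicas. With targetValue
// set to "1", the same formula collapses to adopting the suggestion
// directly -- the autoscaler's calculation step adds nothing.
func desiredFromOptimizer(suggestedReplicas float64) int32 {
	targetValue := 1.0
	return int32(math.Ceil(suggestedReplicas / targetValue))
}

func main() {
	// Traditional: three pods reporting, e.g., concurrent requests.
	fmt.Println(desiredFromPodMetrics([]float64{12, 9, 14}, 10)) // 4
	// External: gpu_optimizer already suggests 3 replicas.
	fmt.Println(desiredFromOptimizer(3)) // 3
}
```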
Use Case
No response
Proposed Solution
No response
In the podautoscaler settings, the targetValue is set to "1", so KPA will scale to the integer value the gpu_optimizer suggests. However, we currently depend on the KPA algorithm to stabilize the changes. I think scaling out is fine; scaling in, however, is sometimes too conservative. We could create a new policy to customize the behavior (see the sketch below).
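As one shape such a policy could take, here is a minimal Go sketch: scale out immediately, but only scale in to the highest suggestion seen inside a trailing window. All names here are hypothetical; this is not the KPA or aibrix implementation, just an illustration of a customizable scale-in policy.

```go
package main

import (
	"fmt"
	"time"
)

// rec records one replica suggestion from the optimizer.
type rec struct {
	at       time.Time
	replicas int32
}

// scaleInWindow scales out immediately but only scales in to the
// maximum suggestion seen within a trailing window.
type scaleInWindow struct {
	window  time.Duration
	history []rec
}

func (w *scaleInWindow) Desired(now time.Time, current, suggested int32) int32 {
	if suggested >= current {
		// Scale-out (or hold): adopt the suggestion right away.
		w.history = nil
		return suggested
	}
	// Scale-in: only go as low as the highest suggestion in the window,
	// so one transient low value cannot trigger an aggressive scale-in.
	w.history = append(w.history, rec{at: now, replicas: suggested})
	cutoff := now.Add(-w.window)
	highest := suggested
	kept := w.history[:0]
	for _, r := range w.history {
		if r.at.Before(cutoff) {
			continue // drop suggestions that aged out of the window
		}
		kept = append(kept, r)
		if r.replicas > highest {
			highest = r.replicas
		}
	}
	w.history = kept
	return highest
}

func main() {
	w := &scaleInWindow{window: 5 * time.Minute}
	now := time.Now()
	fmt.Println(w.Desired(now, 4, 3))                  // 3
	fmt.Println(w.Desired(now.Add(time.Minute), 4, 2)) // still 3
}
```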
@Jeffwan How should we understand "this would be a different workflow compared to traditional metrics"?
Does gpu_optimizer have the same input and output as other PAs? I.e., does it take metrics in and return a scalar replica count?
@kr11 Yes, the gpu_optimizer exposes an endpoint, and the returned metric looks like vllm:deployment_replicas = 1. My point was that we used to transform a metric into a deployment replica count, but now the autoscaler accepts the replica count directly, so targetValue would always be 1. This could be a design issue with the external metric source: if we do not scrape metrics from pods but from an external service, the aggregation logic changes. That's the "different workflow"; see the sketch below.
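For completeness, here is a hedged Go sketch of what fetching the external value could look like. It assumes the endpoint serves a Prometheus-style text line such as `vllm:deployment_replicas 1`; the actual response format and parsing in aibrix may differ. The key point: there is a single value per deployment, so the per-pod aggregation step disappears.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// fetchSuggestedReplicas reads the gpu_optimizer endpoint for one scale
// target. Hypothetical helper; parsing is illustrative only.
func fetchSuggestedReplicas(namespace, target string) (int32, error) {
	url := fmt.Sprintf(
		"http://gpu-optimizer.aibrix-system.svc.cluster.local:8080/metrics/%s/%s",
		namespace, target)
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, "vllm:deployment_replicas") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, err
		}
		// One value for the whole deployment: no per-pod
		// aggregation step, unlike pod-scraped metrics.
		return int32(v), nil
	}
	return 0, fmt.Errorf("vllm:deployment_replicas not found")
}

func main() {
	n, err := fetchSuggestedReplicas("aibrix-system", "simulator-llama2-7b-a100")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("suggested replicas:", n)
}
```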
@zhangjyr @nwangfw What's the status of this issue?