GPU optimizer writes a deployment replica suggestion, and the autoscaler goes through the calculation again
🚀 Feature Description and Motivation
```yaml
metricsSources:
  - endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080
    path: /metrics/aibrix-system/simulator-llama2-7b-a100
    metric: "vllm:deployment_replicas"
    targetValue: "1"
```
In the heterogeneous story, gpu_optimizer exposes an endpoint /metrics/${namespace}/${scale_target_name}. There seem to be some issues here: we used to fetch metrics from pods and run them through some calculation to derive a desired replica count. Here, due to the current component design, gpu_optimizer returns a value that the autoscaler should adopt directly, but that is a different workflow compared to traditional metrics, where the autoscaler aggregates the values and compares them with targetValue to come up with a new replica count.
Let's double-check the logic here; see the sketch below for the contrast.
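To make the contrast concrete, here is a minimal Go sketch of the two workflows. The function names and per-pod metric values are illustrative, not the actual podautoscaler code: the traditional path aggregates pod metrics and divides by targetValue, while the gpu_optimizer path with targetValue = 1 degenerates to adopting the suggestion as-is.

```go
package main

import (
	"fmt"
	"math"
)

// Traditional workflow (illustrative): scrape a raw metric from each
// pod, aggregate, and divide by targetValue to get desired replicas.
func desiredFromPodMetrics(podMetrics []float64, targetValue float64) int32 {
	var sum float64
	for _, m := range podMetrics {
		sum += m
	}
	return int32(math.Ceil(sum / targetValue))
}

// gpu_optimizer workflow (illustrative): the external endpoint already
// returns a replica count as vllm:deployment_replicas. With targetValue
// set to "1", the same formula collapses to adopting the suggestion
// directly -- the autoscaler's calculation step adds nothing.
func desiredFromOptimizer(suggestedReplicas float64) int32 {
	targetValue := 1.0
	return int32(math.Ceil(suggestedReplicas / targetValue))
}

func main() {
	// Traditional: three pods reporting, e.g., concurrent requests.
	fmt.Println(desiredFromPodMetrics([]float64{12, 9, 14}, 10)) // 4
	// External: gpu_optimizer already suggests 3 replicas.
	fmt.Println(desiredFromOptimizer(3)) // 3
}
```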
Use Case
No response
Proposed Solution
No response
In the podautoscaler settings, the targetValue is set to "1", so KPA will scale to the integer value the gpu_optimizer suggests. However, we currently depend on the KPA algorithm to stabilize the changes. I think scaling out is fine; scaling in, however, is sometimes too conservative. We could create a new policy to customize the behavior (see the sketch below).
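As one shape such a policy could take, here is a minimal Go sketch: scale out immediately, but only scale in to the highest suggestion seen inside a trailing window. All names here are hypothetical; this is not the KPA or aibrix implementation, just an illustration of a customizable scale-in policy.

```go
package main

import (
	"fmt"
	"time"
)

// rec records one replica suggestion from the optimizer.
type rec struct {
	at       time.Time
	replicas int32
}

// scaleInWindow scales out immediately but only scales in to the
// maximum suggestion seen within a trailing window.
type scaleInWindow struct {
	window  time.Duration
	history []rec
}

func (w *scaleInWindow) Desired(now time.Time, current, suggested int32) int32 {
	if suggested >= current {
		// Scale-out (or hold): adopt the suggestion right away.
		w.history = nil
		return suggested
	}
	// Scale-in: only go as low as the highest suggestion in the window,
	// so one transient low value cannot trigger an aggressive scale-in.
	w.history = append(w.history, rec{at: now, replicas: suggested})
	cutoff := now.Add(-w.window)
	highest := suggested
	kept := w.history[:0]
	for _, r := range w.history {
		if r.at.Before(cutoff) {
			continue // drop suggestions that aged out of the window
		}
		kept = append(kept, r)
		if r.replicas > highest {
			highest = r.replicas
		}
	}
	w.history = kept
	return highest
}

func main() {
	w := &scaleInWindow{window: 5 * time.Minute}
	now := time.Now()
	fmt.Println(w.Desired(now, 4, 3))                  // 3
	fmt.Println(w.Desired(now.Add(time.Minute), 4, 2)) // still 3
}
```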
@Jeffwan How should we understand "this would be a different workflow compared to traditional metrics"?
Does gpu_optimizer have the same input and output as other PAs? I.e., does it take metrics in and return a scalar replica count?
@kr11 Yes, the gpu_optimizer exposes an endpoint, and the returned metric looks like vllm:deployment_replicas = 1. My point was that we used to transform a metric into a deployment replica count, but now the autoscaler accepts the replica count directly, so targetValue would always be 1. This could be a design issue with the external metric source: if we do not scrape metrics from pods but from an external service, the aggregation logic changes. That's the "different workflow"; see the sketch below.
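For completeness, here is a hedged Go sketch of what fetching the external value could look like. It assumes the endpoint serves a Prometheus-style text line such as `vllm:deployment_replicas 1`; the actual response format and parsing in aibrix may differ. The key point: there is a single value per deployment, so the per-pod aggregation step disappears.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// fetchSuggestedReplicas reads the gpu_optimizer endpoint for one scale
// target. Hypothetical helper; parsing is illustrative only.
func fetchSuggestedReplicas(namespace, target string) (int32, error) {
	url := fmt.Sprintf(
		"http://gpu-optimizer.aibrix-system.svc.cluster.local:8080/metrics/%s/%s",
		namespace, target)
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, "vllm:deployment_replicas") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, err
		}
		// One value for the whole deployment: no per-pod
		// aggregation step, unlike pod-scraped metrics.
		return int32(v), nil
	}
	return 0, fmt.Errorf("vllm:deployment_replicas not found")
}

func main() {
	n, err := fetchSuggestedReplicas("aibrix-system", "simulator-llama2-7b-a100")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("suggested replicas:", n)
}
```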
@zhangjyr @nwangfw What's the status of this issue?