aibrix The pod metrics may be not up-to-date

Open spacewander opened this issue 7 months ago • 3 comments

Currently, aibrix syncs the pod metrics in one single goroutine: https://github.com/vllm-project/aibrix/blob/2f5dd942980d1809e3c4569fec196f4ab492daf5/pkg/cache/cache_init.go#L173-L182

In each loop, it will fetch the metrics pod by pod.

Assume we have 100 pods and each pod takes 1ms to fetch. One loop then takes 100ms to finish, even though the default refresh interval is 50ms. Note that some pods may be slow or may even cause the fetch to time out, which delays every pod after them in the loop.
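To make the arithmetic concrete, here is a minimal sketch of the timing. The names `passDuration` and `effectivePeriod` are made up for illustration and are not aibrix code. It assumes every fetch has the same latency and that a new pass cannot start before the previous one has finished.

```go
package main

import (
	"fmt"
	"time"
)

// passDuration returns how long one sequential pass over all pods takes,
// assuming every fetch has the same latency (a simplification).
func passDuration(pods int, perFetch time.Duration) time.Duration {
	return time.Duration(pods) * perFetch
}

// effectivePeriod returns the real time between two refreshes of the same
// pod, assuming a pass cannot start before the previous one has finished.
func effectivePeriod(interval, pass time.Duration) time.Duration {
	if pass > interval {
		return pass
	}
	return interval
}

func main() {
	interval := 50 * time.Millisecond
	pass := passDuration(100, time.Millisecond)
	fmt.Println("one pass:", pass)
	fmt.Println("effective refresh period:", effectivePeriod(interval, pass))
}
```

With 100 pods at 1ms each, the pass takes 100ms, so the configured 50ms interval is never actually achieved in this model.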

Apr 11 '25 10:04 spacewander

Yea, this is a trade-off between accuracy and performance.

If one goroutine is created per pod to refresh metrics, it will put pressure on the gateway and may spike load on all pods simultaneously.
The second point concerns the refresh interval: 50ms is a pretty aggressive interval for refreshing metrics, but we found that for some use cases even that was not enough. One example is tracking active running requests per pod.

The overall design philosophy is to use pod metrics only for use cases that can tolerate delayed metrics, or even metrics that are unavailable for a brief period.

One sweet spot could be a configurable, fixed number of goroutines to fetch pod metrics. Please suggest other alternatives.
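A rough sketch of what a fixed-size pool could look like. This is not the aibrix implementation: `refreshAll`, `fetchFunc`, and the pod names are hypothetical, and the fetch is a stand-in that only sleeps.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFunc is a hypothetical stand-in for the per-pod metrics fetch.
type fetchFunc func(pod string) error

// refreshAll fetches metrics for every pod using at most `workers`
// goroutines and returns the number of successful fetches.
func refreshAll(pods []string, workers int, fetch fetchFunc) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	succeeded := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls pods until the channel is closed.
			for pod := range jobs {
				if err := fetch(pod); err == nil {
					mu.Lock()
					succeeded++
					mu.Unlock()
				}
			}
		}()
	}

	for _, pod := range pods {
		jobs <- pod
	}
	close(jobs)
	wg.Wait()
	return succeeded
}

func main() {
	pods := make([]string, 100)
	for i := range pods {
		pods[i] = fmt.Sprintf("pod-%d", i)
	}
	// Simulated fetch: 1ms per pod, never fails.
	slowFetch := func(string) error {
		time.Sleep(time.Millisecond)
		return nil
	}

	start := time.Now()
	n := refreshAll(pods, 10, slowFetch)
	fmt.Printf("refreshed %d pods in %v\n", n, time.Since(start))
}
```

The worker count caps how many fetches are in flight at once, so one slow pod only stalls its own worker instead of the whole pass. A real version would also want a per-fetch timeout.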

cc @zhangjyr @nwangfw

Apr 11 '25 21:04 varungup90

> If one goroutine is created per pod to refresh metrics, it will put pressure on gateway and may spike up load on all pods simultaneously.

IMHO, we can add jitter to the first fetch, and since different pods have different latencies, later fetches won't hit all pods simultaneously either. Assuming we have 200 pods and fetch each pod every 50ms (20 fetches per second per pod), there will be 200 × 20 = 4k rps, which is acceptable.
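Something like the following is what I have in mind, as a sketch only. `pollPod`, `initialJitter`, and `requestsPerSecond` are hypothetical names, the fetch is a counter instead of a real HTTP call, and the jitter is drawn uniformly from [0, interval), which is just one possible choice.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

// requestsPerSecond estimates the aggregate fetch rate when every pod is
// polled once per interval.
func requestsPerSecond(pods int, interval time.Duration) float64 {
	return float64(pods) * float64(time.Second) / float64(interval)
}

// initialJitter picks a random start delay in [0, interval) so that pods
// are not all fetched at the same instant.
func initialJitter(interval time.Duration) time.Duration {
	if interval <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(interval)))
}

// pollPod waits for a jittered delay, then calls fetch once per interval
// until ctx is cancelled.
func pollPod(ctx context.Context, interval time.Duration, fetch func()) {
	select {
	case <-time.After(initialJitter(interval)):
	case <-ctx.Done():
		return
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		fetch()
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	const pods = 200
	interval := 50 * time.Millisecond
	fmt.Printf("expected load: %.0f rps\n", requestsPerSecond(pods, interval))

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	var fetches atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < pods; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			pollPod(ctx, interval, func() { fetches.Add(1) })
		}()
	}
	wg.Wait()
	fmt.Printf("observed fetches in about 1s: %d\n", fetches.Load())
}
```

Each pod keeps its own phase after the first jittered start, so the load is spread across the interval instead of arriving in one burst.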

Apr 12 '25 11:04 spacewander

I will try to come up with a good solution to this problem in the next few days. 😄

/assign

May 15 '25 13:05 googs1025
