aibrix
The pod metrics may not be up to date
Currently, aibrix syncs the pod metrics in one single goroutine: https://github.com/vllm-project/aibrix/blob/2f5dd942980d1809e3c4569fec196f4ab492daf5/pkg/cache/cache_init.go#L173-L182
In each loop, it fetches the metrics pod by pod.
Assume we have 100 pods and each pod takes 1ms to fetch. One loop then takes 100ms to finish, even though the default refresh interval is 50ms. Note that some pods may be slow or may even cause the fetch to time out, which delays every pod after them in the loop.
Yeah, this is a trade-off between accuracy and performance.
- If one goroutine is created per pod to refresh metrics, it will put pressure on the gateway and may spike load on all pods simultaneously.
- The second point concerns the refresh interval: 50ms is already a pretty aggressive interval for refreshing metrics, but we found that for some use cases even that was not enough. One example is tracking active running requests per pod.
The overall design philosophy is to use pod metrics only for use cases that can tolerate delays, or even unavailability of metrics, for a brief period.
One sweet spot could be a configurable, fixed number of goroutines that fetch pod metrics. Please suggest other alternatives.
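A rough sketch of what the fixed-size pool could look like, assuming a generic fetch callback. `fetchFunc` and `refreshAll` are made-up names for illustration and do not reflect the current aibrix cache code.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for the call that fetches one pod's metric.
type fetchFunc func(pod string) (float64, error)

// refreshAll fetches metrics for every pod using at most `workers` goroutines,
// so concurrency is bounded no matter how many pods exist.
func refreshAll(pods []string, workers int, fetch fetchFunc) map[string]float64 {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string]float64, len(pods))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pod := range jobs {
				v, err := fetch(pod)
				if err != nil {
					continue // skip failed pods; a real cache might keep the old value
				}
				mu.Lock()
				results[pod] = v
				mu.Unlock()
			}
		}()
	}

	for _, p := range pods {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c", "pod-d"}
	// Fake fetch: returns the name length instead of a real metric.
	got := refreshAll(pods, 2, func(p string) (float64, error) {
		return float64(len(p)), nil
	})
	fmt.Println(len(got), "pods refreshed")
}
```

With `workers` set to 1 this degenerates to the current single-goroutine behavior, so the knob covers both ends of the trade-off.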
cc @zhangjyr @nwangfw
> If one goroutine is created per pod to refresh metrics, it will put pressure on gateway and may spike up load on all pods simultaneously.
IMHO, we can add a jitter to the first fetch, and since different pods have different latencies, later fetches won't hit all pods simultaneously either. Assuming we have 200 pods and fetch each pod every 50ms (20 fetches per second per pod), that is 200 × 20 = 4k rps in total, which is acceptable.
I will try to come up with a good solution to this problem in the next few days. 😄
/assign