Piggybacking more information in response header

Open gangmuk opened this issue 9 months ago • 7 comments

🚀 Feature Description and Motivation

I suggest piggybacking more information on the response header. For example, the gateway currently returns target-pod-ip in the response header. I suggest including more information in the same way. This would be useful because it gives us a snapshot of state at the per-request level, taken when the request is scheduled. Request-granularity information will be very useful for post-analysis and more.

Candidate state information includes queue size, GPU memory utilization, KV cache hit ratio, RPS per GPU, TPS per GPU, etc. The exact list should be discussed. The requirement is that none of them should introduce overhead on the request critical path.

The downside of including more information in the response header is some extra work in the gateway and a larger response. Neither is significant.
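To make the two constraints above concrete, here is a minimal sketch of how a gateway could turn an already-cached pod snapshot into response headers while capping the added bytes. The `x-aibrix-` prefix, the metric names, and the 256-byte budget are assumptions for illustration, not existing AIBrix behavior.

```python
from typing import Dict, Mapping

HEADER_PREFIX = "x-aibrix-"   # hypothetical prefix, not an existing AIBrix header
MAX_EXTRA_BYTES = 256         # hypothetical budget for the added headers


def snapshot_headers(cached: Mapping[str, float],
                     budget: int = MAX_EXTRA_BYTES) -> Dict[str, str]:
    """Format an already-cached pod snapshot as response headers.

    The function only reads values that were collected off the request
    path, and it stops adding headers once the byte budget would be
    exceeded, so the response stays small.
    """
    headers: Dict[str, str] = {}
    used = 0
    for name, value in cached.items():
        key = HEADER_PREFIX + name.replace("_", "-")
        val = f"{value:.3g}"
        cost = len(key) + len(val) + 4  # ": " separator plus CRLF
        if used + cost > budget:
            break
        headers[key] = val
        used += cost
    return headers
```

Because the snapshot is assumed to be refreshed in the background, the per-request cost is a handful of string formats, and the budget makes the worst-case header growth explicit.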

Use Case

post-analysis

Proposed Solution

piggybacking more information on the response header

gangmuk avatar Mar 05 '25 06:03 gangmuk

@Jeffwan @varungup90 WDYT?

gangmuk avatar Mar 05 '25 06:03 gangmuk

Only per-request information should be returned in response headers. The information listed in the issue is already captured in metrics, which are shown in the dashboard and can also be queried by clients.

varungup90 avatar Mar 05 '25 06:03 varungup90

Yeah, that's true. But if we want to map the state to the particular request that was scheduled under it, a snapshot is needed. I wonder what the downside would be. Any thoughts?

gangmuk avatar Mar 05 '25 07:03 gangmuk

Request and response headers must be lightweight. You can dump the state to logs on a per-request basis.
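As a rough illustration of the logging alternative, the sketch below writes one structured log line per request, keyed by request ID, so the snapshot can be joined with the request during post-analysis. The logger name, field names, and helper functions are hypothetical and not taken from the AIBrix codebase.

```python
import json
import logging
import time
from typing import Mapping, Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("gateway.request_snapshot")


def snapshot_record(request_id: str,
                    target_pod_ip: str,
                    state: Mapping[str, float],
                    now: Optional[float] = None) -> str:
    """Build a one-line JSON record tying pod state to a single request."""
    record = {
        "ts": time.time() if now is None else now,
        "request_id": request_id,
        "target_pod_ip": target_pod_ip,
        "state": dict(state),
    }
    return json.dumps(record, sort_keys=True)


def log_snapshot(request_id: str,
                 target_pod_ip: str,
                 state: Mapping[str, float]) -> None:
    """Emit the record; headers on the response stay untouched."""
    logger.info(snapshot_record(request_id, target_pod_ip, state))
```

The trade-off compared with headers is that the client has to fetch and join the logs afterwards instead of reading the snapshot directly from the response.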

varungup90 avatar Mar 05 '25 18:03 varungup90

@varungup90 @Jeffwan Not sure if you've heard of it, but there was a similar discussion in Envoy in the past. They proposed ORCA, an open standard for request cost aggregation. I think it was officially integrated into Envoy. We don't need to follow the exact format, but we could think about an AIBrix equivalent of ORCA for AI-specific metrics and support it in AIBrix.

ORCA issue in Envoy, ORCA design doc
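One way to picture an "AIBrix ORCA" is a single compact load-report header instead of one header per metric. The sketch below is only loosely inspired by that idea: the `key=value` text encoding and the metric names are assumptions made here, not the actual ORCA wire format or an existing AIBrix header.

```python
from typing import Dict, Mapping


def encode_load_report(metrics: Mapping[str, float]) -> str:
    """Pack metrics into one comma-separated key=value header value."""
    return ",".join(f"{k}={v:g}" for k, v in sorted(metrics.items()))


def parse_load_report(value: str) -> Dict[str, float]:
    """Parse the header value back into a dict, skipping malformed parts."""
    out: Dict[str, float] = {}
    for part in value.split(","):
        key, sep, raw = part.strip().partition("=")
        if not sep or not key:
            continue  # no "=" or empty key: ignore this part
        try:
            out[key.strip()] = float(raw)
        except ValueError:
            continue  # non-numeric value: ignore this part
    return out
```

A client doing post-analysis would call `parse_load_report` on the header value of each response and store the result next to the request's latency and token counts.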

gangmuk avatar Mar 06 '25 01:03 gangmuk

An unrelated question: After appending the target-pod-IP to the request header, what is the subsequent process for routing to the specified pod based on the target-pod-IP within the request header?

yexu-code avatar Mar 18 '25 07:03 yexu-code

@yexu-code I don't think we currently support specifying target-pod-ip in the request header to route a request to a specific pod. Is that what you were asking? We could add this feature; I think it would also be useful for testing.

gangmuk avatar Mar 19 '25 07:03 gangmuk