Piggybacking more information in response header
🚀 Feature Description and Motivation
I suggest piggybacking more information on the response headers. For example, the gateway currently returns target-pod-ip in a response header. I suggest returning more state in the same way. This would give us a snapshot of the system state for each request, taken at the moment the request is scheduled. Request-granularity information would be very useful for post-analysis and more.
Candidate state information includes queue size, GPU memory utilization, KV cache hit ratio, RPS for each GPU, TPS for each GPU, etc. The exact list should be discussed. The requirement is that none of them should introduce overhead on the request critical path.
The downsides of including more information in the response headers are extra work in the gateway and a larger response. Neither is significant.
Use Case
post-analysis
Proposed Solution
piggybacking more information on the response header
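As a rough illustration of the proposal, the sketch below packs a per-request snapshot into a single response header as compact JSON. Everything in it is an assumption for discussion: the `PodSnapshot` type, its fields, and the `X-Pod-Snapshot` header name are hypothetical, and it uses Go's `net/http` header type only for demonstration rather than the gateway's actual code path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// PodSnapshot is a hypothetical per-request view of the target pod's state,
// captured when the request is scheduled. The field list is illustrative only.
type PodSnapshot struct {
	QueueSize       int     `json:"queue_size"`
	GPUMemUtil      float64 `json:"gpu_mem_util"`
	KVCacheHitRatio float64 `json:"kv_cache_hit_ratio"`
}

// snapshotHeader is a hypothetical header name, not one the gateway defines.
const snapshotHeader = "X-Pod-Snapshot"

// encodeSnapshot serializes the snapshot into a compact header value.
func encodeSnapshot(s PodSnapshot) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decodeSnapshot parses a header value produced by encodeSnapshot.
func decodeSnapshot(v string) (PodSnapshot, error) {
	var s PodSnapshot
	err := json.Unmarshal([]byte(v), &s)
	return s, err
}

func main() {
	h := http.Header{}
	h.Set("target-pod-ip", "10.0.0.7") // header the gateway already returns

	v, err := encodeSnapshot(PodSnapshot{QueueSize: 3, GPUMemUtil: 0.5, KVCacheHitRatio: 0.25})
	if err != nil {
		panic(err)
	}
	h.Set(snapshotHeader, v)

	// A client doing post-analysis would read the snapshot back per request.
	s, err := decodeSnapshot(h.Get(snapshotHeader))
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Get("target-pod-ip"), s.QueueSize, s.KVCacheHitRatio)
}
```

Using one JSON-valued header instead of one header per metric keeps the header count fixed as the list of metrics changes. The values would need to come from state the gateway already holds at scheduling time so that nothing is fetched on the critical path.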
@Jeffwan @varungup90 WDYT?
Only per-request information should be returned in response headers. The information listed in the issue is already captured in metrics, which are shown in the dashboard and can also be queried by clients.
Yeah, that's true. But if we want to map the state to the particular request that was scheduled under it, a snapshot is needed. I wonder what the downside would be. Any thoughts?
Request and response headers must be lightweight. You can dump the state in logs on a per-request basis.
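The logging alternative could look roughly like the sketch below: one structured line per scheduled request, keyed by a request ID so it can be joined with the response offline. The `ScheduleRecord` type, its fields, and the `request_id` key are assumptions for illustration, not existing gateway structures.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

// ScheduleRecord is a hypothetical per-request log record. The request ID lets
// post-analysis join the record with the request it describes.
type ScheduleRecord struct {
	RequestID       string  `json:"request_id"`
	TargetPodIP     string  `json:"target_pod_ip"`
	QueueSize       int     `json:"queue_size"`
	KVCacheHitRatio float64 `json:"kv_cache_hit_ratio"`
}

// formatRecord renders the record as a single JSON line.
func formatRecord(r ScheduleRecord) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	logger := log.New(os.Stdout, "", log.LstdFlags)

	line, err := formatRecord(ScheduleRecord{
		RequestID:       "req-1",
		TargetPodIP:     "10.0.0.7",
		QueueSize:       3,
		KVCacheHitRatio: 0.25,
	})
	if err != nil {
		logger.Fatal(err)
	}
	// One line per scheduled request; headers stay untouched.
	logger.Println(line)
}
```

The trade-off relative to headers is that the client needs access to the gateway logs, and a shared request ID, to correlate a snapshot with a request.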
@varungup90 @Jeffwan Not sure if you've heard of it, but there was a similar discussion in Envoy in the past. They proposed ORCA (Open Request Cost Aggregation), a proposal for an open standard for request cost aggregation. I think it was officially integrated into Envoy. We don't need to follow the exact format, but we could think about an AIBrix version of ORCA for AI-specific metrics and support it in AIBrix.
An unrelated question: after target-pod-ip is added to the request header, how is the request then routed to the specified pod based on that header?
@yexu-code I don't think we currently support specifying target-pod-ip in the request header to route a request to a specific pod. Is that what you were asking? We can add this feature. I think it could be useful for testing purposes as well.
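If this feature were added, one possible shape is sketched below: honor the header only when it names a pod that is currently ready, and otherwise fall back to normal routing. This is not existing gateway behavior. The function names are hypothetical, and picking the first ready pod is only a stand-in for the real routing policy.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// targetPodHeader is the header name used in this thread.
const targetPodHeader = "target-pod-ip"

// pinnedPod reports whether the requested IP names one of the ready pods.
func pinnedPod(requested string, ready []string) (string, bool) {
	if requested == "" {
		return "", false
	}
	for _, ip := range ready {
		if ip == requested {
			return ip, true
		}
	}
	return "", false
}

// choosePod honors a valid pin from the request header; otherwise it falls back
// to the first ready pod, which stands in for the real routing policy here.
func choosePod(h http.Header, ready []string) (string, error) {
	if ip, ok := pinnedPod(h.Get(targetPodHeader), ready); ok {
		return ip, nil
	}
	if len(ready) == 0 {
		return "", errors.New("no ready pods")
	}
	return ready[0], nil
}

func main() {
	ready := []string{"10.0.0.7", "10.0.0.8"}

	h := http.Header{}
	h.Set(targetPodHeader, "10.0.0.8")

	ip, err := choosePod(h, ready)
	if err != nil {
		panic(err)
	}
	fmt.Println("routing to", ip)
}
```

Checking the requested IP against the ready set keeps a stale or mistyped header from sending traffic to a pod that no longer exists, which matters if the feature is used in tests against a changing deployment.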