grpc-java
Avoid unnecessary flushes for unary responses
We currently optimize flushes for unary requests on the client side by delaying flushing until the end of the RPC. Looking at the code, it doesn't appear we're doing the same on the server side.
Using Wireshark with the interop client/server running empty_unary, we can see a single packet for the request but three packets for the response:

We should optimize the response flow so that all three frames are sent with a single flush.
@ejona86 following up on this.
We are working on migrating our homegrown Netty-based HTTP/1.1 RPC framework to grpc-java, and we noticed a significant performance regression on the server side.
In a scenario of many clients (>100) against a single server, the server's CPU usage normalized per request is much worse for gRPC.
We noticed that the number of transmitted segments normalized per request (observed from the Tcp OutSegs counter in /proc/net/snmp) is indeed higher for gRPC than for our current framework. We suspect this may be the primary reason for the performance regression.
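For reference, the measurement above can be reproduced with a small standalone reader for the OutSegs counter. This is a sketch (Linux-only; class and method names are invented for illustration): it matches the column name in the header line rather than hard-coding a field index, since the set of Tcp counters can vary across kernels.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch: read the Tcp OutSegs counter from /proc/net/snmp (Linux only).
// Sample it before and after a load test; the delta divided by the request
// count gives TCP segments sent per request.
public class OutSegs {
    static long read() throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/snmp"));
        String[] header = null;
        for (String line : lines) {
            if (!line.startsWith("Tcp:")) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (header == null) {
                header = fields; // first "Tcp:" line is the column names
                continue;
            }
            // second "Tcp:" line holds the values, in header order
            for (int i = 0; i < header.length; i++) {
                if (header[i].equals("OutSegs")) {
                    return Long.parseLong(fields[i]);
                }
            }
        }
        throw new IllegalStateException("OutSegs not found in /proc/net/snmp");
    }

    public static void main(String[] args) throws IOException {
        long before = read();
        // ... run the benchmark here ...
        long after = read();
        System.out.println("segments sent: " + (after - before));
    }
}
```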
Was this issue (of responding with multiple packets) discussed / mitigated in some way? Can we contribute to fixing it?
Nothing has changed here.
You can contribute, but IIRC part of the trouble here is how trailers are plumbed through the internals. We might need to detect this case directly in Netty, similar to how headers are sent on the client side.
@ejona86 #9273 avoids flushing after writing the message, but flushing after writing the headers is still performed. This is a bit trickier to handle so I'm leaving it to a follow up issue & PR.
This is improved with #9273, but response headers are still a separate packet, so reopening.
@ejona86 do you have any explanation why this change doesn't reduce the number of packets sent? Are there additional packets being sent as part of the HTTP/2 protocol beyond the headers + body + trailers? For example I noticed this flush which doesn't seem related to writing out data, but to flow control management. Is there really an HTTP/2 (or grpc specific) overhead of ~2 packets for each response?
Are there additional packets being sent as part of the HTTP/2 protocol beyond the headers + body + trailers?
There's WINDOW_UPDATE and potentially PING.
That flush is because returnProcessedBytes() may have written a WINDOW_UPDATE.
@ejona86, @ohadgur and I did some research, and it appears the extra packets are PINGs. The reason is that FlowControlPinger has autoTuneFlowControlOn set to true by default, and the ping limiter does not limit pings.
So the server sends a PING for every request (as long as there is no in-flight ping) and also returns an ack for client pings: two extra packets for each request/response cycle.
Do you consider this default setup for flow control to be reasonable? Or maybe the number of pings should be throttled?
We changed the setup to have an explicit flowControlWindow to avoid the extra pings.
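For anyone hitting the same issue, a configuration fragment along these lines is what we mean (assumes the grpc-netty artifact; the port, window size, and MyServiceImpl are placeholder examples): setting an explicit flow-control window disables the BDP auto-tuning that drives the per-request PINGs.

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Explicitly sizing the window turns off autoTuneFlowControl, so no BDP
// probe PINGs are sent. Pick a value appropriate for your actual
// bandwidth-delay product; 1 MiB here is just an example.
Server server = NettyServerBuilder.forPort(8080)
    .flowControlWindow(1024 * 1024)
    .addService(new MyServiceImpl()) // hypothetical service implementation
    .build()
    .start();
```

The trade-off is that a fixed window must be provisioned for the worst-case BDP, whereas auto-tuning adapts per connection.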
I'd be happy to have some approach to limit the number of PINGs. I'm not entirely wild about the approaches I've heard C core and .Net do. I think I'd go with the approach of "delay the ping by the BDP bytes for every monitor PING that hasn't increased the window." Up to some limit (10?). Assuming the connection is saturated, initially we'd do a PING every RTT, and then we'd do it approximately 2RTT, 3RTT, until 11*RTT. (The +1 RTT here is because I am describing the time between PINGs and that includes the time it takes for the PING ack to be received.) We can also limit the minimum BDP used for the delay to 64 KB to reduce unnecessary pings on most unsaturated links.
Do you consider this default setup for flow control to be reasonable
@amirhadadi, from the lengthy exchange at https://github.com/grpc/grpc-java/issues/8260 this seems to be a "feature" and appears to be working as intended. It also looks like the mentioned issue was renamed by maintainers to something obscurely vague, but in short, after https://github.com/grpc/grpc-java/pull/7446 was introduced, having up to 1000 PINGs/second with auto flow control is the new norm.
It is definitely a feature. Without it, in some not-that-rare scenarios, you could be getting < 2% of the available throughput from a connection. And I would actually agree it is too aggressive. My pushing back on #8260 was much more to determine whether it was "someone stumbled across it and thought to themselves, that seems inefficient," which isn't all that interesting, or whether it was "something was going wrong such that the overly aggressive behavior was noticed."
This issue is a case where "something was going wrong." Normally we try to delay flushes and the like, but the timing is actually sensitive here. (Originally the design had chosen to send the PINGs along with connection WINDOW_UPDATES, which would have avoided the flush. But that fails to notice any bytes in some semi-trivial scenarios.)