grpc-java
Performance issues caused by abnormal HTTP/2 SETTINGS frame exchange
We experienced two problems in gRPC communication between our Java client and Python server:
- Java client version: 1.51.0
- Python server version: 1.70
1. HTTP/2 SETTINGS frame exchange issue
When a Java gRPC client creates a channel, it waits for the server to return an HTTP/2 SETTINGS frame before moving the channel state to READY (implemented in io.grpc.netty.NettyClientHandler.FrameListener#onSettingsRead). Due to a version-compatibility problem in our Python gRPC server, the server failed to return the SETTINGS frame correctly, so the client's channel remained stuck in the CONNECTING state indefinitely.
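The bounded-wait idea can be illustrated with a minimal, self-contained sketch (plain Java, not grpc-java internals; the names settingsReceived and the 200 ms limit are illustrative assumptions). Instead of parking in CONNECTING forever, the handshake future either completes when SETTINGS arrives or fails after a deadline:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SettingsTimeoutDemo {
    public static void main(String[] args) {
        // Hypothetical stand-in for "the server's SETTINGS frame was read".
        // A real handler would complete this future in onSettingsRead().
        CompletableFuture<Void> settingsReceived = new CompletableFuture<>();

        // In this sketch the server never sends SETTINGS, mirroring the
        // incompatible-server case, so the timeout path is taken.
        try {
            settingsReceived.orTimeout(200, TimeUnit.MILLISECONDS).join();
            System.out.println("READY");
        } catch (CompletionException e) {
            // Treat the transport as failed instead of staying in
            // CONNECTING forever.
            System.out.println("handshake timed out: "
                    + (e.getCause() instanceof TimeoutException));
        }
    }
}
```

This is only a sketch of the pattern; in grpc-java the equivalent change would live in the Netty transport's handshake path.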
2. Performance bottleneck analysis
In this state, if we send a large number of RPC requests without deadlines, the requests accumulate as pending streams in the pendingStreams field of DelayedClientTransport. In our environment, about one million such requests were queued.
When the TCP connection fails to establish, io.grpc.internal.DelayedClientTransport#reprocess is triggered, which calls pendingStreams.removeAll(toRemove). Because toRemove is an ArrayList rather than a Set, removeAll performs a linear contains() scan for every pending stream, giving O(n²) complexity overall. With n around one million, this blocked the IO thread for roughly 30 minutes, causing severe IO thread stalls.
Suggestions for optimization
Based on these observations, we propose two potential improvements for grpc-java:
- Introduce a maximum waiting time for the HTTP/2 SETTINGS frame exchange to avoid waiting indefinitely for incompatible server responses.
- Change the toRemove collection in DelayedClientTransport#reprocess to a Set implementation, reducing the complexity from O(n²) to O(n).