gRPC Connection Reset (Re-opened)
Problem Description
We are experiencing a common issue with long-lived gRPC connections being reset after periods of inactivity. The example application initializes gRPC clients during startup and stores them in a configuration map.
After approximately 38 minutes of inactivity, subsequent calls fail with:
UNAVAILABLE: io exception
Caused by: io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
16:26:53 - Successful gRPC call made
17:04:26 - Call fails with "Connection reset by peer" (after ~38 minutes of inactivity)
The gRPC client application and the gRPC server are deployed on Kubernetes in separate namespaces.
Example code:
@Configuration
public class ServiceConfig {

    // Host/port lookup maps, populated from application configuration (not shown here)
    private Map<String, String> hostMap;
    private Map<String, Integer> portMap;

    private final Map<String, ServiceGrpcClient> clientMap = new HashMap<>();

    @PostConstruct
    public void init() {
        // Initialize clients eagerly at startup
        hostMap.forEach((key, host) -> {
            ServiceGrpcClient client = new ServiceGrpcClient(host, portMap.get(key));
            clientMap.put(key, client);
        });
    }

    public ServiceGrpcClient getClient(String key) {
        return clientMap.get(key);
    }
}
@Slf4j
@Component
public class GenericServiceClient {

    // Default host/port from properties (not used when the constructor below is called directly)
    @Value("${grpc.service.host}")
    private String host;

    @Value("${grpc.service.port}")
    private Integer port;

    private String serverHost;
    private int targetPort;

    private ManagedChannel channel;
    private ServiceGrpc.ServiceBlockingStub blockingStub;

    public GenericServiceClient(String serverHost, int targetPort) {
        this.serverHost = serverHost;
        this.targetPort = targetPort;
        this.init();
    }

    public void init() {
        log.debug("Connecting to Service: {}, port: {}", this.serverHost, this.targetPort);
        this.channel = this.getManagedChannel();
        this.blockingStub = ServiceGrpc.newBlockingStub(this.channel);
    }

    private ManagedChannel getManagedChannel() {
        return ManagedChannelBuilder.forAddress(this.serverHost, this.targetPort)
                .usePlaintext()
                .keepAliveTime(120, TimeUnit.SECONDS)
                .keepAliveTimeout(60, TimeUnit.SECONDS)
                .build();
    }

    public ServiceGrpc.ServiceBlockingStub getBlockingStub() {
        return blockingStub;
    }

    // some other methods to interact with the service can be added here
}
- Is it a best practice to create gRPC clients at startup (eager loading) vs. lazily?
- Are there best practices for handling idle connection timeouts in gRPC when the clients are created at startup?
- What are the recommended keepalive settings to prevent this issue?
- Is there a way to detect stale connections before attempting to use them?
- How do others handle this in production environments with firewalls and load balancers?
Whether to create connections at startup or lazily depends on the use case; creating them at startup incurs some network and CPU cost even if no RPCs are ever made on them.
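For illustration, a lazy variant of the config class above could create each client on first use instead of in @PostConstruct (a sketch reusing the hypothetical hostMap/portMap and ServiceGrpcClient from the example; a ConcurrentHashMap keeps creation safe under concurrent access):

// Sketch: create the client on first use instead of at startup.
private final Map<String, ServiceGrpcClient> clientMap = new ConcurrentHashMap<>();

public ServiceGrpcClient getClient(String key) {
    return clientMap.computeIfAbsent(key,
            k -> new ServiceGrpcClient(hostMap.get(k), portMap.get(k)));
}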
To keep connections alive when there are no RPCs, without them being closed by firewalls and load balancers, you should set keepAliveWithoutCalls(true) on the ManagedChannel. This causes pings to be sent even when there are no RPCs, and if the connection has been closed, the channel will find out before the next RPC is made.
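For example, applied to the channel builder from the example above (a sketch; the 120s/60s values are just the ones already used in the original config):

ManagedChannel channel = ManagedChannelBuilder.forAddress(serverHost, targetPort)
        .usePlaintext()
        .keepAliveTime(120, TimeUnit.SECONDS)     // interval between keepalive pings
        .keepAliveTimeout(60, TimeUnit.SECONDS)   // how long to wait for a ping ack
        .keepAliveWithoutCalls(true)              // keep pinging even when no RPCs are active
        .build();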
Also refer to this link.
Thanks for the link. We did have keepAliveWithoutCalls(true) set earlier, and still encountered this issue.
Just curious to understand: what happens to the ManagedChannel when "UNAVAILABLE: io exception Caused by: io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer" is thrown?
Does it re-establish the broken connection before the next call? If so, we could put a retry block around the call.
Yes, if the gRPC client detects that the connection is broken via the absence of responses to the keepalive ping frames, it will close the connection and reconnect when a new RPC call is made.
With keepAliveWithoutCalls(true), did you observe the same connection-reset behavior after 38 minutes?
Can you also check if you are limiting connection age/idle time via the maxConnectionAge or the maxConnectionIdle setting on the server builder?
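For reference, those limits would be set roughly like this on a grpc-java server using the Netty transport (a sketch; the port, the values, and the ServiceImpl name are placeholders):

Server server = NettyServerBuilder.forPort(9090)
        .maxConnectionIdle(30, TimeUnit.MINUTES)  // close connections that have been idle this long
        .maxConnectionAge(1, TimeUnit.HOURS)      // close connections older than this
        .addService(new ServiceImpl())            // placeholder service implementation
        .build();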
The same issue is seen with keepAliveWithoutCalls(true).
No other configuration is set, such as limiting connection age/idle time via maxConnectionAge or maxConnectionIdle.
It is a simple managed channel with default settings plus keepAliveTime(120s) and keepAliveTimeout(60s).
Just to re-clarify: with our current settings (gRPC defaults plus keepAliveTime(120s) and keepAliveTimeout(60s)), is the following possible?
Step 1: gRPC encounters an I/O exception with connection reset.
Step 2: gRPC re-establishes the broken connection automatically.
Step 3: If a retry mechanism is in place, the retried call should go through without any issues?
Regards, Hari
Yes, that is correct.
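For completeness, a hand-rolled retry around the blocking stub could look like the sketch below (illustrative only; Request, Response, and method1RPC are placeholders for the generated types, and retries can also be configured declaratively via the gRPC service config):

// Sketch: retry once on UNAVAILABLE; the channel reconnects lazily on the next attempt.
Response callWithRetry(ServiceGrpc.ServiceBlockingStub stub, Request request) {
    try {
        return stub.method1RPC(request);
    } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() == Status.Code.UNAVAILABLE) {
            return stub.method1RPC(request);   // second attempt after the reconnect
        }
        throw e;
    }
}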
Seems like this is resolved. If not, comment, and it can be reopened.
gRPC KeepAlive Settings Issue
@ejona86 Sorry, we are still struggling with this issue and would appreciate some guidance from the community. Could you provide general guidelines or best practices for configuring gRPC keepAlive settings?
Current setup
- Client: Spring Boot Starter 5.0.0 (gRPC 1.51.0)
- Server: Spring Boot Starter 3.5.3 (gRPC 1.28.0)
Server side
- No settings for PERMIT_KEEPALIVE_WITHOUT_CALLS
Client side (initial config)
managedChannel = ManagedChannelBuilder
.forAddress(serverHost, targetPort)
.usePlaintext()
.keepAliveTime(10, TimeUnit.SECONDS)
.keepAliveTimeout(5, TimeUnit.SECONDS)
.keepAliveWithoutCalls(true)
.build();
- We understand this can make the client a potential DDoS risk. We did see errors like ENHANCE_YOUR_CALM.
Client side (new config)
managedChannel = ManagedChannelBuilder
.forAddress(serverHost, targetPort)
.usePlaintext()
.keepAliveTime(120, TimeUnit.SECONDS)
.keepAliveTimeout(60, TimeUnit.SECONDS)
.build();
- Note: This is without keepAliveWithoutCalls.
Issue
We are still seeing the same error:
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Caused by: java.io.IOException: Connection reset by peer
Next steps we want to try
Server side
- Enable PERMIT_KEEPALIVE_WITHOUT_CALLS = true
Client side
managedChannel = ManagedChannelBuilder
.forAddress(serverHost, targetPort)
.usePlaintext()
.keepAliveTime(45, TimeUnit.SECONDS)
.keepAliveTimeout(25, TimeUnit.SECONDS)
.keepAliveWithoutCalls(true)
.build();
- If we encounter errors like Connection reset or StatusRuntimeException, we plan to retry the RPC call.
Keepalive Timeout and Connection Handling Scenario (for the 120/60 case, keepAliveWithoutCalls(false))
T+180s:
- ✅ Keepalive timeout expires
- ✅ gRPC marks connection as TRANSIENT_FAILURE
- ❌ NO MORE PINGS - keepalive stops completely
T+180s+ (next RPC call):
- Client calls method1RPC() - gRPC sees the connection is dead
- Attempts to reconnect (new TCP connection)
- If reconnect succeeds → happy path: the RPC works and a new keepalive cycle starts
- If reconnect fails → "Connection reset by peer"
Question: Will gRPC handle this automatically, or is it the application's responsibility to handle this error?
I'm really curious: what happens on the gRPC side when it sees a dead or stale connection?
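(For reference, the channel's connectivity state can also be inspected from application code; a sketch, assuming the channel and log fields from the client class above:)

ConnectivityState state = channel.getState(false);   // false = don't trigger a connect attempt
if (state == ConnectivityState.TRANSIENT_FAILURE) {
    log.warn("Channel is in TRANSIENT_FAILURE; the next RPC will trigger a reconnect");
}
channel.notifyWhenStateChanged(state,
        () -> log.info("Channel state changed from {} to {}", state, channel.getState(false)));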
Questions
- Are there recommended values or patterns for keepAlive settings to avoid these issues?
- Is enabling PERMIT_KEEPALIVE_WITHOUT_CALLS on the server side advisable in this scenario?
- Any other suggestions to prevent Connection reset by peer errors?
Thank you for your help!
A connection reset will put the channel into TRANSIENT_FAILURE, and reconnection will be retried with exponential backoff. The RPC retry policy, including backoff parameters (initial backoff, max backoff, backoff multiplier), retryable status codes, and maximum attempts, can be configured in the gRPC service configuration. These settings exist so that the application code doesn't have to do the retrying itself for retriable errors.
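A minimal client-side setup of such a retry policy might look like this (a sketch; example.Service is a placeholder for the actual service name and the numbers are only illustrative):

Map<String, Object> retryPolicy = new HashMap<>();
retryPolicy.put("maxAttempts", 4.0);                  // service-config numbers are JSON doubles
retryPolicy.put("initialBackoff", "0.5s");
retryPolicy.put("maxBackoff", "10s");
retryPolicy.put("backoffMultiplier", 2.0);
retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

Map<String, Object> methodConfig = new HashMap<>();
methodConfig.put("name", Arrays.asList(
        Collections.singletonMap("service", "example.Service")));   // placeholder service name
methodConfig.put("retryPolicy", retryPolicy);

ManagedChannel channel = ManagedChannelBuilder.forAddress(serverHost, targetPort)
        .usePlaintext()
        .defaultServiceConfig(Collections.singletonMap("methodConfig", Arrays.asList(methodConfig)))
        .enableRetry()
        .build();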
With a 120s keepalive time and without keepAliveWithoutCalls, the connection reset you are getting may be due to intermediary firewalls, proxies, or NAT devices that reset the connection based on their idle settings. It looks like you should still use keepAliveWithoutCalls, with the ping frequency set low enough that it is not treated as a DDoS.
This is useless:
.keepAliveTime(10, TimeUnit.SECONDS)
"We understand this can make the client a potential DDoS risk. We did see errors like ENHANCE_YOUR_CALM."
Without the server permitting the keepalives, the gRPC client will keep increasing the keepalive time on each ENHANCE_YOUR_CALM failure until it is greater than what the server allows (for gRPC servers, this defaults to 5 minutes). You aren't actually testing a 10-second keepalive, because the client will slow down after the failures!
It sounds like you need the client keepalive time to be less than 5 minutes. The only way to make that work is for the server to permit such keepalive times. If using grpc-java for the server (without any L7 proxy), that'd be something like serverBuilder.permitKeepAliveTime(45, SECONDS).
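With grpc-java's Netty transport that could look roughly like this (a sketch; the port, the 45s value, and ServiceImpl are placeholders):

Server server = NettyServerBuilder.forPort(9090)
        .permitKeepAliveTime(45, TimeUnit.SECONDS)   // allow client pings as often as every 45s
        .permitKeepAliveWithoutCalls(true)           // allow pings even when no RPCs are active
        .addService(new ServiceImpl())               // placeholder service implementation
        .build();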
Generally I'd suggest conservative (infrequent) keepalive on the server, just for garbage collection. If you're using an L4 load balancer/proxy that requires a certain activity frequency, then it does make sense to handle that at the server as well.
Often the most aggressive keepalive is needed because of the specific networks between the client and the server, such that only certain clients need more aggressive values. Also, it is generally the clients that are harmed most by any increase in latency; detecting a connection breakage on the server-side doesn't help the clients. But when you configure keepalive on the client, you still need the server to allow those more aggressive keepalives. Otherwise it serves no purpose.