
gRPC Connection Reset (Re-opened)

Open atmel21 opened this issue 6 months ago • 9 comments

Problem Description

We are experiencing a common issue with long-lived gRPC connections being reset after periods of inactivity. The example application initializes gRPC clients during startup and stores them in a configuration map.

After approximately 38 minutes of inactivity, subsequent calls fail with:

UNAVAILABLE: io exception
Caused by: io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

16:26:53 - Successful gRPC call made
17:04:26 - Call fails with "Connection reset by peer" (after ~38 minutes of inactivity)

The gRPC client app and the gRPC server are deployed on Kubernetes in separate namespaces.

Example code:

@Configuration
public class ServiceConfig {
    // Host/port lookup tables for each downstream service, populated from configuration.
    private Map<String, String> hostMap = new HashMap<>();
    private Map<String, Integer> portMap = new HashMap<>();

    private final Map<String, GenericServiceClient> clientMap = new HashMap<>();

    @PostConstruct
    public void init() {
        // Eagerly initialize clients at startup
        hostMap.forEach((key, host) -> {
            GenericServiceClient client = new GenericServiceClient(host, portMap.get(key));
            clientMap.put(key, client);
        });
    }

    public GenericServiceClient getClient(String key) {
        return clientMap.get(key);
    }
}

@Slf4j
public class GenericServiceClient {

    private final String serverHost;
    private final int targetPort;

    private ManagedChannel channel;
    private ServiceGrpc.ServiceBlockingStub blockingStub;

    public GenericServiceClient(String serverHost, int targetPort) {
        this.serverHost = serverHost;
        this.targetPort = targetPort;
        this.init();
    }

    public void init() {
        log.debug("Connecting to Service: {}, port: {}", this.serverHost, this.targetPort);
        this.channel = this.getManagedChannel();
        this.blockingStub = ServiceGrpc.newBlockingStub(this.channel);
    }

    private ManagedChannel getManagedChannel() {
        return ManagedChannelBuilder.forAddress(this.serverHost, this.targetPort)
                .usePlaintext()
                .keepAliveTime(120, TimeUnit.SECONDS)
                .keepAliveTimeout(60, TimeUnit.SECONDS)
                .build();
    }

    public ServiceGrpc.ServiceBlockingStub getBlockingStub() {
        return blockingStub;
    }

    // Some other methods to interact with the service can be added here.
}

Questions:

  • Is it a best practice to create gRPC clients at startup (eager loading) vs. lazily?
  • Are there best practices for handling idle connection timeouts in gRPC when the clients are created at startup?
  • What are the recommended keepalive settings to prevent this issue?
  • Is there a way to detect stale connections before attempting to use them?
  • How do others handle this in production environments with firewalls and load balancers?

atmel21 avatar May 30 '25 15:05 atmel21

Whether to create connections at startup or lazily depends on the use case; creating them at startup obviously incurs some network and CPU cost even when there are no RPCs. To keep connections alive during idle periods without being closed by firewalls and load balancers, you should set keepAliveWithoutCalls(true) on the ManagedChannel. This causes pings to be sent even when there are no RPCs, and if the connection has been closed, the channel will learn of that before making the next RPC. Refer also to this.
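
A minimal sketch of a channel configured that way (host, port, and the keepalive values are illustrative, not a recommendation):

ManagedChannel channel = ManagedChannelBuilder
        .forAddress(serverHost, targetPort)
        .usePlaintext()
        // Send HTTP/2 PING frames even when no RPCs are in flight, so idle-timeout
        // middleboxes are kept from silently dropping the connection and breakage
        // is detected before the next RPC.
        .keepAliveWithoutCalls(true)
        .keepAliveTime(120, TimeUnit.SECONDS)
        .keepAliveTimeout(60, TimeUnit.SECONDS)
        .build();

Note that the server also has to permit pings this frequent, and pings without calls; otherwise it answers with GOAWAY (ENHANCE_YOUR_CALM), as discussed further down the thread.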

kannanjgithub avatar Jun 02 '25 08:06 kannanjgithub

Thanks for the link. We did have keepAliveWithoutCalls(true) set earlier, and still encountered this issue.

Just curious to understand: what happens to the ManagedChannel when "UNAVAILABLE: io exception, Caused by: io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer" is thrown?

Before the next call, does it re-establish the broken connection?

That way we could put a retry block in place.

atmel21 avatar Jun 02 '25 08:06 atmel21

Yes. If the gRPC client detects that the connection is broken via the absence of responses to the ping frames, it will close the connection and reconnect when a new RPC call is made.

With keepAliveWithoutCalls(true) did you observe the same behavior of connection reset after 38 mins?

Can you also check if you are limiting connection age/idle time via the maxConnectionAge or the maxConnectionIdle setting on the server builder?
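
For reference, a hedged sketch of where those limits would be set on a grpc-java server (io.grpc.netty.NettyServerBuilder; the port, durations, and MyServiceImpl are placeholders):

Server server = NettyServerBuilder.forPort(8980)
        // Gracefully cycle connections older than this (the server sends GOAWAY first).
        .maxConnectionAge(30, TimeUnit.MINUTES)
        // Close connections that have carried no RPCs for this long.
        .maxConnectionIdle(10, TimeUnit.MINUTES)
        .addService(new MyServiceImpl())
        .build()
        .start();

If either limit is set, periodic connection closures are expected behavior rather than a symptom of network middleboxes.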

kannanjgithub avatar Jun 02 '25 14:06 kannanjgithub

The same issue is seen with keepAliveWithoutCalls(true).

No other configuration was done, such as limiting connection age/idle time via maxConnectionAge or maxConnectionIdle.

It's a simple managed channel with default settings plus keepAliveTime(120s) and keepAliveTimeout(60s).

Just to clarify again: with our current settings (the gRPC defaults plus keepAliveTime(120s) and keepAliveTimeout(60s)), is the sequence below possible?

Step 1: gRPC encounters an I/O exception with connection reset.
Step 2: gRPC re-establishes the broken connection automatically.
Step 3: If a retry mechanism is in place, the call goes through without any issues?

Regards, Hari

atmel21 avatar Jun 03 '25 02:06 atmel21

Yes, that is correct.

kannanjgithub avatar Jun 03 '25 08:06 kannanjgithub

Seems like this is resolved. If not, comment, and it can be reopened.

ejona86 avatar Jun 26 '25 04:06 ejona86

gRPC KeepAlive Settings Issue

@ejona86 Sorry, we are still struggling with this issue and would appreciate some guidance from the community. Could you provide general guidelines or best practices for configuring gRPC keepAlive settings?

Current setup

  • Client: Spring Boot Starter 5.0.0 (gRPC 1.51.0)
  • Server: Spring Boot Starter 3.5.3 (gRPC 1.28.0)

Server side

  • No settings for PERMIT_KEEPALIVE_WITHOUT_CALLS

Client side (initial config)

managedChannel = ManagedChannelBuilder
    .forAddress(serverHost, targetPort)
    .usePlaintext()
    .keepAliveTime(10, TimeUnit.SECONDS)
    .keepAliveTimeout(5, TimeUnit.SECONDS)
    .keepAliveWithoutCalls(true)
    .build();
  • We understand this can make the client a potential DDoS risk. We did see errors like ENHANCE_YOUR_CALM.

Client side (new config)

managedChannel = ManagedChannelBuilder
    .forAddress(serverHost, targetPort)
    .usePlaintext()
    .keepAliveTime(120, TimeUnit.SECONDS)
    .keepAliveTimeout(60, TimeUnit.SECONDS)
    .build();
  • Note: This is without keepAliveWithoutCalls.

Issue

We are still seeing the same error:

io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
Caused by: java.io.IOException: Connection reset by peer

Next steps we want to try

Server side

  • Enable PERMIT_KEEPALIVE_WITHOUT_CALLS = true

Client side

managedChannel = ManagedChannelBuilder
    .forAddress(serverHost, targetPort)
    .usePlaintext()
    .keepAliveTime(45, TimeUnit.SECONDS)
    .keepAliveTimeout(25, TimeUnit.SECONDS)
    .keepAliveWithoutCalls(true)
    .build();
  • If we encounter errors like Connection reset or StatusRuntimeException, we plan to retry the RPC call.

Keepalive Timeout and Connection Handling Scenario (for the 120/60 case, keepAliveWithoutCalls(false))

T+180s:

  • ✅ Keepalive timeout expires
  • ✅ gRPC marks connection as TRANSIENT_FAILURE
  • ❌ NO MORE PINGS - keepalive stops completely

T+180s+ (next RPC call):

  • Client calls method1RPC()
  • gRPC sees connection is dead
  • Attempts to reconnect (new TCP connection)
    • If reconnect fails → "Connection reset by peer"
      Question: Will gRPC handle this automatically or is it the application's responsibility to handle this error?
    • If reconnect succeeds → Happy path: RPC works, new keepalive cycle starts

I'm really curious: what happens on the gRPC side when it sees a dead or stale connection?

Questions

  1. Are there recommended values or patterns for keepAlive settings to avoid these issues?
  2. Is enabling PERMIT_KEEPALIVE_WITHOUT_CALLS on the server side advisable in this scenario?
  3. Any other suggestions to prevent Connection reset by peer errors?

Thank you for your help!

atmel21 avatar Sep 11 '25 04:09 atmel21

A connection reset will put the channel into TRANSIENT_FAILURE, and it will reconnect with exponential backoff. The RPC retry policy, including backoff parameters (initial backoff, max backoff, backoff multiplier), retryable status codes, and maximum attempts, can be configured in the gRPC service config. These settings exist so that application code doesn't have to do the retrying itself for retriable errors.
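
A hedged sketch of such a retry policy supplied through the channel's default service config (the service name "my.package.MyService" and all values are placeholders; numeric values in the config map must be doubles):

Map<String, Object> retryPolicy = new HashMap<>();
retryPolicy.put("maxAttempts", 4.0);
retryPolicy.put("initialBackoff", "0.5s");
retryPolicy.put("maxBackoff", "10s");
retryPolicy.put("backoffMultiplier", 2.0);
retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

Map<String, Object> methodConfig = new HashMap<>();
methodConfig.put("name", Arrays.asList(Collections.singletonMap("service", "my.package.MyService")));
methodConfig.put("retryPolicy", retryPolicy);

ManagedChannel channel = ManagedChannelBuilder
        .forAddress(serverHost, targetPort)
        .usePlaintext()
        .defaultServiceConfig(Collections.singletonMap("methodConfig", Arrays.asList(methodConfig)))
        .enableRetry()
        .build();

With this in place, an RPC that fails with UNAVAILABLE and matches the policy is retried by the channel itself, so application code does not need its own retry loop for that case.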

With a 120s keepalive time and without keepalive-without-calls, the connection reset you are getting may be due to intermediary firewalls, proxies, or NAT devices that reset the connection based on their idle-timeout settings. It looks like you should still use keepAliveWithoutCalls(true), with the ping frequency set low enough that the server does not treat the pings as abusive.

kannanjgithub avatar Sep 19 '25 07:09 kannanjgithub

This is useless:

.keepAliveTime(10, TimeUnit.SECONDS)

We understand this can make the client a potential DDoS risk. We did see errors like ENHANCE_YOUR_CALM.

Without the server permitting those keepalives, the grpc client will keep increasing its keepalive time after each ENHANCE_YOUR_CALM failure until it is greater than what the server allows (for grpc servers, the default is 5 minutes). You aren't actually testing a 10-second keepalive, because the client will slow down after those failures!

It sounds like you need the client keepalive time to be less than 5 minutes. The only way to make that work is for the server to permit such keepalive times. If using grpc-java for the server (without any L7 proxy), that'd be something like serverBuilder.permitKeepAliveTime(45, SECONDS).
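
For illustration, a hedged sketch of permitting those keepalives on a grpc-java server (io.grpc.netty.NettyServerBuilder; the port, values, and MyServiceImpl are placeholders):

Server server = NettyServerBuilder.forPort(8980)
        // Accept client keepalive pings as frequent as every 45 seconds
        // instead of answering them with GOAWAY (ENHANCE_YOUR_CALM).
        .permitKeepAliveTime(45, TimeUnit.SECONDS)
        // Also accept pings from clients that have no RPCs in flight.
        .permitKeepAliveWithoutCalls(true)
        .addService(new MyServiceImpl())
        .build()
        .start();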


Generally I'd suggest conservative (infrequent) keepalive on the server, just for garbage collection. If you're using a L4 load balancer/proxy that requires a certain activity frequency, then it does make sense to handle that at the server as well.

Often the most aggressive keepalive is needed because of the specific networks between the client and the server, such that only certain clients need more aggressive values. Also, it is generally the clients that are harmed most by any increase in latency; detecting a connection breakage on the server-side doesn't help the clients. But when you configure keepalive on the client, you still need the server to allow those more aggressive keepalives. Otherwise it serves no purpose.

ejona86 avatar Nov 11 '25 17:11 ejona86