
Connection failures during a rolling upgrade of NATS servers


Observed behavior

During a rolling upgrade of our three-node NATS cluster, several of our services were unable to reconnect to the NATS server. The issue occurred unexpectedly and disrupted the normal operation of the affected services. Investigation revealed several errors shared by all services, including those that reconnected successfully (java.net.SocketException: Connection reset, java.net.SocketException: Broken pipe, java.net.ConnectException: Connection refused). However, one exception, java.util.concurrent.CancellationException, was found only on the nodes that failed to re-establish the connection.

The following exception was identified as unique to the failing nodes and might be directly related to the reconnection failures:

java.util.concurrent.CancellationException: null
	at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2510)
	at io.nats.client.impl.NatsConnection.cleanUpPongQueue(NatsConnection.java:721)
	at io.nats.client.impl.NatsConnection.closeSocketImpl(NatsConnection.java:702)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:566)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:540)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)

Upon reviewing the source code of the library implicated in the error, I identified the method referenced by the stack trace:

void cleanUpPongQueue() {
    Future<Boolean> b;
    // Drain all outstanding pong futures and cancel each one.
    while ((b = pongQueue.poll()) != null) {
        try {
            b.cancel(true);
        } catch (CancellationException e) {
            // cancel(true) does not itself throw CancellationException,
            // so this handler appears to be unreachable (see below).
            if (!b.isDone() && !b.isCancelled()) {
                processException(e);
            }
        }
    }
}

It appears unnecessary to wrap the b.cancel(true) call in a try-catch for CancellationException: this exception is not thrown by the cancel call, but by get or join on an already-cancelled CompletableFuture. Further analysis of the code suggested that the CancellationException in the stack trace may not clearly indicate the actual failure point, since it tends to show where the Future was cancelled rather than where it was actively thrown. I identified only one method in the client where get is called on this CompletableFuture:

void tryToConnect(String serverURI, long now) {
    this.currentServer = null;
    try {
    ...
      Future<Boolean> pongFuture = sendPing();
  
      if (pongFuture != null) {
          pongFuture.get(timeoutNanos, TimeUnit.NANOSECONDS);
      }
    ...
    } catch (RuntimeException exp) { // runtime exceptions, like illegalArgs
        processException(exp);
        throw exp;
    } catch (Exception exp) { // everything else
        processException(exp);
        try {
            this.closeSocket(false);
        } catch (InterruptedException e) {
            processException(e);
        }
    }
    ...
}

In this scenario, if a CancellationException arises from the pongFuture.get call, it is caught by the RuntimeException catch block, since CancellationException extends RuntimeException. Notably, there is no specific handling for CancellationException, which is critical because the rethrow can disrupt the normal reconnect logic: if the reconnect path is interrupted without proper exception handling, the reconnection process may halt completely and never re-establish connectivity.
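To make the behavior concrete, here is a minimal standalone sketch (my own illustration, not client code; the class and variable names are mine): cancel(true) returns normally, the subsequent get throws CancellationException, and since CancellationException extends RuntimeException it lands in the first catch block of tryToConnect:

import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class CancellationDemo {
    public static void main(String[] args) {
        CompletableFuture<Boolean> pongFuture = new CompletableFuture<>();

        // cancel(true) does not throw; it only moves the future into the
        // cancelled state, as cleanUpPongQueue does during closeSocket.
        pongFuture.cancel(true);

        try {
            // The exception is raised here, by get() on the cancelled future.
            pongFuture.get(100, TimeUnit.MILLISECONDS);
        } catch (RuntimeException exp) {
            // CancellationException extends RuntimeException, so it is caught
            // here, mirroring the first catch block in tryToConnect.
            System.out.println("caught: " + exp.getClass().getSimpleName());
        } catch (Exception exp) {
            // InterruptedException / TimeoutException would land here instead.
            System.out.println("caught checked: " + exp);
        }
    }
}

Running this prints caught: CancellationException, which matches the rethrow path described above.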

Rethrowing an exception that is not handled further up the call stack is a contentious choice. If there are no objections, I would like to contribute a fix that removes this catch block, letting such exceptions fall through to the generic Exception handler.
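For concreteness, a sketch of what tryToConnect might look like with that catch block removed (an illustration of the proposal, not a tested patch; the elided parts are unchanged from the excerpt above):

void tryToConnect(String serverURI, long now) {
    this.currentServer = null;
    try {
    ...
      Future<Boolean> pongFuture = sendPing();

      if (pongFuture != null) {
          pongFuture.get(timeoutNanos, TimeUnit.NANOSECONDS);
      }
    ...
    } catch (Exception exp) { // now also catches runtime exceptions such as CancellationException
        processException(exp);
        try {
            this.closeSocket(false);
        } catch (InterruptedException e) {
            processException(e);
        }
    }
    ...
}

With this shape, a CancellationException from pongFuture.get would close the socket and let the normal reconnect loop continue instead of escaping the method.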

Expected behavior

Automatically reconnect to the NATS server without any issues during the rolling upgrade.

Server and client version

nats-server 2.10.14
nats.java client 2.15.3

Host environment

JVM: Amazon Corretto 21.0.2

Steps to reproduce

  • Configure the cluster to handle over 2 million subscriptions, with each NATS connection holding approximately 100,000 subscriptions.
  • Simulate a workload of approximately 1,000 messages sent and 1,000 messages received per second for each connection.
  • Ensure the NATS cluster processes approximately 4 MB per second of both incoming and outgoing traffic.
  • Initiate a rolling upgrade of the NATS server.
  • Monitor the services connected to the NATS server for any connection failures; a minimal monitoring sketch follows this list.

Reproducing this behavior in a test environment can be challenging and may depend on specific load conditions that are difficult to replicate precisely. The issue has been observed under high load and may not manifest under lower or different load patterns.
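For monitoring, a minimal sketch of a client that logs connection events and asynchronous errors during the upgrade (the class name and server URLs are placeholders; maxReconnects(-1) keeps retrying indefinitely):

import io.nats.client.Connection;
import io.nats.client.Consumer;
import io.nats.client.ErrorListener;
import io.nats.client.Nats;
import io.nats.client.Options;

public class ReconnectMonitor {
    public static void main(String[] args) throws Exception {
        Options options = new Options.Builder()
                .server("nats://node1:4222")   // placeholder cluster URLs
                .server("nats://node2:4222")
                .server("nats://node3:4222")
                .maxReconnects(-1)             // keep retrying through the upgrade
                .connectionListener((conn, event) ->
                        System.out.println("connection event: " + event))
                .errorListener(new ErrorListener() {
                    @Override
                    public void errorOccurred(Connection conn, String error) {
                        System.out.println("async error: " + error);
                    }
                    @Override
                    public void exceptionOccurred(Connection conn, Exception exp) {
                        // Asynchronous exceptions such as the
                        // CancellationException above surface here.
                        System.out.println("exception: " + exp);
                    }
                    @Override
                    public void slowConsumerDetected(Connection conn, Consumer consumer) {
                        System.out.println("slow consumer detected");
                    }
                })
                .build();

        Connection nc = Nats.connect(options);
        // ... create subscriptions and publish load here ...
    }
}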

ajax-surovskyi-y · May 12 '24 18:05