
Unable to catch etcd cluster failure

Open Solom00n opened this issue 6 years ago • 21 comments

I'm using the 0.3.0 version of jetcd to observe an etcd cluster composed of 3 nodes in Docker. After setting up the Watch object, I tried to stop all 3 nodes so that I could handle errors within my microservice. At this point two things happened:

  1. The Watch object simply did not react to this change for a long time: no exceptions were raised. I wonder how jetcd checks the health of the etcd cluster: is there a mechanism under the hood similar to heartbeat that I can configure to capture the cluster's unavailability much faster?
  2. After a while I received a NullPointerException thrown from line 250 of WatchImpl, inside the onError method (the stream.onCompleted() call):
2019-03-16 18:33:18.168 ERROR 1 --- [ault-executor-1] io.grpc.internal.SerializingExecutor     : Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@3e14b715
java.lang.NullPointerException: null
      at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:250) ~[jetcd-core-0.3.0.jar:na]
      at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434) ~[grpc-stub-1.17.1.jar:1.17.1]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302) ~[jetcd-core-0.3.0.jar:na]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.17.1.jar:1.17.1]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_191]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_191]
      at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]
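
For reference, the setup looks roughly like this (a simplified sketch written against the listener-style watch API; the endpoints, key, and handlers are placeholders):

```java
import java.nio.charset.StandardCharsets;

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;

public class WatchSetup {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder()
                .endpoints("http://etcd-1:2379", "http://etcd-2:2379", "http://etcd-3:2379")
                .build();

        ByteSequence key = ByteSequence.from("/services/my-service", StandardCharsets.UTF_8);

        // I expected the error callback to fire once all three nodes were stopped.
        Watch.Watcher watcher = client.getWatchClient().watch(key, Watch.listener(
                response -> response.getEvents().forEach(System.out::println),
                error -> System.err.println("watch error: " + error)));

        Thread.currentThread().join(); // keep the process alive for the demo
    }
}
```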

It is rare for a system distributed across different geographical areas to become completely unavailable, but even in this scenario I still want to be able to cleanly handle the errors within my application. Thank you.

Solom00n avatar Mar 16 '19 19:03 Solom00n

cc @lburgazzoli, possibly related to watch too.

fanminshi avatar Mar 16 '19 19:03 fanminshi

@fanminshi @lburgazzoli Since I'm able to systematically reproduce this issue, I have created a small repository on GitHub that can help you debug and fix it. See jetcd-issue-532. The README.md file lists all the steps necessary to reproduce this issue and also explains how to reproduce issue #524.

Regarding this specific issue, I suppose a good fix would be to raise a watch error that can be handled within the onError callback.

I hope it is useful. Let me know if you have any problems using the repository.

Solom00n avatar Apr 15 '19 18:04 Solom00n

I am having a similar issue. It is not clear to me how connection issues should be handled by a user of jetcd. I believe the onError in WatchImpl checks for a halt error or a no-leader error, neither of which matches when the cluster is entirely unavailable. It would be useful to propagate this to the caller, because the caller may need to handle the lack of connectivity and act accordingly.

I'm not sure why stream.onCompleted() winds up throwing a NullPointerException, but it appears to stop the reconnect attempts, which means the client will never reconnect after this error. Is there a way for a caller to check whether the connection is still being re-established? It may be beneficial for the caller to defer other work until the connection is established.
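
As a stopgap I've been tracking health myself from the watch callbacks. A sketch is below; the flag and class are application-side inventions of mine, not a jetcd API, and of course they only help once onError actually fires:

```java
import java.util.concurrent.atomic.AtomicBoolean;

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;

class WatchHealth {
    // Application-side flag, flipped from the watch callbacks.
    private final AtomicBoolean healthy = new AtomicBoolean(false);

    Watch.Watcher start(Client client, ByteSequence key) {
        return client.getWatchClient().watch(key, Watch.listener(
                response -> healthy.set(true),   // events are flowing again
                error -> healthy.set(false)));   // callers should back off
    }

    // Callers can poll this before starting work that needs etcd.
    boolean isHealthy() {
        return healthy.get();
    }
}
```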

sysadmind avatar Jun 13 '19 15:06 sysadmind

I ran into this problem as well. Has anyone been able to solve it?

sqtds avatar Aug 26 '19 12:08 sqtds

@Solom00n @sysadmind a new snapshot should be available soon. Do you mind giving it a try before I do a release?

lburgazzoli avatar Sep 09 '19 11:09 lburgazzoli

@lburgazzoli I see you have solved #532 and #524.
I used the latest release, 0.5.0, but when I watch a key and the etcd server shuts down, the same error still appears. Could you show me a demo of how to reconnect to the etcd server? Thanks.

Sanit1 avatar Apr 15 '20 16:04 Sanit1

@Sanit1 I've fixed only this issue; #524 was closed as stale, and as stated in that issue we have a test that performs a restart of etcd.

I'd really appreciate it if you could provide a reproducer so I can test and fix it.

lburgazzoli avatar Apr 16 '20 06:04 lburgazzoli

I'm also running into this issue (create a watch, kill the etcd server). From the logs I believe WatchImpl.onError gets called twice (once for the GOAWAY and again when the connection actually closes), and it's the second invocation that fails, because the first one nulled out stream and there is no null check before stream.onCompleted().

dstephens-wsgc avatar Jul 01 '20 21:07 dstephens-wsgc

I can confirm that modifying WatchImpl.java in jetcd-0.5.3 to include a null check on stream before invoking stream.onCompleted() allows my project to recover from failed servers correctly. However, it also causes a test failure during the build, so it's possible that's fixing a symptom rather than the defect. I'll try to get to the bottom of that and create a PR with a fix and/or a reproducer.

Current steps to reproduce: open a prefix watch on the whole keyspace, kill the etcd server process (etcd 3.4.4 on OSX 10.15.5), and restart etcd. The aforementioned NullPointerException prevents the watcher/client from ever reconnecting, even though etcd resumes operation almost immediately.

EDIT: After an mvn clean install, the NPE fix does not appear to impact any of the unit tests; the earlier failures seem to have been transient. In my current application I'm able to explicitly restart the watch after a failure, and with 0.5.3 things appear to recover. I still want to understand why onError gets invoked twice, since the gRPC documentation implies it should happen only once.
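
For clarity, the change amounts to guarding that call. This is a sketch of the guard only, not the exact upstream diff:

```java
// Inside WatcherImpl.onError(Throwable): the second invocation used to hit
// a stream that the first invocation had already nulled out, so guard it.
if (this.stream != null) {
    this.stream.onCompleted();
    this.stream = null;
}
```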

dstephens-wsgc avatar Jul 02 '20 04:07 dstephens-wsgc

I have seen this NullPointerException too, but I'm not sure that just checking != null is sufficient.

If I remember correctly, I have also seen a similar situation where my main thread calls WatcherImpl.close() and fails with a NullPointerException, because WatcherImpl.onError() was setting the stream to null in a gRPC thread in parallel. There might be missing synchronization between onError() and close(); maybe onError() could call closed.compareAndSet() instead of just checking it. (Note that there are both WatchImpl.close()/closed and WatcherImpl.close()/closed, so it's easy to mix things up.)

@lburgazzoli tagging you here because the issue is "closed".

2020-07-03T10:00:15,732 [grpc-default-executor-1] ERROR io.grpc.internal.SerializingExecutor Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@24dbc909
java.lang.NullPointerException: null
	at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:269) ~[jetcd-core-0.5.4-SNAPSHOT.jar!/:?]
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:460) ~[grpc-stub-1.30.2.jar!/:1.30.2]
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-api-1.30.2.jar!/:1.30.2]
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-api-1.30.2.jar!/:1.30.2]
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-api-1.30.2.jar!/:1.30.2]
	at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:347) ~[jetcd-core-0.5.4-SNAPSHOT.jar!/:?]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.30.2.jar!/:1.30.2]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.30.2.jar!/:1.30.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Excpt0r avatar Jul 03 '20 12:07 Excpt0r

@Excpt0r do you have any time to work on a PR?

lburgazzoli avatar Jul 07 '20 06:07 lburgazzoli

@lburgazzoli depends on the solution. My suggestion to replace closed.get() with closed.compareAndSet(false, true) wasn't a good idea; it would break the resume() functionality called by onError(). I will try a synchronized(WatcherImpl.this.closed) block in both onError() and close(), unless you have a better idea?
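
Roughly what I have in mind, sketched against a simplified WatcherImpl (field names are from my reading of the code; the real class has more state):

```java
// Sketch: serialize onError() and close() on the same monitor so neither
// can observe the other halfway through nulling out the stream.
public void onError(Throwable t) {
    synchronized (closed) {
        if (closed.get()) {
            return;
        }
        if (stream != null) {
            stream.onCompleted();
            stream = null;
        }
        // ...the existing resume()/reschedule logic would stay here...
    }
}

@Override
public void close() {
    synchronized (closed) {
        closed.set(true);
        if (stream != null) {
            stream.onCompleted();
            stream = null;
        }
    }
}
```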

Excpt0r avatar Jul 08 '20 16:07 Excpt0r

I'm fine with the synchronization

lburgazzoli avatar Jul 08 '20 16:07 lburgazzoli

> It is not clear to me how connection issues should be handled by a user of jetcd. [...] Is there a way for a caller to check if the connection is still trying to be established?

I have the same question, and I'm sure the connection is still down.

yingjieLi01 avatar Aug 19 '20 11:08 yingjieLi01

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 24 '20 00:10 github-actions[bot]

@Excpt0r have you had any time to work on this?

lburgazzoli avatar Oct 25 '20 09:10 lburgazzoli

Hi @lburgazzoli, my fix was merged on Jul 13, but maybe it did not cover all aspects (I saw some more related changes on the master branch). Note, though, that the newly linked issue #795 shows jetcd v0.5.3 in the logs; the fix was not yet included there. By the way, are there any plans to release a 0.5.4? I'm still using the snapshot repo.

Excpt0r avatar Oct 25 '20 10:10 Excpt0r

Oh, I forgot, sorry. I'm going to do a new release.

lburgazzoli avatar Oct 26 '20 08:10 lburgazzoli

0.5.4 has been released to Maven Central.

lburgazzoli avatar Oct 26 '20 09:10 lburgazzoli

@yingjieLi01 mind testing it?

lburgazzoli avatar Oct 26 '20 09:10 lburgazzoli

I have a similar problem. I'm using jetcd-core-0.5.7 against etcd server version 3. My application watches etcd, and about 30 seconds after I shut down the etcd server I get the exception below. Maybe it's a jetcd bug; could anyone tell me which version fixes this problem, or how to fix it?

2022-05-14 18:40:32.813 ERROR 9972 --- [ault-executor-3] io.grpc.internal.SerializingExecutor     : Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@1c23d1fc

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@346ee392 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@1037e922[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[na:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) [na:1.8.0_222]
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) ~[na:1.8.0_222]
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) ~[na:1.8.0_222]
	at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator.schedule(MoreExecutors.java:548) ~[guava-20.0.jar:na]
	at io.etcd.jetcd.WatchImpl$WatcherImpl.reschedule(WatchImpl.java:297) ~[jetcd-core-0.5.7.jar:na]
	at io.etcd.jetcd.WatchImpl$WatcherImpl.handleError(WatchImpl.java:286) ~[jetcd-core-0.5.7.jar:na]
	at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:269) ~[jetcd-core-0.5.7.jar:na]
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478) ~[grpc-stub-1.37.0.jar:1.37.0]
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-api-1.37.0.jar:1.37.0]
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-api-1.37.0.jar:1.37.0]
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-api-1.37.0.jar:1.37.0]
	at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:395) ~[jetcd-core-0.5.7.jar:na]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553) ~[grpc-core-1.37.0.jar:1.37.0]
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68) ~[grpc-core-1.37.0.jar:1.37.0]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739) ~[grpc-core-1.37.0.jar:1.37.0]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718) ~[grpc-core-1.37.0.jar:1.37.0]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.37.0.jar:1.37.0]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.37.0.jar:1.37.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_222]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]

And this is my watch function:

```java
public void watch(String prefix, long startIndex, Consumer<EtcdEvent> onEvent, Consumer<Throwable> onError) {
    long nextIndex = startIndex > 0 ? startIndex : 0;
    ByteSequence prefixBytes = ByteSequence.from(prefix, StandardCharsets.UTF_8);
    WatchOption option = WatchOption.newBuilder().withPrefix(prefixBytes).withRevision(nextIndex).build();

    for (; ; ) {
        // Block until the watcher errors out or completes, then recreate it.
        CountDownLatch latch = new CountDownLatch(1);
        try (Watch watchClient = client.getWatchClient(); Watch.Watcher ignored =
                watchClient.watch(prefixBytes, option, response -> handleWatchResponse(response, onEvent), throwable -> {
                    onError.accept(throwable);
                    latch.countDown();
                    LOGGER.info("onexception========");
                }, () -> {
                    latch.countDown();
                    LOGGER.info("complete");
                })) {
            latch.await();
        } catch (EtcdException e) {
            // Rebuild the whole client on etcd-level errors.
            onError.accept(e);
            client = buildClient(getProperties());
        } catch (InterruptedException e) {
            onError.accept(e);
        } catch (Exception e) {
            LOGGER.error("watch error:", e);
        }
        LOGGER.info("create watch client again");
        delay(RETRY_DELAY);
    }
}
```
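
Looking at the trace, I suspect the try-with-resources contributes: client.getWatchClient() seems to hand back a shared Watch client, and closing it on every loop iteration would shut down the internal scheduler that WatcherImpl.reschedule() later needs. A variant I'm considering keeps the Watch client open and closes only the per-iteration Watcher (just a sketch, not verified):

```java
// Assumption: the Watch client is shared, so create it once and never
// close it inside the retry loop; close only the per-iteration Watcher.
Watch watchClient = client.getWatchClient();
for (; ; ) {
    CountDownLatch latch = new CountDownLatch(1);
    try (Watch.Watcher ignored = watchClient.watch(prefixBytes, option,
            response -> handleWatchResponse(response, onEvent),
            throwable -> {
                onError.accept(throwable);
                latch.countDown();
            },
            latch::countDown)) {
        latch.await();
    } catch (Exception e) {
        LOGGER.error("watch error:", e);
    }
    LOGGER.info("create watcher again");
    delay(RETRY_DELAY);
}
```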

junzhiL avatar May 14 '22 10:05 junzhiL

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Dec 03 '22 00:12 github-actions[bot]

Hey @lburgazzoli, any solution for the issue mentioned above by @junzhiL? I am getting this error whenever I close the client.

Apart from this, I was trying a reconnection mechanism for when etcd goes down. We have a 3-pod etcd cluster with a leader/follower setup. The code works fine when a follower goes down; the watch is able to reconnect to the available etcd pods. But when the leader etcd pod goes down, or the etcd cluster loses quorum, the function keeps retrying and is never able to reconnect. Please let me know how to fix this issue. I use Java 8. Please find the code below.

```java
public void watchAndListen(HandlerWrapper<JsonObject> handler) {
    Watch.Listener listen = Watch.listener(watchHandler(handler),
        throwable -> {
            System.out.println("Exception in watch: " + throwable.getCause());
            if (throwable instanceof EtcdException) {
                // Retry mechanism: re-register the watch immediately.
                watchAndListen(handler);
            }
        });
    Watch watchClient = etcdClient.getWatchClient();
    watchClient.watch(storeKey, listen);
}
```
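
A variant with a bounded retry delay instead of immediate recursion, so a cluster without quorum does not turn this into a tight retry loop (a sketch; the scheduler and the five-second delay are my own choices, and it assumes the same etcdClient, storeKey, and watchHandler as above):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

private final ScheduledExecutorService retryExecutor =
    Executors.newSingleThreadScheduledExecutor();

public void watchAndListen(HandlerWrapper<JsonObject> handler) {
    Watch.Listener listen = Watch.listener(watchHandler(handler),
        throwable -> {
            System.out.println("Exception in watch: " + throwable.getCause());
            // Re-register after a delay instead of recursing inline.
            retryExecutor.schedule(() -> watchAndListen(handler), 5, TimeUnit.SECONDS);
        });
    etcdClient.getWatchClient().watch(storeKey, listen);
}
```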

deekshith-n avatar Apr 12 '24 15:04 deekshith-n

I do not have time at this stage, so I would recommend opening a new issue and trying it out with the latest code to see if the problem still exists.

lburgazzoli avatar Apr 12 '24 15:04 lburgazzoli