zookeeper icon indicating copy to clipboard operation
zookeeper copied to clipboard

ZOOKEEPER-4203: Leader swallows the ZooKeeperServer.State.ERROR from Leader.LearnerCnxAcceptor in some concurrency condition

Open functioner opened this issue 4 years ago • 12 comments

The fix of ZOOKEEPER-4203 is implemented. In my local machine, it is able to pass all test cases.

functioner avatar Feb 06 '21 21:02 functioner

I believe that in the CI server, multiple test cases fail due to the network connection binding issue, and the QuorumMainTest fails because the ZK snapshot is compromised by some test cases that have failed. Thus, I think this patch can be merge into master if the reviewers agree with my design.

functioner avatar Feb 06 '21 22:02 functioner

retest please

tisonkun avatar Feb 24 '21 13:02 tisonkun

retest please

tisonkun avatar Feb 24 '21 13:02 tisonkun

@TisonKun I am sorry but the magic word does not work anymore. Let me force a rerun manually

eolivelli avatar Feb 24 '21 13:02 eolivelli

I have restarted the github actions job, but the Jenkins job is too old. I guess you have to rebase/merge with current master and push

eolivelli avatar Feb 24 '21 13:02 eolivelli

One thing I am wondering (which was not introduced by your code) is why we catch and continue on SaslExceptions, but "crash" on other IOExceptions? Not a major point as the server will now recover, but perhaps we could just accept that accept can fail? @eolivelli, @symat?

@ztzg I think it can be explained with: https://github.com/apache/zookeeper/blob/0c98d1d3252e645a5c25bfecaba5ca1f5cd3258e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L523-L536 In the case of SaslException, the exception is not thrown again, then it will not be caught by: https://github.com/apache/zookeeper/blob/0c98d1d3252e645a5c25bfecaba5ca1f5cd3258e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L498-L504 In this case, ZooKeeperServer.State.ERROR is not set, everything works well temporally, and the critical code https://github.com/apache/zookeeper/blob/0c98d1d3252e645a5c25bfecaba5ca1f5cd3258e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L688 finishes, so, if there is any error later, it can be detect by: https://github.com/apache/zookeeper/blob/0c98d1d3252e645a5c25bfecaba5ca1f5cd3258e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L754 Then, the quorum can handle it well.

However, if the IOException is not SaslException, it will be thrown again by (line 528 or 535): https://github.com/apache/zookeeper/blob/0c98d1d3252e645a5c25bfecaba5ca1f5cd3258e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L523-L536 In this case, the ZooKeeperServer.State.ERROR may be set before the critical code: https://github.com/apache/zookeeper/blob/0c98d1d3252e645a5c25bfecaba5ca1f5cd3258e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L688 Then this error state is covered by the running state. The symptom described by this issue occurs.

functioner avatar Mar 06 '21 16:03 functioner

I think the exception in Leader.LearnerCnxAcceptor should not be handled by zkServer, but should be handled internally by Leader.Because Acceptor and zkServer are not closely related, both Acceptor and zkServer are used as internal components of Leader, and Leader combines them to complete the function of leader. When an exception occurs in Leader.LearnerCnxAcceptor,the error status should be fed back to Leader, that is, Leader.isRunning can find this error status. My solution:

  • Leader.isRunning image
  • Make Leader.LearnerCnxAcceptor extends ZooKeeperThread image

Follower can do the same.

lanicc avatar Mar 21 '21 10:03 lanicc

Follower can do the same.

@lanicc, IMO, the root problem this issue exposes is that the exception caught by ZooKeeperCriticalThread is translated into the state change in zkServer. Even if you deal with Acceptor separately in Leader. There could be other similar issues in ZooKeeperCriticalThread, affecting Leader or Follower. Either way, the logic of state change needs improvement, either in zkServer, or in Leader/Follower as you proposed. Maybe the fix you provided here is reasonable, but I think it should be a separate PR, dealing with Acceptor. The current PR should focus on improving the state change logic.

What do you think? @eolivelli @ztzg

functioner avatar Mar 25 '21 00:03 functioner

I've added LOG.warn when this case is happening. You can take a look @eolivelli @ztzg @lanicc Thanks for the test case you provide! It inspires me a lot. I implemented a test case without using Byteman. It's basically overriding lots of methods; thanks @ztzg for the idea. Anybody can approve the CI workflow?

functioner avatar Nov 25 '21 03:11 functioner

@maoling Sorry to bother you. But the other guys might be busy recently. Can you help review my patch & test case? And approve the CI workflow? Thank you!

functioner avatar Nov 30 '21 05:11 functioner

@eolivelli The state check & modification within setState method should be atomic as a whole, so AtomicReference is not enough (including the updateAndGet methods and so on). Now I use synchronized for the methods which read/write state.

functioner avatar Dec 02 '21 01:12 functioner

@eolivelli @ztzg I'm following up on whether we will merge or improve this patch soon? Thanks!

functioner avatar Jul 18 '22 04:07 functioner