bookkeeper icon indicating copy to clipboard operation
bookkeeper copied to clipboard

Add RegistrationSessionExpiredPolicy for session expired event

Open wenbingshen opened this issue 2 years ago • 7 comments

Motivation

The status of the Bookie service, including writable and read-only, needs to be registered with the Registration service. At present, most of the time we use zookeeper as the Registration service, where BookieStateManager is mainly responsible for managing these state switches, including the re-registration logic after processing the session expired event. BookieStateManager uses ZKRegistrationManager to register the state, where zk in ZKRegistrationManager is a subclass of Zookeeper to implement ZooKeeperClient. When session expired occurs,ZooKeeperClient will re-instantiate a new Zookeeper for zk. If the new zk is successfully constructed, the BookieStateManager will re-register the bookie state with zookeeper server.

Here we have a problem, from the existing unit tests and the actual show, we seem to want the bookie to be closed when the session expired occurs, because we can't prove whether the bookie can be in the correct working state just by reinitializing a new zk, refer to comments on the following issue: https://github.com/apache/bookkeeper/issues/3250#issuecomment-1190996890

Like pulsar, we can provide session expired poilicy to choose whether it should continue to reconnect.

Changes

Add RegistrationSessionExpiredPolicy and server-side configuration items: registrationSessionExpiredPolicy Contains two optional values: reconnect, shutdown

To stay consistent with the existing logic, this PR will set the policy to reconnect.

If we want to solve the problem mentioned in #3418 , we can modify the policy to shutdown here at #3418.

wenbingshen avatar Sep 20 '22 06:09 wenbingshen

ping @hangc0276 @dlg99 @Shoothzj @eolivelli PTAL. Thanks.

wenbingshen avatar Sep 20 '22 07:09 wenbingshen

@StevenLuMT Thanks for your review.

  1. what scenarios are these two enumeration values of registrationSessionExpiredPolicy suitable for?

This PR tends to introduce two policies reconnect, shutdown and keeps the same logic as before reconnect. It is found from the current unit testing and actual use, we should tend to use the shutdown policy. But I'm leaning towards this PR to keep the default reconnect policy and instead change to shutdown policy in #3418 to fix falky-test.

Because Bookie is difficult to recover from Session Expired, as described in the comment: https://github.com/apache/bookkeeper/issues/3250#issuecomment-1190996890

So the reconnect policy is still a policy to be implemented and verified. reconnect was introduced to keep the same logic as before. We can use another PR to implement the complete policy of Bookie Reconnect Recovery from Session Expired.

Here I divide it into three steps:

  1. This PR introduces two policy options
  2. Change the default policy to shutdown to solve flaky-test #3418
  3. Implement complete bookie recovery logic for the reconnect policy
  1. please set two testcases for different enumerations to cover different logic

I can introduce a test for shutdown Policy, and reconnect can reuse the previous testBookieServerZKSessionExpireBehaviour test, because reconnect is still an unstable recovery policy that can only be verified until the third point mentioned above is implemented: 3. Implement complete bookie recovery logic for the reconnect policy.

wenbingshen avatar Sep 20 '22 17:09 wenbingshen

Basically agree, we also have another way of removing the ReconnectPolicy, because it's difficult to implement and no one would want to implement it. We can discuss it in maillist. :)

shoothzj avatar Sep 21 '22 00:09 shoothzj

Basically agree, we also have another way of removing the ReconnectPolicy, because it's difficult to implement and no one would want to implement it. We can discuss it in maillist. :)

@Shoothzj Thanks for your review. I'll send a discussion email later.

wenbingshen avatar Sep 21 '22 02:09 wenbingshen

Also needs tests.

dlg99 avatar Sep 21 '22 18:09 dlg99

and you are correct, after session expired one has to re-create ZK client ("once a ZooKeeper object is closed or receives a fatal event (SESSION_EXPIRED and AUTH_FAILED), the ZooKeeper object becomes invalid." see https://zookeeper.apache.org/doc/r3.8.0/zookeeperProgrammers.html ) which is doable but needs to be bubbled up the chain of injected dependencies.

dlg99 avatar Sep 21 '22 18:09 dlg99

@dlg99 Thanks for your review. :)

LGTM but please add it the new config parameter to conf/bk_server.conf and document it there Also needs tests.

I will add it to the new config and doc later. Also I will add test. Thanks. :)

wenbingshen avatar Sep 23 '22 06:09 wenbingshen

IMO, we should keep the session expired policy as reconnect because Bookie is a heavy service with the state, and shutdown it when the session expired, will cause the BookKeeper cluster unstable.

Yes. I think we can introduce these two polices and keep reconnect as the default policy. When users encounter the instability of the bookie service due to session expiration, they can adjust to the shutdown policy, such as @dlg99's comment at https://github.com/apache/bookkeeper/issues/3250#issuecomment-1190996890

In Pulsar, we have changed the default session expired policy from shutdown to reconnect to make the Pulsar service more stable.

Yes, I believe this makes sense as long as the reconnect policy matures.

Currently, Bookie has issues with session expired, and we should enhance the reconnect policy instead of introducing a shutdown policy.

We can keep the reconnect policy as the default, and provide a shutdown policy option until the reconnect is perfected and stabilized.

@hangc0276 PTAL.

wenbingshen avatar Oct 18 '22 02:10 wenbingshen