bookkeeper
bookkeeper copied to clipboard
Add RegistrationSessionExpiredPolicy for session expired event
Motivation
The status of the Bookie service, including writable
and read-only
, needs to be registered with the Registration
service.
At present, most of the time we use zookeeper
as the Registration
service, where BookieStateManager
is mainly responsible for managing these state switches, including the re-registration
logic after processing the session expired event
.
BookieStateManager
uses ZKRegistrationManager
to register the state, where zk
in ZKRegistrationManager
is a subclass of Zookeeper
to implement ZooKeeperClient
. When session expired
occurs,ZooKeeperClient
will re-instantiate a new Zookeeper
for zk
.
If the new zk
is successfully constructed, the BookieStateManager
will re-register the bookie state with zookeeper server.
Here we have a problem, from the existing unit tests and the actual show, we seem to want the bookie to be closed when the session expired occurs, because we can't prove whether the bookie can be in the correct working state just by reinitializing a new zk, refer to comments on the following issue: https://github.com/apache/bookkeeper/issues/3250#issuecomment-1190996890
Like pulsar, we can provide session expired poilicy to choose whether it should continue to reconnect.
Changes
Add RegistrationSessionExpiredPolicy
and server-side configuration items: registrationSessionExpiredPolicy
Contains two optional values: reconnect, shutdown
To stay consistent with the existing logic, this PR will set the policy to reconnect
.
If we want to solve the problem mentioned in #3418 , we can modify the policy to shutdown here at #3418.
ping @hangc0276 @dlg99 @Shoothzj @eolivelli PTAL. Thanks.
@StevenLuMT Thanks for your review.
- what scenarios are these two enumeration values of registrationSessionExpiredPolicy suitable for?
This PR tends to introduce two policies reconnect, shutdown
and keeps the same logic as before reconnect
.
It is found from the current unit testing and actual use, we should tend to use the shutdown
policy.
But I'm leaning towards this PR to keep the default reconnect
policy and instead change to shutdown
policy in #3418 to fix falky-test.
Because Bookie is difficult to recover from Session Expired, as described in the comment: https://github.com/apache/bookkeeper/issues/3250#issuecomment-1190996890
So the reconnect policy is still a policy to be implemented and verified. reconnect
was introduced to keep the same logic as before. We can use another PR to implement the complete policy of Bookie Reconnect Recovery
from Session Expired.
Here I divide it into three steps:
- This PR introduces two policy options
- Change the default policy to
shutdown
to solve flaky-test #3418 - Implement complete bookie recovery logic for the
reconnect
policy
- please set two testcases for different enumerations to cover different logic
I can introduce a test for shutdown
Policy, and reconnect
can reuse the previous testBookieServerZKSessionExpireBehaviour
test, because reconnect
is still an unstable recovery policy that can only be verified until the third point mentioned above is implemented: 3. Implement complete bookie recovery logic for the reconnect
policy.
Basically agree, we also have another way of removing the ReconnectPolicy
, because it's difficult to implement and no one would want to implement it. We can discuss it in maillist. :)
Basically agree, we also have another way of removing the
ReconnectPolicy
, because it's difficult to implement and no one would want to implement it. We can discuss it in maillist. :)
@Shoothzj Thanks for your review. I'll send a discussion email later.
Also needs tests.
and you are correct, after session expired one has to re-create ZK client ("once a ZooKeeper object is closed or receives a fatal event (SESSION_EXPIRED and AUTH_FAILED), the ZooKeeper object becomes invalid." see https://zookeeper.apache.org/doc/r3.8.0/zookeeperProgrammers.html ) which is doable but needs to be bubbled up the chain of injected dependencies.
@dlg99 Thanks for your review. :)
LGTM but please add it the new config parameter to conf/bk_server.conf and document it there Also needs tests.
I will add it to the new config and doc later. Also I will add test. Thanks. :)
IMO, we should keep the session expired policy as
reconnect
because Bookie is a heavy service with the state, and shutdown it when the session expired, will cause the BookKeeper cluster unstable.
Yes. I think we can introduce these two polices and keep reconnect
as the default policy. When users encounter the instability of the bookie service due to session expiration, they can adjust to the shutdown
policy, such as @dlg99's comment at https://github.com/apache/bookkeeper/issues/3250#issuecomment-1190996890
In Pulsar, we have changed the default session expired policy from
shutdown
toreconnect
to make the Pulsar service more stable.
Yes, I believe this makes sense as long as the reconnect policy matures.
Currently, Bookie has issues with session expired, and we should enhance the reconnect policy instead of introducing a shutdown policy.
We can keep the reconnect policy as the default, and provide a shutdown policy option until the reconnect
is perfected and stabilized.
@hangc0276 PTAL.