bookkeeper fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

Motivation

Fix the flaky test BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour by disabling the retry that re-creates the zk instance after the session expires.

Issue #3206 was fixed by PR #3415, but that fix introduced another flaky test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour

According to the investigation, the cause is that ZookeeperClient still creates a new zookeeper instance when the old zookeeper client's session times out.

Re-registration of the bookie's temporary (ephemeral) node and re-creation of the zk instance run asynchronously on two separate threads, so the test sometimes succeeds and sometimes fails, depending on which thread runs first.

When the temporary node re-registration runs before the zk re-instantiation, the node creation uses the old zk instance, which fails with a session timeout error. The bookie service shuts down and the test succeeds.
When the zk re-instantiation runs before the temporary node re-registration, the node creation uses the newly created zk instance and succeeds. The bookie service keeps running normally and the test fails.

        try {
            connectExecutor.submit(clientCreator);
        } catch (RejectedExecutionException ree) {
            if (!closed.get()) {
                logger.error("ZooKeeper reconnect task is rejected : ", ree);
            }
        } catch (Exception t) {
            logger.error("Failed to submit zookeeper reconnect task due to runtime exception : ", t);
        }
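The two orderings described above can be reproduced deterministically with a toy model. This is only a sketch: FakeZk, bookieSurvives and the node path are hypothetical names made up for illustration, and nothing here touches the real ZooKeeper or BookKeeper APIs. It only shows why the outcome depends on which of the two steps runs first.

```java
import java.util.concurrent.atomic.AtomicReference;

class SessionExpireRaceSketch {
    /** Hypothetical stand-in for a zk handle; not the real ZooKeeper API. */
    static final class FakeZk {
        final boolean expired;

        FakeZk(boolean expired) {
            this.expired = expired;
        }

        /** Models creating the bookie's temporary node; fails on an expired session. */
        void createTemporaryNode(String path) {
            if (expired) {
                throw new IllegalStateException("session expired while creating " + path);
            }
        }
    }

    /** Returns true if the bookie keeps running after the old session has expired. */
    static boolean bookieSurvives(boolean reconnectFirst) {
        // The shared handle starts out pointing at the already expired session.
        AtomicReference<FakeZk> zk = new AtomicReference<>(new FakeZk(true));
        Runnable reconnect = () -> zk.set(new FakeZk(false));

        if (reconnectFirst) {
            reconnect.run(); // zk re-instantiation wins the race
        }
        boolean running;
        try {
            // Hypothetical path, used only for this sketch.
            zk.get().createTemporaryNode("/example/bookie-1");
            running = true;  // registration succeeded on the new instance
        } catch (IllegalStateException e) {
            running = false; // registration hit the old instance: bookie shuts down
        }
        if (!reconnectFirst) {
            reconnect.run(); // too late to help the registration
        }
        return running;
    }

    public static void main(String[] args) {
        System.out.println("register first, bookie running: " + bookieSurvives(false));
        System.out.println("reconnect first, bookie running: " + bookieSurvives(true));
    }
}
```

In the real code the two steps run on different threads, so the ordering is not fixed. The test expects the "register first" outcome, where the bookie shuts down.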

Changes

Add a retryExpired flag that indicates whether the client should create a new zk instance and retry after the session times out. Set this flag to false for ZKMetadataBookieDriver. All other ZookeeperClient users either keep the default value true or set it to true explicitly, which is consistent with the original behavior.
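A minimal sketch of how such a flag could gate the re-creation, assuming a simplified client. ToyClient, onSessionExpired and instancesCreated are hypothetical names for illustration only; the actual change lives in ZookeeperClient and may be structured differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RetryExpiredSketch {
    /** Hypothetical client wrapper; models only the flag, not the real ZookeeperClient. */
    static final class ToyClient {
        private final boolean retryExpired;
        // Counts zk instances created so far, starting with the initial one.
        private final AtomicInteger instancesCreated = new AtomicInteger(1);

        ToyClient(boolean retryExpired) {
            this.retryExpired = retryExpired;
        }

        /** Called when the current session is reported as expired. */
        void onSessionExpired() {
            if (!retryExpired) {
                // Keep the expired instance so callers observe the failure.
                return;
            }
            // Models submitting the task that creates a new zk instance.
            instancesCreated.incrementAndGet();
        }

        int instancesCreated() {
            return instancesCreated.get();
        }
    }

    public static void main(String[] args) {
        ToyClient bookieDriverClient = new ToyClient(false); // as for ZKMetadataBookieDriver
        ToyClient defaultClient = new ToyClient(true);       // original behavior
        bookieDriverClient.onSessionExpired();
        defaultClient.onSessionExpired();
        System.out.println("retryExpired=false instances: " + bookieDriverClient.instancesCreated());
        System.out.println("retryExpired=true instances: " + defaultClient.instancesCreated());
    }
}
```

With the flag set to false, no new instance ever appears, so the re-registration can only run against the expired session and the ordering no longer matters.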

Test the behavior of this PR:

Before this PR: Executed the test 10 times, all failed.

After this PR: Executed the test 10 times, all successful.

Jul 23 '22 08:07 wenbingshen

ping @dlg99 @hangc0276 @Shoothzj @eolivelli PTAL.

Jul 23 '22 08:07 wenbingshen

@StevenLuMT PTAL

Jul 23 '22 11:07 HQebupt

can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times. BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.

Jul 26 '22 00:07 dlg99

Fix old workflow; please see #3455 for details.

Aug 24 '22 07:08 StevenLuMT

@dlg99

can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times. BookieStateManager's registration listener will shutdown the bookie if it cannot reconnect immediately so I guess there is still some timing issue between it and the test killing zk session / letting it recover.

When I run the test locally, it is very flaky.

Dec 05 '23 09:12 poorbarcode

@poorbarcode There is a race condition, described in the Motivation section.

Dec 06 '23 01:12 hangc0276