[fix][test] fix Flaky-test: ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer
Fixes #24357
Main Issue: #24357
Motivation
The unit test failed because loadManager4.getBrokerRegistry().unregister(); is not enough to ensure a successful unregister. The root cause is that broker will register it self after unregister. This happens in two place:
- There is a scheduler task named monitorTask in
ExtensibleLoadManagerImpl, it will periodically register self. - When brokerRegistry start, it will register a listener. Once broker unregistered, it will callback
handleMetadataStoreNotificationand try to register itself again.
Modifications
The modifications contains three part:
- changes not in testLoadBalancerServiceUnitTableViewSyncer is just for formatting. Please ignore it!
- changes to resolve this issue is:
// simulate pulsar4 store error which will unregister the broker
var brokerRegistry = loadManager4.getBrokerRegistry();
var brokerLookupDataMetadataCache = (MetadataCache<BrokerLookupData>) spy(
FieldUtils.readField(brokerRegistry, "brokerLookupDataMetadataCache", true));
doReturn(CompletableFuture.failedFuture(new MetadataStoreException("store error"))).when(
brokerLookupDataMetadataCache).put(anyString(), any(), any());
FieldUtils.writeDeclaredField(brokerRegistry, "brokerLookupDataMetadataCache",
brokerLookupDataMetadataCache, true);
brokerRegistry.unregister();
NamespaceName slaMonitorNamespace =
getSLAMonitorNamespace(pulsar4.getBrokerId(), pulsar.getConfiguration());
String slaMonitorTopic = slaMonitorNamespace.getPersistentTopicName("test");
Awaitility.await()
.atMost(10, TimeUnit.SECONDS)
.pollInterval(1, TimeUnit.SECONDS)
.untilAsserted(() -> {
String currentResult = pulsar.getAdminClient().lookups().lookupTopic(slaMonitorTopic);
assertNotNull(currentResult);
log.info("{} Namespace is re-owned by {}", slaMonitorTopic, currentResult);
assertNotEquals(currentResult, pulsar4.getBrokerServiceUrl());
});
- I have found other factor to fail the test when I debug in
testLoadBalancerServiceUnitTableViewSyncer. The syncer is not started and it maybe cause exception sometime. So I changed it to:
String syncerTyp = serviceUnitStateTableViewClassName.equals(ServiceUnitStateTableViewImpl.class.getName()) ?
"SystemTopicToMetadataStoreSyncer" : "MetadataStoreToSystemTopicSyncer";
pulsar.getAdminClient().brokers()
.updateDynamicConfiguration("loadBalancerServiceUnitTableViewSyncer", syncerTyp);
Awaitility.waitAtMost(10, TimeUnit.SECONDS)
.pollInterval(1, TimeUnit.SECONDS)
.untilAsserted(() -> {
assertTrue(pulsar1.getConfiguration().isLoadBalancerServiceUnitTableViewSyncerEnabled());
assertTrue(pulsar2.getConfiguration().isLoadBalancerServiceUnitTableViewSyncerEnabled());
});
// We invoke monitor method to ensure SystemTopicToMetadataStoreSyncer to start or close because syncer will not be started or close after pulsar.getAdminClient().brokers().updateDynamicConfiguration();
.
primaryLoadManager.monitor();
secondaryLoadManager.monitor();
// We invoke monitor method in makePrimaryAsLeader and makeSecondaryAsLeader because monitor can immediately trigger serviceUnitStateTableViewSyncer to start or close without wait.
makeSecondaryAsLeader();
makePrimaryAsLeader();
assertTrue(primaryLoadManager.getServiceUnitStateTableViewSyncer().isActive());
assertFalse(secondaryLoadManager.getServiceUnitStateTableViewSyncer().isActive());
Verifying this change
- [x] Make sure that the change passes the CI checks.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] The metrics
- [ ] Anything that affects deployment
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository:
@berg223 Please add the following content to your PR description and select a checkbox:
- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
/pulsarbot rerun-failure-checks
/pulsarbot run-failure-checks
Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece
Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece
@berg223 Not really any concerns, but I just haven't had a chance to review yet. Regarding testing PRs, you should setup "Personal CI" to be able to run CI on your own while you are working on the PR. It's about enabling GitHub Actions in your own fork and opening a second PR in your own fork for the same PR branch to be able to run tests without approvals. This is explained in https://pulsar.apache.org/contribute/personal-ci/.
@berg223 Regarding some other ExtensibleLoadManagerImpl related tests (in ExtensibleLoadManagerImplTest and in ServiceUnitStateChannelTest), one detail is about inconsistencies in how the test code changes the primary/leader by closing LeaderElectionImpl. LeaderElectionImpl should stop election operations after it's closed, but making that change will break tests. The draft change is https://github.com/apache/pulsar/pull/23995.
I was planning to handle to problem by introducing an internal control interface for LeaderElection
public interface LeaderElectionControl {
void releasePossibleLeaderRole();
void skipElections();
void resumeElections();
}
tests could use this instead of using hacks which they currently rely on. Releasing leadership could be useful also in production code later if there would be a need to let go of the leadership role on one broker. However, this would be initially targeted for tests and not exposed. the WIP changes are in https://github.com/apache/pulsar/compare/master...lhotari:pulsar:lh-fix-LeaderElectionImpl-close-wip
Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece
@berg223 Not really any concerns, but I just haven't had a chance to review yet. Regarding testing PRs, you should setup "Personal CI" to be able to run CI on your own while you are working on the PR. It's about enabling GitHub Actions in your own fork and opening a second PR in your own fork for the same PR branch to be able to run tests without approvals. This is explained in https://pulsar.apache.org/contribute/personal-ci/.
Thanks for detail instructments. It's very valueful for me!
@berg223 Regarding some other ExtensibleLoadManagerImpl related tests (in ExtensibleLoadManagerImplTest and in ServiceUnitStateChannelTest), one detail is about inconsistencies in how the test code changes the primary/leader by closing LeaderElectionImpl. LeaderElectionImpl should stop election operations after it's closed, but making that change will break tests. The draft change is #23995.
I was planning to handle to problem by introducing an internal control interface for LeaderElection
public interface LeaderElectionControl { void releasePossibleLeaderRole(); void skipElections(); void resumeElections(); }tests could use this instead of using hacks which they currently rely on. Releasing leadership could be useful also in production code later if there would be a need to let go of the leadership role on one broker. However, this would be initially targeted for tests and not exposed. the WIP changes are in master...lhotari:pulsar:lh-fix-LeaderElectionImpl-close-wip
We are spying the case that our broker cannot connect to metadata store and the leadership hasn't change. The expected behavior should be that the loadbalanceManager of leader will reassign the bundles to other broker. So we won't use LeaderElectionControl in this case.
However, we can consider the clear and resuseable way to spy this case similiar to LeaderElectionControl. But I don't have a good idea for now.
@lhotari I have setup the Personal CI. And all CI succeed except Upload Coverage (pull_request) . Is it okay to begin the testing after this PR reviewed?