pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix][test] fix Flaky-test: ExtensibleLoadManagerImplTest.testLoadBalancerServiceUnitTableViewSyncer

Open berg223 opened this issue 6 months ago • 9 comments

Fixes #24357

Main Issue: #24357

Motivation

The unit test failed because loadManager4.getBrokerRegistry().unregister(); is not enough to ensure a successful unregister. The root cause is that broker will register it self after unregister. This happens in two place:

  1. There is a scheduler task named monitorTask in ExtensibleLoadManagerImpl, it will periodically register self.
  2. When brokerRegistry start, it will register a listener. Once broker unregistered, it will callback handleMetadataStoreNotification and try to register itself again.

Modifications

The modifications contains three part:

  1. changes not in testLoadBalancerServiceUnitTableViewSyncer is just for formatting. Please ignore it!
  2. changes to resolve this issue is:
               // simulate pulsar4 store error which will unregister the broker
                var brokerRegistry = loadManager4.getBrokerRegistry();
                var brokerLookupDataMetadataCache = (MetadataCache<BrokerLookupData>) spy(
                        FieldUtils.readField(brokerRegistry, "brokerLookupDataMetadataCache", true));
                doReturn(CompletableFuture.failedFuture(new MetadataStoreException("store error"))).when(
                        brokerLookupDataMetadataCache).put(anyString(), any(), any());
                FieldUtils.writeDeclaredField(brokerRegistry, "brokerLookupDataMetadataCache",
                        brokerLookupDataMetadataCache, true);
                brokerRegistry.unregister();

                NamespaceName slaMonitorNamespace =
                        getSLAMonitorNamespace(pulsar4.getBrokerId(), pulsar.getConfiguration());
                String slaMonitorTopic = slaMonitorNamespace.getPersistentTopicName("test");

                Awaitility.await()
                        .atMost(10, TimeUnit.SECONDS)
                        .pollInterval(1, TimeUnit.SECONDS)
                        .untilAsserted(() -> {
                            String currentResult = pulsar.getAdminClient().lookups().lookupTopic(slaMonitorTopic);
                            assertNotNull(currentResult);
                            log.info("{} Namespace is re-owned by {}", slaMonitorTopic, currentResult);
                            assertNotEquals(currentResult, pulsar4.getBrokerServiceUrl());
                        });
  1. I have found other factor to fail the test when I debug in testLoadBalancerServiceUnitTableViewSyncer. The syncer is not started and it maybe cause exception sometime. So I changed it to:
String syncerTyp = serviceUnitStateTableViewClassName.equals(ServiceUnitStateTableViewImpl.class.getName()) ?
                "SystemTopicToMetadataStoreSyncer" : "MetadataStoreToSystemTopicSyncer";
        pulsar.getAdminClient().brokers()
                .updateDynamicConfiguration("loadBalancerServiceUnitTableViewSyncer", syncerTyp);
        Awaitility.waitAtMost(10, TimeUnit.SECONDS)
                .pollInterval(1, TimeUnit.SECONDS)
                .untilAsserted(() -> {
                    assertTrue(pulsar1.getConfiguration().isLoadBalancerServiceUnitTableViewSyncerEnabled());
                    assertTrue(pulsar2.getConfiguration().isLoadBalancerServiceUnitTableViewSyncerEnabled());
                });

        // We invoke monitor method to ensure SystemTopicToMetadataStoreSyncer to start or close because syncer will not be started or close after pulsar.getAdminClient().brokers().updateDynamicConfiguration();
.
        primaryLoadManager.monitor();
        secondaryLoadManager.monitor();

       // We invoke monitor method in makePrimaryAsLeader and makeSecondaryAsLeader because monitor can immediately trigger serviceUnitStateTableViewSyncer to start or close without wait.
        makeSecondaryAsLeader();
        makePrimaryAsLeader();
        assertTrue(primaryLoadManager.getServiceUnitStateTableViewSyncer().isActive());
        assertFalse(secondaryLoadManager.getServiceUnitStateTableViewSyncer().isActive());

Verifying this change

  • [x] Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository:

berg223 avatar Jun 02 '25 09:06 berg223

@berg223 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

github-actions[bot] avatar Jun 02 '25 09:06 github-actions[bot]

/pulsarbot rerun-failure-checks

lhotari avatar Jun 02 '25 15:06 lhotari

/pulsarbot run-failure-checks

berg223 avatar Jun 02 '25 18:06 berg223

Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece

berg223 avatar Jun 06 '25 01:06 berg223

Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece

@berg223 Not really any concerns, but I just haven't had a chance to review yet. Regarding testing PRs, you should setup "Personal CI" to be able to run CI on your own while you are working on the PR. It's about enabling GitHub Actions in your own fork and opening a second PR in your own fork for the same PR branch to be able to run tests without approvals. This is explained in https://pulsar.apache.org/contribute/personal-ci/.

lhotari avatar Jun 06 '25 07:06 lhotari

@berg223 Regarding some other ExtensibleLoadManagerImpl related tests (in ExtensibleLoadManagerImplTest and in ServiceUnitStateChannelTest), one detail is about inconsistencies in how the test code changes the primary/leader by closing LeaderElectionImpl. LeaderElectionImpl should stop election operations after it's closed, but making that change will break tests. The draft change is https://github.com/apache/pulsar/pull/23995.

I was planning to handle to problem by introducing an internal control interface for LeaderElection

public interface LeaderElectionControl {
    void releasePossibleLeaderRole();
    void skipElections();
    void resumeElections();
}

tests could use this instead of using hacks which they currently rely on. Releasing leadership could be useful also in production code later if there would be a need to let go of the leadership role on one broker. However, this would be initially targeted for tests and not exposed. the WIP changes are in https://github.com/apache/pulsar/compare/master...lhotari:pulsar:lh-fix-LeaderElectionImpl-close-wip

lhotari avatar Jun 06 '25 07:06 lhotari

Could you please approve the PR to be tested? Is there any concerns? cc @lhotari @nodece

@berg223 Not really any concerns, but I just haven't had a chance to review yet. Regarding testing PRs, you should setup "Personal CI" to be able to run CI on your own while you are working on the PR. It's about enabling GitHub Actions in your own fork and opening a second PR in your own fork for the same PR branch to be able to run tests without approvals. This is explained in https://pulsar.apache.org/contribute/personal-ci/.

Thanks for detail instructments. It's very valueful for me!

berg223 avatar Jun 08 '25 06:06 berg223

@berg223 Regarding some other ExtensibleLoadManagerImpl related tests (in ExtensibleLoadManagerImplTest and in ServiceUnitStateChannelTest), one detail is about inconsistencies in how the test code changes the primary/leader by closing LeaderElectionImpl. LeaderElectionImpl should stop election operations after it's closed, but making that change will break tests. The draft change is #23995.

I was planning to handle to problem by introducing an internal control interface for LeaderElection

public interface LeaderElectionControl {
    void releasePossibleLeaderRole();
    void skipElections();
    void resumeElections();
}

tests could use this instead of using hacks which they currently rely on. Releasing leadership could be useful also in production code later if there would be a need to let go of the leadership role on one broker. However, this would be initially targeted for tests and not exposed. the WIP changes are in master...lhotari:pulsar:lh-fix-LeaderElectionImpl-close-wip

We are spying the case that our broker cannot connect to metadata store and the leadership hasn't change. The expected behavior should be that the loadbalanceManager of leader will reassign the bundles to other broker. So we won't use LeaderElectionControl in this case.

However, we can consider the clear and resuseable way to spy this case similiar to LeaderElectionControl. But I don't have a good idea for now.

berg223 avatar Jun 08 '25 08:06 berg223

@lhotari I have setup the Personal CI. And all CI succeed except Upload Coverage (pull_request) . Is it okay to begin the testing after this PR reviewed?

berg223 avatar Jun 08 '25 09:06 berg223