[Bug] Compaction of __change_events topic is blocking forceful namespace/topic deletion
Search before asking
- [x] I searched in the issues and found nothing similar.
Read release policy
- [x] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
Latest master
Minimal reproduce step
I was investigating #22736. Sometimes, tests fail because MockedPulsarServiceBaseTest.deleteNamespaceWithRetry times out. This seems related to a compaction of the persistent://public/default/__change_events topic.
To reproduce the timeout of deleteNamespaceWithRetry, add the following test to NamespaceAuthZTest:
@Test
public void reproduce() throws Exception {
    final String namespace = "public/default";
    final String topic = "persistent://" + namespace + "/" + UUID.randomUUID();
    // Create and forcefully delete a topic in the namespace.
    superUserAdmin.topics().createNonPartitionedTopic(topic);
    superUserAdmin.topics().delete(topic, true);
    final AtomicInteger succeeded = new AtomicInteger(0);
    // Thread 1: forcefully delete the namespace.
    new Thread(() -> {
        try {
            deleteNamespaceWithRetry(namespace, true, superUserAdmin);
            succeeded.incrementAndGet();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }).start();
    // Thread 2: trigger a compaction of __change_events shortly after the deletion was requested.
    new Thread(() -> {
        try {
            Thread.sleep(400);
            superUserAdmin.topics().triggerCompaction("persistent://public/default/__change_events");
            succeeded.incrementAndGet();
        } catch (PulsarAdminException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }).start();
    // Both threads must finish within 60 seconds for the test to pass.
    await().atMost(60, TimeUnit.SECONDS).untilAsserted(() -> assertEquals(succeeded.get(), 2));
}
The test calls deleteNamespaceWithRetry in one thread and triggers a compaction of the persistent://public/default/__change_events topic shortly after the namespace deletion was requested. For me, a delay of 400ms works well and every test run reproduces the issue.
To avoid confusion, remove the call to deleteNamespaceWithRetry from the after() method in the test class.
What did you expect to see?
I would expect the above test to pass.
What did you see instead?
The test fails: the namespace deletion does not complete, so the await times out.
Anything else?
From my understanding, the problem is the following:
- PersistentTopic.asyncDeleteCursorWithCleanCompactionLedger() is blocked at currentCompaction.handle(). The compaction is already running, so it is waiting until the compaction finishes.
- AbstractTwoPhaseCompactor.phaseTwoLoop() is blocked at reader.readNextAsync() because it cannot read from the topic anymore.
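The suspected dependency cycle can be illustrated with plain CompletableFutures. This is only a minimal sketch of my understanding, not Pulsar code: the class and method names are hypothetical stand-ins for reader.readNextAsync(), the phase-two loop, and the cleanup chained on currentCompaction.handle(), and it assumes a read on the deleted topic never completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-ins only; none of these methods exist in Pulsar.
class CompactionDeletionSketch {

    // Stand-in for reader.readNextAsync(): assumed to never complete
    // once the topic can no longer be read.
    static CompletableFuture<String> readNext(boolean readStalls) {
        return readStalls ? new CompletableFuture<>()
                          : CompletableFuture.completedFuture("entry");
    }

    // Stand-in for the phase-two loop: finishes only after the read returns.
    static CompletableFuture<Void> compaction(boolean readStalls) {
        return readNext(readStalls).thenAccept(entry -> { });
    }

    // Stand-in for the cleanup chained on currentCompaction.handle(...):
    // it runs only once the compaction future has completed.
    static CompletableFuture<Void> deleteAfterCompaction(CompletableFuture<Void> currentCompaction) {
        return currentCompaction.handle((ignored, error) -> (Void) null);
    }

    // Returns true if the deletion finishes within 200 ms, false if it is stuck.
    static boolean deletionCompletes(boolean readStalls) throws Exception {
        CompletableFuture<Void> deletion = deleteAfterCompaction(compaction(readStalls));
        try {
            deletion.get(200, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read completes -> deletion completes: " + deletionCompletes(false));
        System.out.println("read stalls    -> deletion completes: " + deletionCompletes(true));
    }
}
```

In this sketch the deletion finishes when the read completes, and never finishes when the read stalls, since nothing completes or cancels the pending read.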
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
There's #23980 in this area; however, I don't know if it will prevent the issue.
I think #23980 will not prevent the issue because it neither prevents a compaction and topic deletion from running simultaneously nor interrupts a compaction when the topic is deleted forcefully.
@pdolif That's correct, the issue still reproduces, as can be seen with your reproducer.
@poorbarcode Since you worked on #23980, would you be interested in addressing this issue? It seems that there are remaining issues in deleting a namespace. You can add the test method from the description to NamespaceAuthZTest to reproduce the issue.
potentially related PR #24366