[fix] [broker] Part-2: Replicator can not created successfully due to an orphan replicator in the previous topic owner
Motivation
There is a race condition that leaves an orphan replicator on the original owner of a topic. As a result, the new owner of the topic cannot start a replicator, failing with `org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic`.
Scenario 1
- Thread-1: start/restart the producer of the replicator.
- Thread-2: unloading bundles.
Scenario 2
- Thread-1: start a new replicator after `replication_clusters` is updated.
- Thread-2: unloading bundles.
Scenario 1 was solved by https://github.com/apache/pulsar/pull/21946; the current PR focuses on Scenario 2.
Steps of Scenario 2
| time | thread enable replication | thread unload bundle |
|---|---|---|
| 1 | Enabled replication | |
| 2 | | Mark topic as closing |
| 3 | | Skip `replicator.disconnect()` because `topic.replicators` is empty |
| 4 | Initialize cursor `pulsar.repl` | |
| 5 | Start producer | |
| 6 | Set `replicator.stat` --> `Starting` | |
| 7 | Producer is created successfully; set `replicator.stat` --> `Started` | |
| 8 | Trigger a `readMoreEntries`; since there are no entries to read, the request stays pending | |
| 9 | | Close cursor `pulsar.repl` |
| 10 | | Close managed ledger |
| 11 | An orphan replicator remains, and the next topic owner cannot start a replicator due to `Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic` | |
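The interleaving above can be replayed deterministically in a toy model, which may help in reasoning about why the orphan blocks the next owner. This is only a minimal sketch: `RemoteTopic`, `OwnerTopic`, and `nextOwnerCanStart` are hypothetical names invented for illustration and are not Pulsar classes, the cluster names in the producer name are placeholders, and the cursor and managed ledger are not modeled at all.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these classes are Pulsar's real ones.
class OrphanReplicatorSketch {

    /** Stand-in for the remote topic, which rejects duplicate producer names. */
    static final class RemoteTopic {
        private final Set<String> producers = new HashSet<>();

        void connect(String producerName) {
            if (!producers.add(producerName)) {
                throw new IllegalStateException(
                        "Producer with name '" + producerName + "' is already connected to topic");
            }
        }

        void disconnect(String producerName) {
            producers.remove(producerName);
        }
    }

    /** Stand-in for the topic as seen by one owner broker. */
    static final class OwnerTopic {
        private final RemoteTopic remote;
        private final List<String> replicators = new ArrayList<>();
        boolean closing; // set at time 2; the racing start does not observe it

        OwnerTopic(RemoteTopic remote) {
            this.remote = remote;
        }

        void startReplicator(String producerName) {
            remote.connect(producerName);
            replicators.add(producerName);
        }

        void disconnectReplicators() {
            for (String name : replicators) {
                remote.disconnect(name);
            }
            replicators.clear();
        }
    }

    /** Replays the table's steps; returns whether the next owner can start its replicator. */
    static boolean nextOwnerCanStart(boolean disconnectAfterLedgerClose) {
        String producerName = "pulsar.repl.local-->remote"; // placeholder cluster names
        RemoteTopic remote = new RemoteTopic();
        OwnerTopic oldOwner = new OwnerTopic(remote);

        oldOwner.closing = true;                // time 2
        oldOwner.disconnectReplicators();       // time 3: nothing to disconnect yet
        oldOwner.startReplicator(producerName); // times 4-8: producer gets connected
        // times 9-10: cursor and managed ledger are closed (not modeled here)
        if (disconnectAfterLedgerClose) {
            oldOwner.disconnectReplicators();   // the extra disconnect after the ledger close
        }

        OwnerTopic newOwner = new OwnerTopic(remote);
        try {
            newOwner.startReplicator(producerName);
            return true;
        } catch (IllegalStateException e) {
            return false;                       // time 11: the orphan still holds the name
        }
    }

    public static void main(String[] args) {
        System.out.println(nextOwnerCanStart(false)); // prints false
        System.out.println(nextOwnerCanStart(true));  // prints true
    }
}
```

In this model the only difference between the two runs is whether replicators are disconnected again after the ledger close, which is the ordering change this PR proposes.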
Since the scenario is too complex, I can not add a test.
TODO: reproduce the Scenario 2 locally.
Modifications
- Call `replicators.disconnect` after the managed ledger is closed. It would prevent the new cursor (`pulsar.dedup`) from being created.
- `topic.close` will complete after `replicators.disconnect`. This prevents the new replicator on the next owner broker of the topic from failing to create its internal producer with `org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic`.
- After https://github.com/apache/pulsar/pull/21947 the operation `replicator.producer.close` will no longer fail.
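The ordering described in the list above could be expressed as a chain of futures, roughly as follows. This is a hedged sketch rather than the actual change: `CloseOrderSketch` and its methods are hypothetical stand-ins, and each step completes immediately here, whereas the real operations are asynchronous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a topic; it only records the order of the close steps.
class CloseOrderSketch {
    final List<String> events = new ArrayList<>();

    CompletableFuture<Void> closeManagedLedger() {
        events.add("ledger-closed");
        return CompletableFuture.completedFuture(null);
    }

    CompletableFuture<Void> disconnectReplicators() {
        events.add("replicators-disconnected");
        return CompletableFuture.completedFuture(null);
    }

    /** Topic close completes only after the ledger is closed and replicators are disconnected. */
    CompletableFuture<Void> close() {
        return closeManagedLedger()
                .thenCompose(ignore -> disconnectReplicators())
                .thenRun(() -> events.add("topic-closed"));
    }

    static List<String> run() {
        CloseOrderSketch topic = new CloseOrderSketch();
        topic.close().join();
        return topic.events;
    }

    public static void main(String[] args) {
        // prints [ledger-closed, replicators-disconnected, topic-closed]
        System.out.println(run());
    }
}
```

Chaining with `thenCompose` means the future returned by `close()` cannot complete before the disconnect step has finished, so a caller waiting on it would not hand the topic to the next owner while a replicator producer is still connected.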
Documentation
- [ ] `doc`
- [ ] `doc-required`
- [x] `doc-not-needed`
- [ ] `doc-complete`
Matching PR in forked repository
PR in forked repository: x