[fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner
Motivation
There is a race condition that makes an orphan replicator in the original owner of a topic, and causes the new owner of the topic can not start a replicator due to org.apache.pulsar.broker.service.BrokerServiceException$NamingException Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic.
Scenario 1
- Thread-1: start/restart the producer of the replicator.
- Thread-2: unloading bundles.
Scenario 2
- Thread-1: start a new replicator after updated
replication_clusters. - Thread-2: unloading bundles.
Current PR is focusing on Scenario 1.
Steps of Scenario 1
| time | thread start replicator |
thread unload bundle |
|---|---|---|
| 1 | Initialize cursor pulsar.repl |
|
| 2 | Start producer | |
| 3 | Start producer failed, add a scheduled task to retry | |
| 4 | Mark topic as closing |
|
| 4 | Close clients: replicator.disconnect |
|
| 5 | Skip to close the producer because the producer is null, and set replicator.stat --> Stopped |
|
| 6 | Retry to start the producer | |
| 7 | Set replicator.stat --> Starting |
|
| 8 | Create producer success and set replicator.stat --> Started |
|
| 9 | Trigger a readMoreEntries, since there is no entries to read, just pending this request |
|
| 10 | Close cursor pulsar.repl |
|
| 11 | Close managed ledger | |
| 12 | An orphan replicator is there, and the next topic owner could not start a replicator due to Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic |
Modifications
- Split the state of
Replicator.State.StoppedintoProducer_StoppedandClosed.- The producer can be restart again after the producer closed due to read entries error or ack messages error.
- The Replicator can not be started again after it was closed due to the topic being closed or having disabled replication.
- Add a new method
terminateto close the Replicator.- The old method
disconnectonly used to close the internal producer.
- The old method
A case that hit this issue
2024-01-19T22:18:24,498+0000 [pulsar-io-4-23] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://xxx_stress.Default-226] [pulsar.repl.c1-->c2] Failed to create producer: {"errorMsg":"org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.c1-->c2' is already connected to topic","reqId":885581189144276683, "remote":"xxx/xxx:6651", "local":"/xxx:59416"}
...
Picture-1: An orphan producer was left in old broker, it is not associated with any topic/replicator
Picture-2: After the topic is transferred to new broker, it can not start a new Replicator successfully
Since the scenario is too complex, I can not add a test. But I reproduced the Scenario 1 locally.
https://github.com/apache/pulsar/pull/21948 fixes the following issues:
- How to perfectly start a new Replicator after the replication has been enabled again.
- How to perfectly start a new Replicator when calling
topic.unfenceTopicToResumeaftertopic.closefailed.
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository: x
@poorbarcode Does this PR fix the issue mentioned in https://github.com/apache/pulsar/pull/21203 ?
@Jason918
@poorbarcode Does this PR fix the issue mentioned in https://github.com/apache/pulsar/pull/21203 ?
Yes, the current PR also fixed the issue that https://github.com/apache/pulsar/pull/21203 tries to fix.
@codelipenghui
And it looks like we can simplify the fix by adding a new method terminate() to the replicator so that we don't need to mix the closeProducer and closeReplicator logic.
Sure, renamed disconnect to terminate.
Is it possible to add a test to cover this case?
I will try to write a test laster
@codelipenghui
Is it possible to add a test to cover this case?
The test testConcurrencyOfUnloadBundleAndRecreateProducer has been added, please take a look ❤️
/pulsarbot run-failure-checks
Rebase master
Because there are too many conflicts and there are no new releases for 2.11, do not cherry-pick this PR into 2.11, note: the following PRs need to be cherry-picked together when performing cherry-picking.
- https://github.com/apache/pulsar/pull/21946
- https://github.com/apache/pulsar/pull/22537
- https://github.com/apache/pulsar/pull/21948
- https://github.com/apache/pulsar/pull/17524
- https://github.com/apache/pulsar/pull/22594