[fix] [broker] Part-2: Replicator cannot be created successfully due to an orphan replicator on the previous topic owner

Open poorbarcode opened this issue 1 year ago • 0 comments

Motivation

There is a race condition that leaves an orphan replicator on the original owner of a topic. As a result, the new owner of the topic cannot start a replicator and fails with org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic.

Scenario 1

  • Thread-1: start/restart the producer of the replicator.
  • Thread-2: unloading bundles.

Scenario 2

  • Thread-1: start a new replicator after replication_clusters is updated.
  • Thread-2: unloading bundles.

Scenario 1 was solved by https://github.com/apache/pulsar/pull/21946. The current PR focuses on Scenario 2.

Steps of Scenario 2

| time | thread enable replication | thread unload bundle |
|------|---------------------------|----------------------|
| 1 | Enabled replication | |
| 2 | | Mark topic as closing |
| 3 | | Skip replicator.disconnect() because topic.replicators is empty |
| 4 | Initialize cursor pulsar.repl | |
| 5 | Start producer | |
| 6 | Set replicator.stat --> Starting | |
| 7 | Create producer success and set replicator.stat --> Started | |
| 8 | Trigger a readMoreEntries; since there are no entries to read, the request stays pending | |
| 9 | | Close cursor pulsar.repl |
| 10 | | Close managed ledger |

At time 11, an orphan replicator remains on the original owner, and the next topic owner cannot start a replicator due to Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic.
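The interleaving above can be replayed deterministically in a toy model. The following is only a sketch under stated assumptions: ToyRemoteTopic, ToyTopicOwner, OrphanReplicatorDemo, and the cluster names c1 and c2 are hypothetical and are not Pulsar classes. The only behavior modeled is that a producer name must be unique on the remote topic and that the unload path disconnects only the replicators it can see at that moment.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these classes exist in Pulsar.
class ToyRemoteTopic {
    private final Set<String> producerNames = new HashSet<>();

    // A producer name must be unique per topic; a duplicate is rejected.
    boolean connect(String producerName) {
        return producerNames.add(producerName);
    }

    void disconnect(String producerName) {
        producerNames.remove(producerName);
    }
}

class ToyTopicOwner {
    // remote cluster -> replicator producer name
    final Map<String, String> replicators = new LinkedHashMap<>();
    boolean closing;
    boolean ledgerClosed;

    // Unload thread, times 2-3: mark closing, disconnect the replicators known right now.
    void beginUnload(ToyRemoteTopic remote) {
        closing = true;
        for (String name : replicators.values()) {
            remote.disconnect(name);
        }
        replicators.clear();
    }

    // Replication thread, times 4-7: in this toy the start path does not check `closing`.
    boolean startReplicator(String local, String remoteCluster, ToyRemoteTopic remote) {
        String name = "pulsar.repl." + local + "-->" + remoteCluster;
        if (!remote.connect(name)) {
            return false; // "already connected to topic"
        }
        replicators.put(remoteCluster, name);
        return true;
    }

    // Unload thread, times 9-10: close cursor and managed ledger; replicators are not revisited.
    void finishUnload() {
        ledgerClosed = true;
    }
}

class OrphanReplicatorDemo {
    // Returns {old owner started its replicator, new owner started its replicator}.
    static boolean[] runScenario2() {
        ToyRemoteTopic remote = new ToyRemoteTopic();
        ToyTopicOwner oldOwner = new ToyTopicOwner();
        ToyTopicOwner newOwner = new ToyTopicOwner();

        oldOwner.beginUnload(remote);                                        // times 2-3
        boolean oldStarted = oldOwner.startReplicator("c1", "c2", remote);   // times 4-7
        oldOwner.finishUnload();                                             // times 9-10
        boolean newStarted = newOwner.startReplicator("c1", "c2", remote);   // time 11
        return new boolean[] {oldStarted, newStarted};
    }

    public static void main(String[] args) {
        boolean[] result = runScenario2();
        System.out.println("old owner replicator started: " + result[0]);
        System.out.println("new owner replicator started: " + result[1]);
    }
}
```

In this model the old owner's replicator starts after the unload path has already skipped the disconnect step, so its producer name stays registered and the new owner's attempt is rejected.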

Since the scenario is too complex, I cannot add a test.

TODO: reproduce the Scenario 2 locally.

Modifications

  • Call replicators.disconnect after the managed ledger is closed. This prevents a new cursor (pulsar.dedup) from being created.
  • topic.close now completes only after replicators.disconnect. This prevents the new replicator on the topic's next owner broker from failing to create its internal producer with org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic.
  • After https://github.com/apache/pulsar/pull/21947, the operation replicator.producer.close will no longer fail.
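The ordering described in the list above can be sketched with a CompletableFuture chain. This is a minimal illustration, not the PR's actual code: ToyClosingTopic and its methods are hypothetical names, and the assumption that a closed managed ledger rejects new replicators is a simplification of "prevents a new cursor from being created".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy model only: not Pulsar's topic implementation.
class ToyClosingTopic {
    final List<String> events = new ArrayList<>();      // records the order of operations
    final List<String> replicators = new ArrayList<>(); // names of started replicators
    boolean ledgerClosed;

    CompletableFuture<Void> closeManagedLedger() {
        ledgerClosed = true;
        events.add("ledger-closed");
        return CompletableFuture.completedFuture(null);
    }

    // Assumption: once the ledger is closed, no new cursor (and so no new replicator) can be created.
    boolean tryStartReplicator(String name) {
        if (ledgerClosed) {
            return false;
        }
        replicators.add(name);
        return true;
    }

    // Runs after the ledger is closed, so it sees every replicator that managed to start.
    CompletableFuture<Void> disconnectReplicators() {
        for (String name : replicators) {
            events.add("disconnected:" + name);
        }
        replicators.clear();
        return CompletableFuture.completedFuture(null);
    }

    // close() completes only after the replicators have been disconnected.
    CompletableFuture<Void> close() {
        return closeManagedLedger()
                .thenCompose(ignore -> disconnectReplicators())
                .thenRun(() -> events.add("topic-closed"));
    }
}

class ToyClosingTopicDemo {
    public static void main(String[] args) {
        ToyClosingTopic topic = new ToyClosingTopic();
        topic.tryStartReplicator("pulsar.repl.c1-->c2"); // a replicator that started late
        topic.close().join();
        System.out.println(topic.events);
    }
}
```

Because the disconnect step is chained after the ledger close, a replicator that started just before the close is still disconnected, and none can start afterwards in this model.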

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: x

poorbarcode · Jan 22 '24 21:01