pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix] [broker] Part-1: Replicator can not created successfully due to an orphan replicator in the previous topic owner

Open poorbarcode opened this issue 1 year ago • 5 comments

Motivation

There is a race condition that makes an orphan replicator in the original owner of a topic, and causes the new owner of the topic can not start a replicator due to org.apache.pulsar.broker.service.BrokerServiceException$NamingException Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic.

Scenario 1

  • Thread-1: start/restart the producer of the replicator.
  • Thread-2: unloading bundles.

Scenario 2

  • Thread-1: start a new replicator after updated replication_clusters.
  • Thread-2: unloading bundles.

Current PR is focusing on Scenario 1.

Steps of Scenario 1

time thread start replicator thread unload bundle
1 Initialize cursor pulsar.repl
2 Start producer
3 Start producer failed, add a scheduled task to retry
4 Mark topic as closing
4 Close clients: replicator.disconnect
5 Skip to close the producer because the producer is null, and set replicator.stat --> Stopped
6 Retry to start the producer
7 Set replicator.stat --> Starting
8 Create producer success and set replicator.stat --> Started
9 Trigger a readMoreEntries, since there is no entries to read, just pending this request
10 Close cursor pulsar.repl
11 Close managed ledger
12 An orphan replicator is there, and the next topic owner could not start a replicator due to Producer with name 'pulsar.repl.{local_cluster}-->{remote_cluster}' is already connected to topic

Modifications

  • Split the state of Replicator.State.Stopped into Producer_Stopped and Closed.
    • The producer can be restart again after the producer closed due to read entries error or ack messages error.
    • The Replicator can not be started again after it was closed due to the topic being closed or having disabled replication.
  • Add a new method terminate to close the Replicator.
    • The old method disconnect only used to close the internal producer.

A case that hit this issue

2024-01-19T22:18:24,498+0000 [pulsar-io-4-23] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://xxx_stress.Default-226] [pulsar.repl.c1-->c2] Failed to create producer: {"errorMsg":"org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'pulsar.repl.c1-->c2' is already connected to topic","reqId":885581189144276683, "remote":"xxx/xxx:6651", "local":"/xxx:59416"}	
...

Picture-1: An orphan producer was left in old broker, it is not associated with any topic/replicator Screenshot 2024-01-24 at 15 16 44

Picture-2: After the topic is transferred to new broker, it can not start a new Replicator successfully Screenshot 2024-01-24 at 15 14 42

Since the scenario is too complex, I can not add a test. But I reproduced the Scenario 1 locally. Screenshot 2024-01-23 at 00 36 48 Screenshot 2024-01-23 at 00 35 33

https://github.com/apache/pulsar/pull/21948 fixes the following issues:

  • How to perfectly start a new Replicator after the replication has been enabled again.
  • How to perfectly start a new Replicator when calling topic.unfenceTopicToResume after topic.close failed.

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: x

poorbarcode avatar Jan 22 '24 20:01 poorbarcode

@poorbarcode Does this PR fix the issue mentioned in https://github.com/apache/pulsar/pull/21203 ?

Jason918 avatar Jan 26 '24 08:01 Jason918

@Jason918

@poorbarcode Does this PR fix the issue mentioned in https://github.com/apache/pulsar/pull/21203 ?

Yes, the current PR also fixed the issue that https://github.com/apache/pulsar/pull/21203 tries to fix.

poorbarcode avatar Jan 28 '24 16:01 poorbarcode

@codelipenghui

And it looks like we can simplify the fix by adding a new method terminate() to the replicator so that we don't need to mix the closeProducer and closeReplicator logic.

Sure, renamed disconnect to terminate.

Is it possible to add a test to cover this case?

I will try to write a test laster

poorbarcode avatar Jan 29 '24 14:01 poorbarcode

@codelipenghui

Is it possible to add a test to cover this case?

The test testConcurrencyOfUnloadBundleAndRecreateProducer has been added, please take a look ❤️

poorbarcode avatar Jan 31 '24 17:01 poorbarcode

/pulsarbot run-failure-checks

codelipenghui avatar Feb 01 '24 12:02 codelipenghui

Rebase master

poorbarcode avatar Apr 15 '24 16:04 poorbarcode

Because there are too many conflicts and there are no new releases for 2.11, do not cherry-pick this PR into 2.11, note: the following PRs need to be cherry-picked together when performing cherry-picking.

  • https://github.com/apache/pulsar/pull/21946
  • https://github.com/apache/pulsar/pull/22537
  • https://github.com/apache/pulsar/pull/21948
  • https://github.com/apache/pulsar/pull/17524
  • https://github.com/apache/pulsar/pull/22594

poorbarcode avatar May 07 '24 12:05 poorbarcode