pulsar
pulsar copied to clipboard
[fix][broker] Replication stuck when partitions count between two clusters is not the same
Motivation
Auto-creation rule of both clusters
cluster-c1:non-partitionedcluster-c2:partitionedtopic with2partitions
Issue 1 (Issue)
- Enable double-way replication
c1 -> c2&c2 -> c1. - Create a non-partitioned topic
tp1onc1.- The messages will be attempted to replicate into both partitions of
c2, but it will fail due to aClassCaseExceptionpic-1
- The messages will be attempted to replicate into both partitions of
- At the moment, the topics on both clusters are below:
non-partitionedtopictp-1partitionedtopictp-1with2partitions.
- The new partitions
tp-1-partition-0andtp-1-partition-1on the clusterc2will also start replication.- Replicator will trigger topic auto-creation for
c1
- Replicator will trigger topic auto-creation for
- Finally, the topics on both clusters are below:
c1:non-partitionedtopictp-1non-partitionedtopictp-1-partition-0non-partitionedtopictp-1-partition-1
c2:partitionedtopictp-1with2partitions.
- The new partitions
tp-1-partition-0andtp-1-partition-1on the clusterc1will also start replication.- Since the replicator of
non-partitionedtopictp-1is already connected to the remote cluster, the new topics' replicator can not be created successfully and it will get an errorProducer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1
- Since the replicator of
Issue 2
- Enable one-way replication
c1 -> c2. - Create a non-partitioned topic
tp1onc1.- The messages will be attempted to replicate into both partitions of
c2, but it will fail due to aClassCaseExceptionpic-1
- The messages will be attempted to replicate into both partitions of
- Finally, the topics on both clusters are below:
c1:non-partitionedtopictp-1c2:partitionedtopictp-1with2partitions.
pic-1:
Modifications
Replicator will only trigger a non-partitioned topic creation.
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
Matching PR in forked repository
PR in forked repository: x
When a topic type is different(non-partitioned/partitioned) on the local and remote clusters, the pulsar-client has been thrown Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1, I think the user has to handle this error manually, not depend on the pulsar to fix that.
@poorbarcode You mentioned "break-change" in the PR's title. Do you mean this PR will introduce break change or it fixed a break change? If it will fix a break change, could you please link the change which introduced the break change?
@codelipenghui
@poorbarcode You mentioned "break-change" in the PR's title. Do you mean this PR will introduce break change or it fixed a break change? If it will fix a break change, could you please link the change which introduced the break change?
Sorry, I assumed that the code class cast was added by https://github.com/apache/pulsar/pull/21946, if so, issue-2 described in Motivation can successfully copy data to c2 even though there are two partitions on the remote cluster.
I checked the history commits, and the class cast code was added when creating the class AbstractReplicator. I ran a test with the tag before https://github.com/apache/pulsar/pull/21946, and it does not work also.
So there is no break change( all the two scenarios can not work ), I removed the label break-change.
@nodece
When a topic type is different(non-partitioned/partitioned) on the local and remote clusters, the pulsar-client has been thrown Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1, I think the user has to handle this error manually, not depend on the pulsar to fix that.
Yes, it does. see here
if (metadata.partitions != 0) {
log.error("[{}] The partitions count between two clusters is not the same(remote partitions: {}),"
+ " the replicator can not be created successfully: {}", replicatorId, metadata.partitions,
state);
// This exception will be caught below, so it can be any typed.
checkPartitionsSameFuture.completeExceptionally(new RuntimeException(replicatorId
+ "Can not replicate data to a partitioned topic."));
}
This PR only changes the scenario that there is no topic on the remote cluster. And makes the error logs clearer
/pulsarbot rerun-failure-checks
Codecov Report
Attention: Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.
Project coverage is 73.45%. Comparing base (
bbc6224) to head (ad95488). Report is 449 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #22983 +/- ##
============================================
- Coverage 73.57% 73.45% -0.13%
+ Complexity 32624 2412 -30212
============================================
Files 1877 1914 +37
Lines 139502 143616 +4114
Branches 15299 15668 +369
============================================
+ Hits 102638 105489 +2851
- Misses 28908 30049 +1141
- Partials 7956 8078 +122
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 27.52% <42.85%> (+2.93%) |
:arrow_up: |
| systests | 24.69% <21.42%> (+0.36%) |
:arrow_up: |
| unittests | 72.51% <85.71%> (-0.33%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files | Coverage Δ | |
|---|---|---|
| ...ache/pulsar/broker/service/AbstractReplicator.java | 65.35% <100.00%> (-19.65%) |
:arrow_down: |
| ...ar/client/impl/conf/ProducerConfigurationData.java | 87.20% <100.00%> (-4.56%) |
:arrow_down: |
| ...roker/service/persistent/PersistentReplicator.java | 67.98% <50.00%> (-0.89%) |
:arrow_down: |
| ...rg/apache/pulsar/client/impl/PulsarClientImpl.java | 74.52% <90.47%> (+0.22%) |
:arrow_up: |
Since branch-3.2 does not contain PIP-344, so this PR can not cherry-pick into branch-3.2, removed the label release:3.2