pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix][broker] Replication stuck when partitions count between two clusters is not the same

Open poorbarcode opened this issue 1 year ago • 4 comments

Motivation

Auto-creation rule of both clusters

  • cluster-c1: non-partitioned
  • cluster-c2: partitioned topic with 2 partitions

Issue 1 (Issue)

  • Enable double-way replication c1 -> c2 & c2 -> c1.
  • Create a non-partitioned topic tp1 on c1.
    • The messages will be attempted to replicate into both partitions of c2, but it will fail due to a ClassCaseExceptionpic-1
  • At the moment, the topics on both clusters are below:
    • non-partitioned topic tp-1
    • partitioned topic tp-1 with 2 partitions.
  • The new partitions tp-1-partition-0 and tp-1-partition-1 on the cluster c2 will also start replication.
    • Replicator will trigger topic auto-creation for c1
  • Finally, the topics on both clusters are below:
    • c1:
      • non-partitioned topic tp-1
      • non-partitioned topic tp-1-partition-0
      • non-partitioned topic tp-1-partition-1
    • c2: partitioned topic tp-1 with 2 partitions.
  • The new partitions tp-1-partition-0 and tp-1-partition-1 on the cluster c1 will also start replication.
    • Since the replicator of non-partitioned topic tp-1 is already connected to the remote cluster, the new topics' replicator can not be created successfully and it will get an error Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1

Issue 2

  • Enable one-way replication c1 -> c2.
  • Create a non-partitioned topic tp1 on c1.
    • The messages will be attempted to replicate into both partitions of c2, but it will fail due to a ClassCaseExceptionpic-1
  • Finally, the topics on both clusters are below:
    • c1: non-partitioned topic tp-1
    • c2: partitioned topic tp-1 with 2 partitions.

pic-1: Screenshot 2024-06-27 at 00 20 45

Modifications

Replicator will only trigger a non-partitioned topic creation.

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: x

poorbarcode avatar Jun 26 '24 16:06 poorbarcode

When a topic type is different(non-partitioned/partitioned) on the local and remote clusters, the pulsar-client has been thrown Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1, I think the user has to handle this error manually, not depend on the pulsar to fix that.

nodece avatar Jun 28 '24 06:06 nodece

@poorbarcode You mentioned "break-change" in the PR's title. Do you mean this PR will introduce break change or it fixed a break change? If it will fix a break change, could you please link the change which introduced the break change?

codelipenghui avatar Jun 28 '24 06:06 codelipenghui

@codelipenghui

@poorbarcode You mentioned "break-change" in the PR's title. Do you mean this PR will introduce break change or it fixed a break change? If it will fix a break change, could you please link the change which introduced the break change?

Sorry, I assumed that the code class cast was added by https://github.com/apache/pulsar/pull/21946, if so, issue-2 described in Motivation can successfully copy data to c2 even though there are two partitions on the remote cluster.

I checked the history commits, and the class cast code was added when creating the class AbstractReplicator. I ran a test with the tag before https://github.com/apache/pulsar/pull/21946, and it does not work also.

So there is no break change( all the two scenarios can not work ), I removed the label break-change.

poorbarcode avatar Jun 28 '24 07:06 poorbarcode

@nodece

When a topic type is different(non-partitioned/partitioned) on the local and remote clusters, the pulsar-client has been thrown Producer with name 'pulsar.repl.c1--> c2' is already connected to topic tp-1-partition-0/tp-1-partition-1, I think the user has to handle this error manually, not depend on the pulsar to fix that.

Yes, it does. see here

if (metadata.partitions != 0) {
    log.error("[{}] The partitions count between two clusters is not the same(remote partitions: {}),"
                    + " the replicator can not be created successfully: {}", replicatorId, metadata.partitions,
            state);
    // This exception will be caught below, so it can be any typed.
    checkPartitionsSameFuture.completeExceptionally(new RuntimeException(replicatorId
            + "Can not replicate data to a partitioned topic."));
}

This PR only changes the scenario that there is no topic on the remote cluster. And makes the error logs clearer

poorbarcode avatar Jun 28 '24 07:06 poorbarcode

/pulsarbot rerun-failure-checks

poorbarcode avatar Jul 05 '24 07:07 poorbarcode

Codecov Report

Attention: Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.

Project coverage is 73.45%. Comparing base (bbc6224) to head (ad95488). Report is 449 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22983      +/-   ##
============================================
- Coverage     73.57%   73.45%   -0.13%     
+ Complexity    32624     2412   -30212     
============================================
  Files          1877     1914      +37     
  Lines        139502   143616    +4114     
  Branches      15299    15668     +369     
============================================
+ Hits         102638   105489    +2851     
- Misses        28908    30049    +1141     
- Partials       7956     8078     +122     
Flag Coverage Δ
inttests 27.52% <42.85%> (+2.93%) :arrow_up:
systests 24.69% <21.42%> (+0.36%) :arrow_up:
unittests 72.51% <85.71%> (-0.33%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ache/pulsar/broker/service/AbstractReplicator.java 65.35% <100.00%> (-19.65%) :arrow_down:
...ar/client/impl/conf/ProducerConfigurationData.java 87.20% <100.00%> (-4.56%) :arrow_down:
...roker/service/persistent/PersistentReplicator.java 67.98% <50.00%> (-0.89%) :arrow_down:
...rg/apache/pulsar/client/impl/PulsarClientImpl.java 74.52% <90.47%> (+0.22%) :arrow_up:

... and 493 files with indirect coverage changes

codecov-commenter avatar Jul 15 '24 15:07 codecov-commenter

Since branch-3.2 does not contain PIP-344, so this PR can not cherry-pick into branch-3.2, removed the label release:3.2

poorbarcode avatar Jul 15 '24 15:07 poorbarcode