pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix][broker][WIP] Don't check replication clusters for ownership of system topics

Open lhotari opened this issue 1 year ago • 5 comments

WIP, not ready for review. PR opened in draft mode for initial discussion and feedback.

Motivation

When investigating another issue and reproducing the problem using these instructions, https://github.com/apache/pulsar/pull/21948#issuecomment-2078264388, I noticed this WARN log message:

2024-04-26T09:34:01,535 - WARN  - [pulsar-io-35-3:ServerCnx] - Failed to get Partitioned Metadata [/127.0.0.1:56820] persistent://pulsar/global/removeClusterTest/__change_events: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
org.apache.pulsar.broker.web.RestException: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
        at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$27(PulsarWebResource.java:922) ~[classes/:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
        at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$29(PulsarWebResource.java:912) ~[classes/:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
        at org.apache.pulsar.broker.web.PulsarWebResource.checkLocalOrGetPeerReplicationCluster(PulsarWebResource.java:896) ~[classes/:?]
        at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.unsafeGetPartitionedTopicMetadataAsync(PersistentTopicsBase.java:4304) ~[classes/:?]
        at org.apache.pulsar.broker.service.ServerCnx.lambda$handlePartitionMetadataRequest$7(ServerCnx.java:610) ~[classes/:?]
        at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:684) [?:?]
        at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:662) [?:?]
        at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2168) [?:?]
        at org.apache.pulsar.broker.service.ServerCnx.handlePartitionMetadataRequest(ServerCnx.java:607) [classes/:?]
        at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:134) [pulsar-common-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]

Modifications

  • Don't limit system topics with the checkLocalOrGetPeerReplicationCluster checks.

Additional Context

This is related to #20304 changes since that made some relaxation on the checks.

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

lhotari avatar Apr 26 '24 06:04 lhotari

@poorbarcode @heesung-sn Do you think that relaxing the checkLocalOrGetPeerReplicationCluster checks for system topics would be useful?

lhotari avatar Apr 26 '24 06:04 lhotari

I realized that Pulsar has some loopholes that unnecessarily try to replicate, migrate or do other operations on system topics.

We need to check all system topics and define their behaviors for each feature.

System namespace and system topics

  • load report
  • blue-green migration
  • geo-replication
  • cross-cluster ownership check
  • and others

I am checking the system topics from the new load balancer.

heesung-sohn avatar Apr 26 '24 07:04 heesung-sohn

To answer your questions, Im not sure. I assume all system topics should live in local cluster only,but I could be wrong.

heesung-sohn avatar Apr 26 '24 07:04 heesung-sohn

Same exception with current master branch version 93afd89b047

2024-04-29T08:47:36,525 - WARN  - [pulsar-io-35-4:ServerCnx] - Failed to get Partitioned Metadata [/127.0.0.1:61388] persistent://pulsar/global/removeClusterTest/__change_events: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
org.apache.pulsar.broker.web.RestException: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
        at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$27(PulsarWebResource.java:922) ~[classes/:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
        at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$29(PulsarWebResource.java:912) ~[classes/:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
        at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
        at org.apache.pulsar.broker.web.PulsarWebResource.checkLocalOrGetPeerReplicationCluster(PulsarWebResource.java:896) ~[classes/:?]
        at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.unsafeGetPartitionedTopicMetadataAsync(PersistentTopicsBase.java:4307) ~[classes/:?]
        at org.apache.pulsar.broker.service.ServerCnx.lambda$handlePartitionMetadataRequest$7(ServerCnx.java:610) ~[classes/:?]
        at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:684) [?:?]
        at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:662) [?:?]
        at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2168) [?:?]
        at org.apache.pulsar.broker.service.ServerCnx.handlePartitionMetadataRequest(ServerCnx.java:607) [classes/:?]
        at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:134) [pulsar-common-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.108.Final.jar:4.1.108.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.108.Final.jar:4.1.108.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4
.1.108.Final.jar:4.1.108.Final]

lhotari avatar Apr 29 '24 05:04 lhotari

It seems that __change_events has special handling in replication:

https://github.com/apache/pulsar/blob/93afd89b047ac56d3b7e476f578993197cf41935/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/SystemTopic.java#L69-L74

https://github.com/apache/pulsar/blob/93afd89b047ac56d3b7e476f578993197cf41935/pulsar-common/src/main/java/org/apache/pulsar/common/naming/SystemTopicNames.java#L80-L85

Therefore I'm not sure if the solution in this PR is correct or not.

lhotari avatar Apr 29 '24 06:04 lhotari