[fix][broker][WIP] Don't check replication clusters for ownership of system topics
WIP, not ready for review. PR opened in draft mode for initial discussion and feedback.
Motivation
When investigating another issue and reproducing the problem using these instructions, https://github.com/apache/pulsar/pull/21948#issuecomment-2078264388, I noticed this WARN log message:
2024-04-26T09:34:01,535 - WARN - [pulsar-io-35-3:ServerCnx] - Failed to get Partitioned Metadata [/127.0.0.1:56820] persistent://pulsar/global/removeClusterTest/__change_events: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
org.apache.pulsar.broker.web.RestException: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$27(PulsarWebResource.java:922) ~[classes/:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$29(PulsarWebResource.java:912) ~[classes/:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
at org.apache.pulsar.broker.web.PulsarWebResource.checkLocalOrGetPeerReplicationCluster(PulsarWebResource.java:896) ~[classes/:?]
at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.unsafeGetPartitionedTopicMetadataAsync(PersistentTopicsBase.java:4304) ~[classes/:?]
at org.apache.pulsar.broker.service.ServerCnx.lambda$handlePartitionMetadataRequest$7(ServerCnx.java:610) ~[classes/:?]
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:684) [?:?]
at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:662) [?:?]
at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2168) [?:?]
at org.apache.pulsar.broker.service.ServerCnx.handlePartitionMetadataRequest(ServerCnx.java:607) [classes/:?]
at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:134) [pulsar-common-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
Modifications
- Don't limit system topics with the
checkLocalOrGetPeerReplicationClusterchecks.
Additional Context
This is related to #20304 changes since that made some relaxation on the checks.
Documentation
- [ ]
doc - [ ]
doc-required - [x]
doc-not-needed - [ ]
doc-complete
@poorbarcode @heesung-sn Do you think that relaxing the checkLocalOrGetPeerReplicationCluster checks for system topics would be useful?
I realized that Pulsar has some loopholes that unnecessarily try to replicate, migrate or do other operations on system topics.
We need to check all system topics and define their behaviors for each feature.
System namespace and system topics
- load report
- blue-green migration
- geo-replication
- cross-cluster ownership check
- and others
I am checking the system topics from the new load balancer.
To answer your questions, Im not sure. I assume all system topics should live in local cluster only,but I could be wrong.
Same exception with current master branch version 93afd89b047
2024-04-29T08:47:36,525 - WARN - [pulsar-io-35-4:ServerCnx] - Failed to get Partitioned Metadata [/127.0.0.1:61388] persistent://pulsar/global/removeClusterTest/__change_events: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
org.apache.pulsar.broker.web.RestException: Namespace missing local cluster name in clusters list: local_cluster=r1 ns=pulsar/global/removeClusterTest clusters=[r2, r3]
at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$27(PulsarWebResource.java:922) ~[classes/:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$29(PulsarWebResource.java:912) ~[classes/:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) ~[?:?]
at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) ~[?:?]
at org.apache.pulsar.broker.web.PulsarWebResource.checkLocalOrGetPeerReplicationCluster(PulsarWebResource.java:896) ~[classes/:?]
at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.unsafeGetPartitionedTopicMetadataAsync(PersistentTopicsBase.java:4307) ~[classes/:?]
at org.apache.pulsar.broker.service.ServerCnx.lambda$handlePartitionMetadataRequest$7(ServerCnx.java:610) ~[classes/:?]
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:684) [?:?]
at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:662) [?:?]
at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2168) [?:?]
at org.apache.pulsar.broker.service.ServerCnx.handlePartitionMetadataRequest(ServerCnx.java:607) [classes/:?]
at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:134) [pulsar-common-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.108.Final.jar:4.1.108.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.108.Final.jar:4.1.108.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4
.1.108.Final.jar:4.1.108.Final]
It seems that __change_events has special handling in replication:
https://github.com/apache/pulsar/blob/93afd89b047ac56d3b7e476f578993197cf41935/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/SystemTopic.java#L69-L74
https://github.com/apache/pulsar/blob/93afd89b047ac56d3b7e476f578993197cf41935/pulsar-common/src/main/java/org/apache/pulsar/common/naming/SystemTopicNames.java#L80-L85
Therefore I'm not sure if the solution in this PR is correct or not.