Set minPartitionsAutoDiscoveryInterval to prevent partition metadata lookups from overwhelming brokers
Motivation
This issue is described in the client application's PR: https://github.com/signalfx/splunk-otel-collector/pull/2185
On a 2.7 broker, a producer configured with a very short partition auto-discovery interval can overwhelm the broker with partition metadata lookups. We have observed very high CPU usage. In an extreme case, a broker can run at 100% CPU even without any topic loaded. The broker stack trace looks like:
org.apache.pulsar.broker.service.PulsarCommandSenderImpl.sendPartitionMetadataResponse(PulsarCommandSenderImpl.java:65)
at org.apache.pulsar.broker.service.ServerCnx.lambda$null$7(ServerCnx.java:455)
at org.apache.pulsar.broker.service.ServerCnx$$Lambda$666/0x000000084070a440.apply(Unknown Source)
at java.util.concurrent.CompletableFuture.uniHandle(java.base/CompletableFuture.java:930)
at java.util.concurrent.CompletableFuture.uniHandleStage(java.base/CompletableFuture.java:946)
at java.util.concurrent.CompletableFuture.handle(java.base/CompletableFuture.java:2266)
at org.apache.pulsar.broker.service.ServerCnx.lambda$handlePartitionMetadataRequest$8(ServerCnx.java:452)
at org.apache.pulsar.broker.service.ServerCnx$$Lambda$661/0x0000000840708040.apply(Unknown Source)
at java.util.concurrent.CompletableFuture.uniApplyNow(java.base/CompletableFuture.java:680)
at java.util.concurrent.CompletableFuture.uniApplyStage(java.base/CompletableFuture.java:658)
at java.util.concurrent.CompletableFuture.thenApply(java.base/CompletableFuture.java:2094)
at org.apache.pulsar.broker.service.ServerCnx.handlePartitionMetadataRequest(ServerCnx.java:449)
at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:122)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1534)
at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1295)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1332)
Modifications
Set one second as the floor value for the PartitionsAutoDiscoveryInterval. This prevents the client from issuing partition metadata lookups at a very high frequency.
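As a rough illustration of the floor described above, the following is a minimal sketch and not the actual patch. The class name `IntervalFloor`, the constant `MIN_INTERVAL_SECONDS`, and the method `effectiveIntervalSeconds` are hypothetical names chosen for this example. It also assumes the floor is applied by clamping; the real change may enforce it differently, for example by rejecting the value.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the Pulsar client API.
class IntervalFloor {
    // Assumed floor of one second, as stated in the Modifications section.
    static final long MIN_INTERVAL_SECONDS = 1;

    // Converts the requested interval to seconds and clamps it to the floor,
    // so a sub-second (or zero) interval cannot trigger rapid-fire lookups.
    static long effectiveIntervalSeconds(long interval, TimeUnit unit) {
        long seconds = unit.toSeconds(interval);
        return Math.max(MIN_INTERVAL_SECONDS, seconds);
    }

    public static void main(String[] args) {
        // A 10 ms request is raised to the one-second floor.
        System.out.println(effectiveIntervalSeconds(10, TimeUnit.MILLISECONDS));
        // A 60 s request is left unchanged.
        System.out.println(effectiveIntervalSeconds(60, TimeUnit.SECONDS));
    }
}
```

Clamping, rather than throwing, keeps existing client configurations working while still bounding the lookup rate. Whether the actual change clamps or rejects is not stated in this description.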
Verifying this change
- [x] Make sure that the change passes the CI checks.
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API: (no)
- The schema: (no)
- The default values of configurations: (no)
- The wire protocol: (no)
Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
It might be nice in the future to add a field partitionsAutoDiscoveryIntervalSeconds and deprecate the current one, so that the time unit is clearer to the user.
+1