[Bug] Uneven consumption across partitions during consumption performance load tests
Search before reporting
- [x] I searched in the issues and found nothing similar.
Read release policy
- [x] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
User environment
Broker 3.0.10, JDK 17, client 3.0.10
Issue Description
When running consumption performance load tests with a relatively large number of consumers per partition (over 1000), consumption becomes imbalanced: the consumption rate of some partitions drops significantly, overall throughput declines, and a message backlog eventually builds up on certain partitions.
Error messages
no error
Reproducing the issue
When running consumption performance load tests with a relatively large number of consumers per partition (over 1000), consumption becomes imbalanced: the consumption rate of some partitions drops significantly, overall throughput declines, and a message backlog eventually builds up on certain partitions.
Additional information
No response
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
> Reproducing the issue
>
> When running consumption performance load tests with a relatively large number of consumers per partition (over 1000), consumption becomes imbalanced: the consumption rate of some partitions drops significantly, overall throughput declines, and a message backlog eventually builds up on certain partitions.

It would be useful to provide more practical details.

> Broker 3.0.10, JDK 17, client 3.0.10
@g0715158 Please retest with Pulsar 4.1.2, running on Java 21.0.9 (or newer). Pulsar 4.1.x comes with "PIP-430: Pulsar Broker Cache Improvements: Refactoring Eviction and Adding a New Cache Strategy Based on Expected Read Count", which improves performance significantly in the scenario you have described.
Before PIP-430, when consumers "drop off the tail", entries are no longer served from the cache and have to be retrieved all the way from BookKeeper, which adds extra load and latency to the system. With PIP-430, you still need to ensure that the broker cache is large enough to hold the in-progress entries (batch messages). For a large number of partitions, you need a large broker cache size (configured with managedLedgerCacheSizeMB). You should also set managedLedgerMaxReadsInFlightSizeInMB to a reasonable amount.
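For illustration, a minimal broker.conf sketch of the two settings mentioned above. The values are placeholders, not recommendations; size them against the broker's available direct memory, the message/batch sizes, and the number of partitions with active consumers:

```properties
# Broker cache for tailing and catch-up reads. Size it so that the
# in-progress entries (batch messages) of all actively dispatched
# partitions fit into it. (illustrative value)
managedLedgerCacheSizeMB=4096

# Upper bound on the total size of concurrent reads from BookKeeper kept
# in broker memory. Disabled (0) by default. (illustrative value)
managedLedgerMaxReadsInFlightSizeInMB=512
```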
You didn't mention how many subscriptions you have. Are there a lot of consumers on a single shared subscription, so that each partition has one shared subscription?
@lhotari Currently, my goal is to identify the root cause of the partition imbalance in version 3.0.10. With approximately 1000 consumers per partition, performance degrades after about 9 minutes of consumption, and partition imbalance can be observed in the console. Has this kind of situation come up in any existing issues?
> Currently, my goal is to identify the root cause of the partition imbalance in version 3.0.10.
@g0715158 In the OSS project, we don't maintain specific patch versions such as 3.0.10. The 3.0.x line continues to be maintained, but the latest released version is 3.0.15. For this case, please attempt to reproduce with 4.1.2 as I suggested before. If you cannot reproduce it there, that's a lot more information for identifying the root cause in version 3.0.10.
> With approximately 1000 consumers per partition, performance degrades after about 9 minutes of consumption, and partition imbalance can be observed in the console. Has this kind of situation come up in any existing issues?
Yes, I've seen that happen. The imbalance is common, but the case where consuming stops completely might be a different issue such as #24926.
However, in the stats that you shared, there are many cases where the backlog is 0 for partitions that have an out rate of 0. Is this a test scenario where producers are actively producing and consumers are keeping up, or a "catch-up" scenario where consumers are draining an existing backlog?
In your test scenario, you didn't mention anything about the client side. How many separate client instances and/or client connections do you have? How well is the client side tuned? For example, https://pulsar.apache.org/docs/next/client-libraries-java-setup/#java-client-performance ?
When you are creating a large number of Java client instances in a single JVM, it's necessary to share resources. There's an example in branch-4.0 in this test: https://github.com/apache/pulsar/blob/branch-4.0/pulsar-broker/src/test/java/org/apache/pulsar/client/api/PatternConsumerBackPressureMultipleConsumersTest.java#L238-L280.
For 4.1+, there's PIP-234 PulsarClientSharedResources: https://github.com/apache/pulsar/blob/270120ce6e33e5a084397ca31186f1bb87835e48/pulsar-broker/src/test/java/org/apache/pulsar/client/api/PatternConsumerBackPressureMultipleConsumersTest.java#L103-L123
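As a rough sketch of the pre-PIP-234 approach (this is not the PulsarClientSharedResources API): share a single, tuned PulsarClient instance across all consumers in the JVM instead of creating many client instances. The service URL, topic, subscription name and all numeric values below are placeholders, not recommendations:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SizeUnit;
import org.apache.pulsar.client.api.SubscriptionType;

public class SharedClientConsumers {
    public static void main(String[] args) throws PulsarClientException {
        // One PulsarClient per JVM: all consumers share its IO threads,
        // listener threads, connection pool and client-side memory limit.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker-host:6650")   // placeholder
                .ioThreads(8)                              // tune to the host's cores
                .listenerThreads(8)
                .connectionsPerBroker(4)
                .memoryLimit(256, SizeUnit.MEGA_BYTES)
                .build();

        List<Consumer<byte[]>> consumers = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            consumers.add(client.newConsumer()
                    .topic("persistent://tenant/ns/perf-topic")  // placeholder topic
                    .subscriptionName("perf-sub")                // one shared subscription
                    .subscriptionType(SubscriptionType.Shared)
                    .receiverQueueSize(100)  // lower this for large (e.g. 500 KB) messages
                    .subscribe());
        }
        // ... run the load test, then close the consumers and the client ...
    }
}
```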
If you are actively producing to partitions in the test, one common issue is the producing side: it might not produce evenly across partitions. One way to address this is to produce to specific partitions individually (*-partition-0, *-partition-1, ...) in the load generator and to use a sufficient number of separate producer nodes so that the bottleneck isn't in the producing clients.
On the producer side, using a multi-topic (partitioned) producer will also introduce more variance across partitions due to the default use of https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/RoundRobinPartitionMessageRouterImpl.java.
This behavior can be controlled with https://github.com/apache/pulsar/blob/cc5e479d63103f81e3af833e8b06227d1a6563e1/pulsar-client-api/src/main/java/org/apache/pulsar/client/api/ProducerBuilder.java#L462-L474.
The defaults are time based for both routing and batching. For testing purposes, it could be better to use count-based routing with a partitioned producer, configure batching with a long batchingMaxPublishDelay, and rely on batchingMaxMessages to get similarly sized batches each time; see the sketch below.
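As a sketch of that idea, assuming a custom MessageRouter is used for strict count-based round-robin (this is not a built-in router) together with size-driven batching. The topic, payload size and all numbers are placeholders:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageRouter;
import org.apache.pulsar.client.api.MessageRoutingMode;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.TopicMetadata;

public class EvenLoadProducer {

    /** Strict round-robin by message count, instead of the default time-based switching. */
    static class CountBasedRoundRobinRouter implements MessageRouter {
        private final AtomicLong counter = new AtomicLong();

        @Override
        public int choosePartition(Message<?> msg, TopicMetadata metadata) {
            return (int) (counter.getAndIncrement() % metadata.numPartitions());
        }
    }

    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker-host:6650")   // placeholder
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://tenant/ns/perf-topic")   // placeholder partitioned topic
                .messageRoutingMode(MessageRoutingMode.CustomPartition)
                .messageRouter(new CountBasedRoundRobinRouter())
                .enableBatching(true)
                // Long max publish delay so batches close on message count, not time,
                // producing similarly sized batches for every partition.
                .batchingMaxPublishDelay(1, TimeUnit.SECONDS)
                .batchingMaxMessages(100)
                .blockIfQueueFull(true)
                .create();

        byte[] payload = new byte[1024];   // adjust to the tested message size
        for (int i = 0; i < 1_000_000; i++) {
            producer.sendAsync(payload);
        }
        producer.flush();
        producer.close();
        client.close();
    }
}
```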
On the dispatch side, I have had some doubts about this logic as a possible source of imbalance: https://github.com/apache/pulsar/blob/ea56ada4f3985c93b93c64d1361b3111cd98a37f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractDispatcherMultipleConsumers.java#L77-L226
@lhotari Thank you so much for your help! When I ran load tests with 1 KB small messages, performance was solid and stable (only the partition imbalance occurred), with a TPS of around 250K. However, when consuming 500 KB large messages, performance degrades after a few minutes, together with severe partition consumption imbalance. The most critical issue is that the consumption rate metrics for the partitions vanish entirely: neither pulsar-manager API queries nor the pulsar_rate_out metric return any data (this affects all brokers). Other metrics appear normal, and no thread blockages were detected in the broker thread dumps. Additionally, no priority distinctions were configured for the consumers.
Have you configured managedLedgerMaxReadsInFlightSizeInMB to a reasonable value? The feature isn't enabled by default. It was added in 2.11 with #18245 and improved later in #23901. There are some details about broker memory limits in PIP-442: https://github.com/apache/pulsar/blob/master/pip/pip-442.md#existing-broker-memory-management.
If you haven't configured managedLedgerMaxReadsInFlightSizeInMB, the broker can get overloaded with a large number of consumers. You can try setting it to a value like 500 MB when you have at least 4 GB of direct memory available.
> Other metrics appear normal, and no thread blockages were detected in the broker thread dumps.
What metrics are you checking? If you want to get proper performance, you should be optimizing broker cache hit rate. There are dashboards such as https://github.com/lhotari/pulsar-grafana-dashboards/blob/master/pulsar/broker-cache.json and https://github.com/lhotari/pulsar-grafana-dashboards/blob/master/pulsar/broker-cache-by-broker.json for this purpose.
Before Pulsar 4.1 and PIP-430, there are workarounds for tuning the broker cache; see #23466.
Please also check #24695
@lhotari We have not enabled this parameter; managedLedgerMaxReadsInFlightSizeInMB is set to 0. Currently, the broker has 20 GB of heap memory and 40 GB of off-heap (direct) memory available. Would it be appropriate for us to set this parameter to 2 GB?
Yes, I think around 2 GB to 4 GB is reasonable in that case. (Btw, the limit isn't an exact limit; more direct memory than that will get used.)
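For reference, a minimal broker.conf sketch of that change; 2048 MB matches the roughly 2 GB discussed above and should be adjusted to your own direct memory budget:

```properties
# Caps the total size of concurrent BookKeeper reads held in broker memory.
# 0 (the default) leaves the limit disabled. The cap isn't exact; actual
# direct memory usage can exceed it.
managedLedgerMaxReadsInFlightSizeInMB=2048
```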