
[Bug] Auto topic delete of partitioned topics has a MetadataCacheImpl race condition

Open darinspivey opened this issue 2 months ago • 19 comments

Search before reporting

  • [x] I searched in the issues and found nothing similar.

Read release policy

  • [x] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

We are using partitioned topics with 6 partitions, and we have auto topic deletion turned on. When the GC fires to delete the partitions, they mostly fire at the same time, as expected. This appears to sometimes cause a race condition when they all attempt to update the topic's metadata. This has a cascading effect:

  • The metadata update never completes, even with jitter and retries
  • The topic deletion eventually fails as a whole because the race condition leaves 1 or more partitions behind
  • The partition(s) become orphaned, and the parent topic can never be deleted because GC does not appear to try again after a period of time

System info

/pulsar $ pulsar version
Current version of pulsar is: 4.0.6
Git Revision 4538ef7645c45a3c8686092128fde6c5d61c762b
Git Branch branch-4.0
Built by Lari Hotari <[email protected]> on Laris-MBP.lan at 2025-07-30T13:37:25+0300

/pulsar $ uname -a
Linux pulsar-broker-0 6.12.46-66.121.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 22 16:35:59 UTC 2025 x86_64 GNU/Linux

Relevant helm values (Chart pulsar-4.2.0)

    brokerDeleteInactiveTopicsEnabled: 'true'
    brokerDeleteInactiveTopicsFrequencySeconds: '60'
    brokerDeleteInactiveTopicsMaxInactiveDurationSeconds: '60'
    brokerDeleteInactiveTopicsMode: delete_when_no_subscriptions
    brokerDeleteInactivePartitionedTopicMetadataEnabled: 'true'
    allowAutoTopicCreation: 'true'
    defaultNumPartitions: '6'
    allowAutoTopicCreationType: 'partitioned'
  • Note: other settings for auto-deleting subscriptions work fine. The problem happens when the topic itself attempts auto-deletion (TLDR: delete_when_no_subscriptions itself is working fine).

Relevant Logs

Oct 18 06:31:02 pulsar-broker-1 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2 Moving to FencedForDeletion state
Oct 18 06:31:02 pulsar-broker-1 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2] Remove ManagedLedger
Oct 18 06:31:02 pulsar-broker-1 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleting path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2 (v. Optional.empty)
Oct 18 06:31:02 pulsar-broker-1 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleted path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2 (v. Optional.empty)
Oct 18 06:31:02 pulsar-broker-1 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2] Successfully deleted managed ledger
Oct 18 06:31:02 pulsar-broker-1 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2] Topic deleted
Oct 18 06:31:03 pulsar-broker-1 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75] Delete topic metadata failed because another partition exist.
Oct 18 06:31:03 pulsar-broker-1 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-2] Skip to delete partitioned topic: Another partition exists for [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75].
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [pulsar-inactivity-monitor-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0] Global topic inactive for 60 seconds, closed repl producers
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [pulsar-inactivity-monitor-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3] Global topic inactive for 60 seconds, closed repl producers
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [pulsar-inactivity-monitor-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4] Global topic inactive for 60 seconds, closed repl producers
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.BrokerService - Successfully delete authentication policies for topic persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0 Moving to FencedForDeletion state
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.BrokerService - Successfully delete authentication policies for topic persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3 Moving to FencedForDeletion state
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.BrokerService - Successfully delete authentication policies for topic persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4 Moving to FencedForDeletion state
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0] Remove ManagedLedger
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleting path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0 (v. Optional.empty)
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3] Remove ManagedLedger
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleting path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3 (v. Optional.empty)
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4] Remove ManagedLedger
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleting path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4 (v. Optional.empty)
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleted path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0 (v. Optional.empty)
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleted path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3 (v. Optional.empty)
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleted path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4 (v. Optional.empty)
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0] Successfully deleted managed ledger
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0] Topic deleted
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3] Successfully deleted managed ledger
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3] Topic deleted
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4] Successfully deleted managed ledger
Oct 18 06:31:05 pulsar-broker-2 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4] Topic deleted
Oct 18 06:31:06 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783466079 ms
Oct 18 06:31:06 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783466180 ms
Oct 18 06:31:06 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75] Delete topic metadata failed because another partition exist.
Oct 18 06:31:06 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 10 ms. Mandatory stop: false. Elapsed time: 703 ms
Oct 18 06:31:06 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75] Delete topic metadata failed because another partition exist.
Oct 18 06:31:06 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783466983 ms
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 20 ms. Mandatory stop: false. Elapsed time: 902 ms
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-3] Skip to delete partitioned topic: Another partition exists for [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75].
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 10 ms. Mandatory stop: false. Elapsed time: 184 ms
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 19 ms. Mandatory stop: false. Elapsed time: 213 ms
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75] Delete topic metadata failed because another partition exist.
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-4] Skip to delete partitioned topic: Another partition exists for [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75].
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783467281 ms
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-0] Skip to delete partitioned topic: Another partition exists for [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75].
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [pulsar-inactivity-monitor-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Global topic inactive for 60 seconds, closed repl producers
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [pulsar-inactivity-monitor-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Global topic inactive for 60 seconds, closed repl producers
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.BrokerService - Successfully delete authentication policies for topic persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1 Moving to FencedForDeletion state
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.pulsar.broker.service.BrokerService - Successfully delete authentication policies for topic persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [broker-topic-workers-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5 Moving to FencedForDeletion state
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Remove ManagedLedger
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleting path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1 (v. Optional.empty)
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Remove ManagedLedger
Oct 18 06:31:13 pulsar-broker-0 pulsar-broker [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleting path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5 (v. Optional.empty)
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleted path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1 (v. Optional.empty)
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.impl.AbstractMetadataStore - Deleted path: /managed-ledgers/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5 (v. Optional.empty)
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Successfully deleted managed ledger
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Topic deleted
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Successfully deleted managed ledger
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Topic deleted
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783474475 ms
Oct 18 06:36:14 pulsar-broker-0 pulsar-broker WARN [delayer-47-1] WARN  org.apache.pulsar.client.admin.internal.BaseResource - [http://pulsar-broker-0.pulsar-broker.pulsar.svc.cluster.local:8080/admin/v2/persistent/ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75/partitions?force=false&deleteSchema=true] Failed to perform http delete request: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Request timeout
Oct 18 06:36:14 pulsar-broker-0 pulsar-broker WARN [metadata-store-10-1] WARN  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Inactive topic deletion failed
java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.PulsarAdminException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Request timeout
	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeRelay(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at org.apache.pulsar.client.admin.internal.BaseResource$4.failed(BaseResource.java:237) ~[org.apache.pulsar-pulsar-client-admin-original-4.0.6.jar:4.0.6]
	at org.glassfish.jersey.client.JerseyInvocation$1.failed(JerseyInvocation.java:898) ~[org.glassfish.jersey.core-jersey-client-2.42.jar:?]
Oct 18 06:36:14 pulsar-broker-0 pulsar-broker WARN [delayer-47-1] WARN  org.apache.pulsar.client.admin.internal.BaseResource - [http://pulsar-broker-0.pulsar-broker.pulsar.svc.cluster.local:8080/admin/v2/persistent/ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75/partitions?force=false&deleteSchema=true] Failed to perform http delete request: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Request timeout
Oct 18 06:36:14 pulsar-broker-0 pulsar-broker WARN [metadata-store-10-1] WARN  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Inactive topic deletion failed
java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.PulsarAdminException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Request timeout
	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeRelay(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at org.apache.pulsar.client.admin.internal.BaseResource$4.failed(BaseResource.java:237) ~[org.apache.pulsar-pulsar-client-admin-original-4.0.6.jar:4.0.6]

The logs clearly show that the partitions enter deletion GC around the same time, and most partitions clean up fine (some logs of other partitions' successes may have been omitted here). The attempts to delete the parent metadata fail, as expected, until the last partition is gone. However, the metadata update failure appears to prevent that from ever happening.

Issue Description

What happened

Using auto topic deletion with partitioned topics fails sporadically due to a race condition. The failures cause orphaned topic partitions and the parent topic is never deleted.

What did you expect to happen

All partitions can successfully be deleted, along with the parent metadata, even if they all are GC'd at the same time.

Why is this a bug

The logs clearly show a race condition that orphans 1 or more topic partitions. This causes the topic to never be deleted, and GC never retries it. This appears to be a distributed bug, since partitions can be spread across many brokers, each with its own GC loop; therefore, setting execution threads to 1 isn't a solution either. The value used for jitter is also not exposed, but even if it were, I'm not sure that would help. It seems reasonable that the retries the system already performs should be able to complete the metadata update at some point, given that it keeps trying for many minutes after the initial failure.

Reproducing the Error

Because this is a race condition, it's not possible to reproduce on demand. However, I see this problem on our production servers (which have hundreds of topics) on a daily basis, with up to dozens of orphaned partitions.
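For anyone wanting to watch for the same thing, a rough check like the following is enough to surface affected topics (a simplified sketch, not our exact tooling; it assumes pulsar-admin is on the PATH and just counts how many partition topics still exist for each partitioned topic, so anything stuck at a partial count long after GC should have removed it is an orphan candidate):

    # Rough sketch: report surviving -partition-N topics for each partitioned topic in a namespace
    NS="ourtenant/ourapp"
    existing=$(bin/pulsar-admin topics list "$NS")
    for t in $(bin/pulsar-admin topics list-partitioned-topics "$NS"); do
      n=$(printf '%s\n' "$existing" | grep -cF -- "${t}-partition-")
      echo "$t remaining-partitions=$n"
    done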

Error messages

Oct 18 06:31:14 pulsar-broker-0 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783474475 ms

Oct 18 06:36:14 pulsar-broker-0 pulsar-broker WARN [delayer-47-1] WARN  org.apache.pulsar.client.admin.internal.BaseResource - [http://pulsar-broker-0.pulsar-broker.pulsar.svc.cluster.local:8080/admin/v2/persistent/ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75/partitions?force=false&deleteSchema=true] Failed to perform http delete request: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Request timeout

Oct 18 06:36:14 pulsar-broker-0 pulsar-broker WARN [metadata-store-10-1] WARN  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Inactive topic deletion failed
java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.PulsarAdminException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Request timeout

Reproducing the issue

Topic auto deletion is a great feature, but without being able to rely 100% on all partitions of a topic being deleted, we will not be able to use it, because it leaves the system in an unclean state.

Additional information

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

darinspivey avatar Oct 21 '25 14:10 darinspivey

Could you please provide more logs between Oct 18 06:31:14 and Oct 18 06:36:14?

It seems persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1 and persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5 were fired at nearly the same time, the partitioned-topic deletion operations were invoked on broker-0 via the Admin API, and then both operations failed due to an HTTP request timeout. But it doesn't seem easy to figure out the reason for the timeout.

oneby-wang avatar Oct 23 '25 08:10 oneby-wang

Thank you for the reply. Absolutely I can provide more logs. This afternoon I'll have more logs as the topics should enter GC around 2pm today. I'll find another case and provide what I can. Does the GC process also use the admin api? It must, as I'm the only administrator for this dev cluster, so that wouldn't be run manually.

darinspivey avatar Oct 23 '25 14:10 darinspivey

Does the GC process also use the admin api?

Yes, after reading through the code, I found that partitioned-topic deletion is fired via the admin API. Once all partitions are deleted, the GC process fires the partitioned-topic deletion operation.

https://github.com/apache/pulsar/blob/88287345d3246fe0d6ea34e06389356ada516cfa/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L3498-L3500

I found that the following two partitions were fired at nearly the same time, and failed at nearly the same time after 5 minutes. Maybe this is the race condition.

Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Successfully deleted managed ledger
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-1] Topic deleted
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Successfully deleted managed ledger
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://ourtenant/ourapp/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75-partition-5] Topic deleted

It must, as I'm the only administrator for this dev cluster, so that wouldn't be run manually.

As mentioned above, the admin API is invoked by the GC process in code; you don't control this.

I found many ZooKeeper update conflicts in the logs, but the retry intervals are all very small, so I'm not sure how this update operation can still fail after retrying for 5 minutes.

Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 10 ms. Mandatory stop: false. Elapsed time: 184 ms
Oct 18 06:31:07 pulsar-broker-2 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 19 ms. Mandatory stop: false. Elapsed time: 213 ms
Oct 18 06:31:14 pulsar-broker-0 pulsar-broker INFO [metadata-store-10-1] INFO  org.apache.pulsar.metadata.cache.impl.MetadataCacheImpl - Update key /admin/partitioned-topics/ourtenant/ourapp/persistent/ourapp.v1.725feb06-ab85-11f0-9600-c297ef652e75 conflicts. Retrying in 5 ms. Mandatory stop: false. Elapsed time: 1760783474475 ms

runWithMarkDeleteAsync will update the ZooKeeper path /admin/partitioned-topics/your-tenant/your-ns/your-topic.

https://github.com/apache/pulsar/blob/88287345d3246fe0d6ea34e06389356ada516cfa/pulsar-broker-common/src/main/java/org/apache/pulsar/broker/resources/NamespaceResources.java#L360-L396

Methods to work around (a short command sketch follows the list):

  1. Try pulsar-admin topics delete-partitioned-topic, see https://pulsar.apache.org/reference/#/3.0.x/pulsar-admin/topics?id=delete-partitioned-topic. I read through the code; it should be an idempotent operation even if some (or all) partitions are already deleted.
  2. If the first method fails, delete the ZooKeeper path /admin/partitioned-topics/your-tenant/your-ns/your-topic. Before doing this, make sure every ZooKeeper path /managed-ledgers/your-tenant/your-ns/your-topic-partition-n has already been GC'd by Pulsar, where n is each partition number.
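Roughly, those two steps look like this (a sketch only, assuming the CLI tools shipped in the Pulsar image and the metadata paths shown in the logs above; double-check the paths for your own deployment before deleting anything):

    # Step 1: idempotent delete of the partitioned topic via the admin CLI
    bin/pulsar-admin topics delete-partitioned-topic persistent://your-tenant/your-ns/your-topic

    # Step 2 (only if step 1 fails): confirm the per-partition managed-ledger nodes are gone,
    # then remove the leftover partitioned-topic metadata node in ZooKeeper
    bin/pulsar zookeeper-shell -server <zk-host:2181> ls /managed-ledgers/your-tenant/your-ns/persistent
    bin/pulsar zookeeper-shell -server <zk-host:2181> delete /admin/partitioned-topics/your-tenant/your-ns/persistent/your-topic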

oneby-wang avatar Oct 24 '25 01:10 oneby-wang

Ok, I'm going to watch the next few days to try and get some new test cases. If I find some, I'll post gists of the full logs. Currently, I'm also doing some A-B testing. We upgraded to the latest Helm chart in development, which is 4.0.7. I'm seeing some different behavior:

  • On partitions that didn't get any data, those topics are deleted right after the subscriptions are removed. This is just fine, and I believe I saw this functionality noted in the release notes.
  • I am seeing the ZK "can't update key" errors, but it's still unclear if they're actually causing topics to hang. As you mentioned, the api failure (timeout) could be a separate issue.

Try pulsar-admin topics delete-partitioned-topic

I have done this in previous cases of dirty topic deletions, and sometimes it has told me "No such topic" indicating corruption somewhere in the metadata or overall tracking of the topic. I'll include this in future findings if it happens again.

I'll post back here with any more information for both versions 4.0.6 and 4.0.7. Thank you!

darinspivey avatar Oct 24 '25 13:10 darinspivey

This gist highlights 3 cases for version 4.0.6 where topic deletion is failing for some of the reasons we've mentioned:

  • metadata key update failures
  • http timeouts

It should be noted that this cluster is currently idle and has had no traffic for 24 hours. The topics left for GC were created before the traffic stopped, and I would have expected no problems deleting them. It seems that, overall, there are race conditions and deadlocking going on during the delete process across multiple brokers handling multiple partitions. I'll provide more gists for what I'm seeing in version 4.0.7. Thanks!

darinspivey avatar Oct 24 '25 14:10 darinspivey

Could you try adding the following configs to the broker.conf file and see whether the problem is solved?

  1. Add brokerClient_connectionsPerBroker=0 (unpooled), or
  2. Add brokerClient_connectionsPerBroker=n, where n is larger than your maximum number of partitions.

oneby-wang avatar Oct 24 '25 15:10 oneby-wang

Will do! Can I add that to configData in the helm chart? (EDIT: yes, that works)

darinspivey avatar Oct 24 '25 15:10 darinspivey

On partitions that didn't get any data, those topics are deleted right after the subscriptions are removed.

Do you mean a partitioned topic with just one partition, or a non-partitioned topic? From my analysis, a partitioned topic with just one partition still seems to have this problem.

My analysis of case-0 is: broker0 calls broker0 itself. Let's assume maxConnections is 1; then broker0 waits for itself to release the connection, and the result is a timeout.

https://github.com/apache/pulsar/blob/313ae974ef01b7ed295a03c93906ccf9daf82fd5/pulsar-client-admin/src/main/java/org/apache/pulsar/client/admin/internal/http/AsyncHttpConnector.java#L149-L158

Call chain: 1.gc process -> 2.delete partitioned-topic admin api(invoked on broker0) -> 3.delete topic admin api(invoked on broker0).

The concurrent race condition may be another case, so let's fix the self-call problem first. BTW, I'll try to submit a PR to avoid the third admin API call if this analysis holds.

oneby-wang avatar Oct 25 '25 00:10 oneby-wang

As you said, yes, it feels like there could be several things going on, so one thing at a time. After 2 days of monitoring, I see that adding brokerClient_connectionsPerBroker=10 might have been helping. I've now seen 2 days of topics getting deleted cleanly (we have a nightly test suite that creates lots of topics, which are then deleted the next day by GC--this is how I've been watching). That's great, and I don't want to call it fixed just yet, but for these 2 days, I haven't seen the http timeouts or orphaned topics. To be clear, if 0 is used for that value, does it do no connection pooling at all? We could have topics with 15 partitions in the future, so I'd rather not pick a static value that's too low. On the other hand, turning off pooling sounds like a bad idea. Do you have a suggestion there, or could it be something like 20 (I doubt we'd ever get a topic with that many partitions)?

Your analysis of case0 is interesting--I'm glad you see something to work with there. I'll watch for more cases and post them if there are any. Thanks!

darinspivey avatar Oct 26 '25 19:10 darinspivey

Do you have a suggestion about brokerClient_connectionsPerBroker?

My suggestion:

  1. brokerClient_connectionsPerBroker=0 is OK if you don't care much about performance. I read through the code and found that adminClient usage in the Pulsar broker is very limited, with a relatively low call frequency in the background. Unpooled means a new TCP connection on every admin API call.
  2. If you want to use the adminClient connection pool, just tune brokerClient_connectionsPerBroker to a bigger value. A few more connections in NIO programming won't affect performance much, and brokerClient_connectionsPerBroker=n is just a safe value. If there are m partitions of a partitioned topic calling the delete partitioned-topic admin API simultaneously (the race condition), you only need to ensure that brokerClient_connectionsPerBroker > m (EDIT: considering the topic lookup redirect request, m+1 is already enough). I found it a little bit difficult to solve this problem because of the topic lookup redirect request, which can only be invoked and redirected by the adminClient.

but for these 2 days, I haven't seen the http timeouts or orphaned topics.

A little bit confused, so the brokerClient_connectionsPerBroker works for you or not? What cases will cause http timeouts or orphaned topics?

I have done this in previous cases of dirty topic deletions, and sometimes it has told me "No such topic" indicating corruption somewhere in the metadata or overall tracking of the topic.

If you encounter this again, please provide some logs. I think pulsar-admin topics delete-partitioned-topic should be an idempotent operation.

oneby-wang avatar Oct 27 '25 02:10 oneby-wang

A little bit confused, so the brokerClient_connectionsPerBroker works for you or not?

Sorry for the confusion. YES, this change appears to have solved the http timeouts, which may be the sole reason for topics being orphaned. I've chosen to definitely use connection pooling, and set that value to 15, which should cover topics with more partitions, and may even be enough to prevent the race if I have partitions > 15. We'll see, but as of now, I don't see orphaned topics.

The race condition to update metadata (as originally posted in this report) still appears to happen, even for successful deletions. So, that may be an issue to deal with or not--you'd have to decide that. But I guess the http timeouts were the main cause of topics being orphaned?

On partitions that didn't get any data, those topics are deleted right after the subscriptions are removed.

What I meant by this was the recent fix in #24733. Since I'm now running Pulsar version 4.0.7, I see the effect of that change, which I think is that for low-throughput topics, not all partitions will have data. The partitions that have no data appear to be deleted as soon as the subscriptions are deleted (without waiting for the retention policy time). That's great, actually--less to manage. Ignore me here if you're confused by this; I was just being verbose :)

Call chain: 1.gc process -> 2.delete partitioned-topic admin api(invoked on broker0) -> 3.delete topic admin api(invoked on broker0). The concurrent race condition may be another case, so let's fix the self-call problem first. BTW, I'll try to submit a PR to avoid the third admin API call if this analysis holds.

Since I think brokerClient_connectionsPerBroker had a positive effect, I'm glad you've identified another area to look at with your comment above. Do you think you'll have a PR to fix that flow?

Thanks again for your help, we're looking good here I think!

darinspivey avatar Oct 27 '25 20:10 darinspivey

I'm now seeing something new related to the auto-deletion process: topics where all 6 partitions are deleted, but the parent metadata cannot be deleted because of an HTTP timeout. I'm not sure if this case would be handled by your open PR or not (thanks for that, by the way!)

I'll say that I've been messing around with writing a script that can detect unloaded topics, as they won't enter the GC loop. While doing that, I had restarted the brokers a few times, and you'll see the shutdown messages in the logs. I'm saying this because it may be a different code path for the "topic not loaded, trying to clean metadata" approach, but the http timeout is consistent in all of the examples.

I've made a new gist that highlights the issue. https://gist.github.com/darinspivey/1963b8a105fb9c7d1feabed73600970e

darinspivey avatar Oct 30 '25 20:10 darinspivey

EDIT: sorry for misleading, just forget what I said before in this comment(that was wrong).

Did you set brokerClient_connectionsPerBroker=15? Please check this config (you can find it in the logs when the broker starts up). I can only see 14 request failures in your log. Were there any HTTP calls at the same time that are not in the logs?

In still_happening1.txt, still_happening2.txt and still_happening4.txt, I see some topics triggered at nearly the same time (Oct 27 16:07:09~Oct 27 16:07:10); do these logs come from the same time window but get split into different files?

If the pooled connections are not working, please try brokerClient_connectionsPerBroker=0. If many partitioned topics are GC'd at the same time, that will also cause a connection pool deadlock.

I found the source code that limits concurrent connection acquisition.

https://github.com/apache/pulsar/blob/39bb67542f2a7b849acaff681d408c693e1a2a18/pulsar-client-admin/src/main/java/org/apache/pulsar/client/admin/internal/http/AsyncHttpConnector.java#L391-L400

I'm not sure if this case would be handled by your open PR or not.

Yes, I think it will solve the http timeout exception.

I'm saying this because it may be a different code path for the "topic not loaded, trying to clean metadata" approach

This seems to be somewhat expected behavior until this issue is solved: because the partition is deleted, the owning broker may unload that partition. I'm not very sure :).

https://github.com/apache/pulsar/blob/39bb67542f2a7b849acaff681d408c693e1a2a18/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1300-L1330

BTW, after this open PR is merged, I'll try to fix this race condition to avoid multiple calls to the partitioned-topic deletion admin API.

oneby-wang avatar Oct 31 '25 02:10 oneby-wang

Did you set brokerClient_connectionsPerBroker=15

I made a mistake. I DID set it in configData of the values.yaml for the broker; however, upon further inspection, I could not find the setting in broker.conf. After more digging, I realized I was not using the helm chart correctly: you have to prefix such settings with PULSAR_PREFIX_, as noted in the helm chart comments.
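For reference, the values fragment that ended up working looks roughly like this (a sketch, assuming the chart's broker configData map; quote the value so it is passed as a string):

    broker:
      configData:
        PULSAR_PREFIX_brokerClient_connectionsPerBroker: "15"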

After doing that, I see that it has been applied:

> kc logs pod/pulsar-broker-1 -n pulsar | grep connectionsPerBroker
Defaulted container "pulsar-broker" out of: pulsar-broker, wait-zookeeper-ready (init), wait-bookkeeper-ready (init)
[conf/broker.conf] Adding config brokerClient_connectionsPerBroker = 15

If there are many partitioned-topics gc at the same time, it will also cause connection pool deadlock.

Yes, because of our use of partitions, many topics will be deleted at the same time, which can cause a thundering herd. Mostly this is due to our test suite, which creates 100+ topics per run and stops using them all at once when the suite is done. Therefore, we have roughly 600 partitions deleting around the same time. By using a bigger pool of 15, is it possible that the retries will succeed even if all 15 connections are in use? I would think that some would free up as they complete, so retries may work? Or should we just consider moving to the non-pooled value of 0?

Thank you. I will continue to monitor.

darinspivey avatar Oct 31 '25 15:10 darinspivey

In my opinion, if there are m partitions across partitioned topics calling the delete partitioned-topic admin API simultaneously (not only the race condition within one topic, but across multiple topics), you need to configure brokerClient_connectionsPerBroker > m to ensure at least one connection frees up as calls complete. But it is difficult to estimate m.

oneby-wang avatar Nov 01 '25 00:11 oneby-wang

Update: A week ago, I set brokerClient_connectionsPerBroker = 0, because there will always be a race condition when using partitioned topics and I'm not sure I could ever figure out what m should be. In a last-ditch attempt, I tried turning off pooling. After doing so, I have not seen any orphaned metadata or partitions in our dev environment for the last 7 days.

darinspivey avatar Nov 09 '25 01:11 darinspivey

Hi, @darinspivey, recently, I re-read the creation code of PulsarAdminImpl in PulsarService.

https://github.com/apache/pulsar/blob/9d8bf601749d465e2394a1a0db96bfe6b70d13a5/pulsar-broker/src/main/java/org/apache/pulsar/broker/PulsarService.java#L1813-L1823

The default value of connectionsPerBroker is 16.

https://github.com/apache/pulsar/blob/1ca17972459095278e2b5f7ed7fd55c8921d8826/pulsar-client-admin/src/main/java/org/apache/pulsar/client/admin/internal/PulsarAdminBuilderImpl.java#L48-L51

Our earlier discussions might have had some misleading points. It looks like you only lowered the connectionsPerBroker parameter value to 15 (just below the default of 16), yet it had a positive effect. The number of broker-owned topics and the concurrency involved may have led us to a wrong conclusion.

I wonder if your logs have been redacted and processed, which would explain why I am not seeing the complete set of logs; otherwise, it does not make sense. Or maybe I missed code that sets this parameter elsewhere.

If our understanding doesn't align, you could use a tool like Arthas to grab the actual value of connectionsPerBroker.

https://github.com/apache/pulsar/blob/9d8bf601749d465e2394a1a0db96bfe6b70d13a5/pulsar-client-admin/src/main/java/org/apache/pulsar/client/admin/internal/http/AsyncHttpConnector.java#L194-L204

oneby-wang avatar Dec 07 '25 02:12 oneby-wang

Actually, I have now set the value to 0 to not do pooling at all and I had verified that at some point. That is the change that seemed to have a positive effect--setting it to 15 didn't work to eliminate the timeouts. In providing logs, I definitely cut out some and redacted a few pieces, but I do see this in the logs:

pulsar-broker-0 pulsar-broker [conf/broker.conf] Adding config brokerClient_connectionsPerBroker = 0

I had trouble finding the exact name to use in the helm chart, but PULSAR_PREFIX_brokerClient_connectionsPerBroker: '0' seemed to work. Are you just wanting to verify that value by seeing the full startup log?

darinspivey avatar Dec 09 '25 21:12 darinspivey

Actually, I have now set the value to 0 to not do pooling at all and I had verified that at some point. That is the change that seemed to have a positive effect--setting it to 15 didn't work to eliminate the timeouts.

Yeah, that makes sense.

oneby-wang avatar Dec 10 '25 01:12 oneby-wang