[Bug] Race Condition between Admin Recovery and Over-Replicated Ledger GC (on Destination BK)
BUG REPORT
Describe the bug
The bookkeeper shell recover command throws BKLedgerFencedException:
bin/bookkeeper shell recover pulsar-mini-bookie-4.pulsar-mini-bookie.pulsar.svc.cluster.local:3181
bookie:pulsar-mini-bookie-2.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 rc:EOK, ledger:20:entry:110:entryLength:23886
2025-09-25T20:09:23,397+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Add response from bookie:pulsar-mini-bookie-0.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 rc:EFENCED, ledger:7:entry:3900
2025-09-25T20:09:23,397+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error writing entry for ledgerId: 7, entryId: 3900, bookie: pulsar-mini-bookie-0.pulsar-mini-bookie.pulsar.svc.cluster.local:3181
org.apache.bookkeeper.client.BKException$BKLedgerFencedException: Ledger has been fenced off. Some other client must have opened it to read
at org.apache.bookkeeper.client.BKException.create(BKException.java:99) ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.client.LedgerFragmentReplicator$2.writeComplete(LedgerFragmentReplicator.java:357) ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.proto.PerChannelBookieClient$AddCompletion.writeComplete(PerChannelBookieClient.java:2200) ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.proto.PerChannelBookieClient$AddCompletion.handleResponse(PerChannelBookieClient.java:2257) ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.proto.PerChannelBookieClient$AddCompletion.handleV2Response(PerChannelBookieClient.java:2236) ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadV2ResponseCallback.run(PerChannelBookieClient.java:1407) ~[org.apache.bookkeeper-bookkeeper-server-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137) ~[org.apache.bookkeeper-bookkeeper-common-4.16.7.jar:4.16.7]
at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107) ~[org.apache.bookkeeper-bookkeeper-common-4.16.7.jar:4.16.7]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.122.Final.jar:4.1.122.Final]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
This ledger-fenced exception during recovery can be caused by a race condition between admin ledger recovery and the bookie's over-replicated-ledger GC. If admin ledger recovery is slow and the over-replicated-ledger GC happens to clean up ledgers that are still being recovered, the fenced exception is hit and further blocks the recovery.
To mitigate the issue, we can restart the destination bookies. We could also consider disabling the over-replicated-ledger GC during the decommission process (by setting gcOverreplicatedLedgerWaitTime=-1), for example as sketched below.
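For the Helm-based setup used in the repro config below, that workaround would roughly be the following values change (same bookkeeper.configData block as in the repro config; whether -1 is the right value to disable this GC should be double-checked against the BookKeeper configuration reference):

bookkeeper:
  configData:
    # disable over-replicated-ledger GC while the decommission / recover is running,
    # then restore the previous value afterwards
    gcOverreplicatedLedgerWaitTime: "-1"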
To Reproduce
Steps to reproduce the behavior:
Deploy a Pulsar cluster with e.g. 5 bookies and ensemble size = write quorum = 2, using the following configs.
bookkeeper:
  podMonitor:
    enabled: false
  replicaCount: 5
  configData:
    # minimal memory use for bookkeeper
    # https://bookkeeper.apache.org/docs/reference/config#db-ledger-storage-settings
    dbStorage_writeCacheMaxSizeMb: "32"
    dbStorage_readAheadCacheMaxSizeMb: "32"
    dbStorage_rocksDB_writeBufferSizeMB: "8"
    dbStorage_rocksDB_blockCacheSize: "8388608"
    httpServerEnabled: "true"
    gcWaitTime: "60000"
    gcOverreplicatedLedgerWaitTime: "30000"
    openLedgerRereplicationGracePeriod: "100"
broker:
  podMonitor:
    enabled: false
  replicaCount: 2
  configData:
    ## Enable `autoSkipNonRecoverableData` since bookkeeper is running
    ## without persistence
    autoSkipNonRecoverableData: "true"
    # storage settings
    managedLedgerDefaultEnsembleSize: "2"
    managedLedgerDefaultWriteQuorum: "2"
    managedLedgerDefaultAckQuorum: "2"
    managedLedgerMaxEntriesPerLedger: "50"
    managedLedgerMinLedgerRolloverTimeMinutes: "1"
# producer
bin/pulsar-admin tenants create my-tenant
bin/pulsar-admin namespaces create my-tenant/my-ns
bin/pulsar-admin namespaces set-retention my-tenant/my-ns --size -1 --time -1
bin/pulsar-admin topics create-partitioned-topic persistent://my-tenant/my-ns/test-topic2 -p 10
bin/pulsar-admin topics list my-tenant/my-ns
bin/pulsar-admin namespaces get-retention my-tenant/my-ns
bin/pulsar-perf produce persistent://my-tenant/my-ns/test-topic2 -r 10000
# mark bk-4 read-only (via the bookie HTTP admin API), then run the recover command against it
curl -X PUT -H "Content-Type: application/json" -d '{"readOnly": true}' http://localhost:8000/api/v1/bookie/state/readonly
bin/bookkeeper shell recover pulsar-mini-bookie-4.pulsar-mini-bookie.pulsar.svc.cluster.local:3181 -rate 1000000
This simulates the following case:
t0: L1.metadata = F1:[bk1,bk3], F2:[bk1,bk3]. L1 is replicated to bk1 and bk3 and has two fragments, F1 and F2.
t1: admin recover is started with src bk3 (to decommission bk3).
t2: admin recover copies L1:F1 to bk2.
t3: the over-replicated-ledger GC on bk2 kicks in (timed by gcOverreplicatedLedgerWaitTime).
t4: bk2 sees L1 in its local active ledger list, but L1.metadata still shows the L1:F1 ensemble as [bk1,bk3] (the recovery ledger copy and the metadata update are NOT transactional).
t5: bk2 deletes L1 locally because it thinks L1 is over-replicated on bk2.
t6: bk2 puts L1 into recentlyFencedAndDeletedLedgers in markIfConflictWritingOccurs (ledgerDescriptor.isFenced() throws an exception because the ledger index was removed first).
t7: admin recover continues and tries to copy L1:F2 to bk2. It hits the ledger-fenced exception from the recentlyFencedAndDeletedLedgers check for L1.
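A self-contained toy model of this interleaving (plain Java, no BookKeeper APIs; class and field names such as ToyBookie and recentlyFencedAndDeleted are illustrative only) shows why deleting the ledger locally between the two fragment copies turns the second copy into a fenced error:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OverReplicationRaceToy {

    // Destination bookie (bk2 in the timeline): a local ledger set plus the
    // "recently fenced and deleted" set that later writes are checked against.
    static class ToyBookie {
        final Set<Long> localLedgers = new HashSet<>();
        final Set<Long> recentlyFencedAndDeleted = new HashSet<>();

        void addEntry(long ledgerId) {
            if (recentlyFencedAndDeleted.contains(ledgerId)) {
                throw new IllegalStateException("EFENCED: ledger " + ledgerId);
            }
            localLedgers.add(ledgerId);
        }

        // Over-replicated GC: the ensemble in metadata does not mention this bookie,
        // so the local copy is deleted and the ledger is marked as fenced/deleted.
        void gcOverReplicated(long ledgerId, List<String> ensembleFromMetadata, String self) {
            if (localLedgers.contains(ledgerId) && !ensembleFromMetadata.contains(self)) {
                localLedgers.remove(ledgerId);
                recentlyFencedAndDeleted.add(ledgerId);
            }
        }
    }

    public static void main(String[] args) {
        ToyBookie bk2 = new ToyBookie();
        long l1 = 1L;
        // t0/t4: L1.metadata still lists only the original ensemble [bk1, bk3].
        List<String> metadataEnsemble = List.of("bk1", "bk3");

        bk2.addEntry(l1);                                    // t2: recover copies L1:F1 to bk2
        bk2.gcOverReplicated(l1, metadataEnsemble, "bk2");   // t3..t6: GC deletes L1 on bk2
        try {
            bk2.addEntry(l1);                                // t7: recover copies L1:F2 to bk2
        } catch (IllegalStateException e) {
            // same failure mode as the BKLedgerFencedException in the log above
            System.out.println("second fragment copy rejected: " + e.getMessage());
        }
    }
}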
Expected behavior
The bk recover command finishes without exception.
Additional context
One fix idea is that bookies could skip the over-replication GC for ledgers owned by "read-only" and "draining" bookies, because ledgers from such bookies might be actively replicated to other bookies by the admin recover command. The readonly update API could pass an additional bookie property, "draining:true", in the bookie info to share this draining state with other bookies. Then we could update the isNotBookieIncludedInLedgerEnsembles check like the following.
private boolean isNotBookieIncludedInLedgerEnsembles(Versioned<LedgerMetadata> metadata,
                                                     Set<BookieId> readOnlyAndDrainingBookies) {
    // do not delete a ledger that is not closed, since the ensemble might
    // change again and include the current bookie while we are deleting it
    if (!metadata.getValue().isClosed()) {
        return false;
    }
    SortedMap<Long, ? extends List<BookieId>> ensembles =
            metadata.getValue().getAllEnsembles();
    for (List<BookieId> ensemble : ensembles.values()) {
        // <------------ added logic ------>
        // if any bookie in the ensemble is read-only and draining, this ledger might
        // currently be replicated to this bookie by admin recover, so skip GC for it.
        for (BookieId bookieId : ensemble) {
            if (readOnlyAndDrainingBookies.contains(bookieId)) {
                return false;
            }
        }
        if (ensemble.contains(selfBookieAddress)) {
            return false;
        }
    }
    return true;
}
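To feed that check, the GC also needs the set of read-only bookies that are currently draining. Below is a minimal sketch of how that set could be collected, assuming a RegistrationClient is reachable from the GC and that the readonly update flow publishes the proposed "draining" property in BookieServiceInfo (that property does not exist today); the class and method names here are illustrative only.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutionException;

import org.apache.bookkeeper.discover.BookieServiceInfo;
import org.apache.bookkeeper.discover.RegistrationClient;
import org.apache.bookkeeper.net.BookieId;

class DrainingAwareGcHelper {

    // hypothetical property name, set by the readonly update API when draining is requested
    static final String DRAINING_PROPERTY = "draining";

    static Set<BookieId> collectReadOnlyAndDrainingBookies(RegistrationClient regClient)
            throws ExecutionException, InterruptedException {
        Set<BookieId> result = new HashSet<>();
        // read-only bookies are already exposed by the registration client
        for (BookieId bookieId : regClient.getReadOnlyBookies().get().getValue()) {
            BookieServiceInfo info = regClient.getBookieServiceInfo(bookieId).get().getValue();
            // only exclude ledgers of bookies that advertised the draining flag, i.e. an
            // admin recover / decommission is expected to be copying their ledgers away
            if ("true".equals(info.getProperties().get(DRAINING_PROPERTY))) {
                result.add(bookieId);
            }
        }
        return result;
    }
}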