bookkeeper icon indicating copy to clipboard operation
bookkeeper copied to clipboard

Ledger has been in IN_RECOVERY state and cannot be recovered

Open wolfstudy opened this issue 3 years ago • 2 comments

BUG REPORT

Describe the bug

Initially, we found that a topic in Pulsar could not do normal message production and sending operations. Therefore, querying the stats and stats-internal of the current topic returns no results, as follows:

Status:

{
  "msgRateIn" : 0.0,
  "msgThroughputIn" : 0.0,
  "msgRateOut" : 0.0,
  "msgThroughputOut" : 0.0,
  "bytesInCounter" : 0,
  "msgInCounter" : 0,
  "bytesOutCounter" : 0,
  "msgOutCounter" : 0,
  "averageMsgSize" : 0.0,
  "msgChunkPublished" : false,
  "storageSize" : 0,
  "backlogSize" : 0,
  "offloadedStorageSize" : 0,
  "publishers" : [ ],
  "subscriptions" : { },
  "replication" : { },
  "nonContiguousDeletedMessagesRanges" : 0,
  "nonContiguousDeletedMessagesRangesSerializedSize" : 0,
  "publishRateLimitedTimes" : 0,
  "metadata" : {
    "partitions" : 1
  },
  "partitions" : { }
}

Status-internal:

{
  "metadata" : {
    "partitions" : 0
  },
  "partitions" : { }
}

At this point, I went to query Bookie's log and found the following error message:

image

And the source code is here: https://github.com/apache/pulsar/blob/6704f12104/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerFactoryImpl.java#L366-L374

Here we can see that the current Ledger state has not been restored within the time of the default configuration managedLedgerMetadataOperationsTimeoutSeconds=60s.

So we tried to use the readledger command to find that the current Ledger stores nearly 700w of Entries, so when the Broker tries to load the topic, it will open the current ledger, but the Ledger is always in the state of IN_RECOVERY, So it will enter the following code logic:

if (doRecovery) {
            lh.recover(new OrderedGenericCallback<Void>(bk.getMainWorkerPool(), ledgerId) {
                @Override
                public void safeOperationComplete(int rc, Void result) {
                    if (rc == BKException.Code.OK) {
                        openComplete(BKException.Code.OK, lh);
                    } else if (rc == BKException.Code.UnauthorizedAccessException) {
                        closeLedgerHandle();
                        openComplete(BKException.Code.UnauthorizedAccessException, null);
                    } else {
                        closeLedgerHandle();
                        openComplete(bk.getReturnRc(BKException.Code.LedgerRecoveryException), null);
                    }
                }
                @Override
                public String toString() {
                    return String.format("Recover(%d)", ledgerId);
                }
            });
        }

However, since there are many Entry that need to be restored, we can see that the current Ledger has not been able to be restored, so the Topic on the Broker side cannot be loaded correctly, so the entire service has been unavailable.

To Reproduce

I don't know how this phenomenon came about, and I don't know how to reproduce this scene

Solution

We locally mocked a bookie client object, and then re-fetched the properties of LedgerMetadata, forcing the current Ledger state to be reset from IN_RECOVERY to CLOSED state, and the problem recovered

wolfstudy avatar Jul 14 '22 08:07 wolfstudy

are you sure that this is a bookkeeper problem ? do you have any logs from the bookkeeper client ? "org.apache.bookkeeper" loggers (excluding Pulsar mledger logs)

eolivelli avatar Jul 14 '22 12:07 eolivelli

I have a question what is ledger's config EnsembleSize=? WriteQuorum=? AckQuorum=?

StevenLuMT avatar Jul 21 '22 17:07 StevenLuMT