pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[Bug] Ledger can not recover with Digest Mismatch Error

Open TakaHiR07 opened this issue 8 months ago • 5 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

client version : 2.9.5 broker version : 3.0.5 bookie version : 4.16.5

Minimal reproduce step

We are doing perf test in pulsar new version.

use pulsar-perf to produce to a topic with 100 partitions, and simulate that broker restart frequently because of direct memory OOM.

However, after running for long time, the topic become unavailable. And we find that one partition can not recover because of digest mismatch.

We can reproduce this issue, it occurs almost every three days.

broker OOM : 企业微信截图_da32b943-b638-4290-80b6-cd9ee9998383

topic become unavailable : 企业微信截图_59e63b26-3787-4278-afea-7180f75f061e

What did you expect to see?

ledger recover success.

What did you see instead?

From the broker log, we see that ledger 6877081 can not recover with digest mismatch. ledger is wrote into 3 quorums without ensemble-change, the config E-Qw-Qa is 3-3-2.

2024-06-04T12:12:17,773+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 6877081, entry-id: -1
2024-06-04T12:12:17,774+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - Mac mismatch for ledger: 6877081, entry: -1 while reading last entry from bookie: ip3:3185
2024-06-04T12:12:17,777+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 6877081, entry-id: -1
2024-06-04T12:12:17,777+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - Mac mismatch for ledger: 6877081, entry: -1 while reading last entry from bookie: ip1:3181
2024-06-04T12:12:17,781+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 6877081, entry-id: -1
2024-06-04T12:12:17,781+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - Mac mismatch for ledger: 6877081, entry: -1 while reading last entry from bookie: ip2:3188

2024-06-04T12:12:17,781+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 6877081 did not hear success responses from all quorums, QuorumCoverage(e:3,w:3,a:2) = [0, 0, 0]
2024-06-04T12:12:17,781+0800 [BookKeeperClientWorker-OrderedExecutor-11-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [xxx/xxx/persistent/xxxxx-partition-0] Failed to open ledger 6877081: Error while recovering ledger

We have added so many logs to trace the entry, both in broker and bookie. And we add the calculation of entry's md5, aiming to find out where the entry has been changed. The logs and description are as follows.

broker's log :

// entry-326.  broker has wrote 3 same entry into netty channel
2024-06-04T11:42:47,140+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - wtx. lId:6877081, eId:326, rBytes:280374, ctx:ip1/ip1:3181, isAct:true
2024-06-04T11:42:47,281+0800 [pulsar-io-23-10] INFO  org.apache.bookkeeper.util.ByteBufList$Encoder - enc wte. md5:14a10e2af87517e18ae7af4fb0837b26, rBytes:24,36,280338,
2024-06-04T11:42:47,348+0800 [pulsar-io-23-19] INFO  org.apache.bookkeeper.util.ByteBufList$Encoder - enc wte. md5:14a10e2af87517e18ae7af4fb0837b26, rBytes:24,36,280338,
2024-06-04T11:42:47,431+0800 [pulsar-io-23-1] INFO  org.apache.bookkeeper.util.ByteBufList$Encoder - enc wte. md5:14a10e2af87517e18ae7af4fb0837b26, rBytes:24,36,280338,

// entry-332. only 2 encoder has wrote 2 same entry into netty channel. This is corresponding to bookie logs, two bookie has wrote 332. However, one bookie' entry is with error md5.
2024-06-04T11:42:47,264+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] DEBUG org.apache.bookkeeper.proto.checksum.DigestManager - lId:6877081, eId:332, toSend.byt:288707. head.byt:36, head.cap:36, head.ind:0  data.byt:288671, data.cap:524288, data.ind:26 wrap.byt:288671, wrap.cap:524288, wrap.ind:26,
2024-06-04T11:42:47,390+0800 [pulsar-io-23-19] INFO  org.apache.bookkeeper.util.ByteBufList$Encoder - enc wte. md5:0ba0cfe0a90b0b9185bb98cf0a62fef1, rBytes:24,36,288671,
2024-06-04T11:42:47,451+0800 [pulsar-io-23-20] INFO  org.apache.bookkeeper.util.ByteBufList$Encoder - enc wte. md5:0ba0cfe0a90b0b9185bb98cf0a62fef1, rBytes:24,36,288671,


// entry-336. No encoder log.
2024-06-04T11:42:47,276+0800 [BookKeeperClientWorker-OrderedExecutor-12-0] DEBUG org.apache.bookkeeper.proto.checksum.DigestManager - lId:6877081, eId:336, toSend.byt:286996. head.byt:36, head.cap:36, head.ind:0  data.byt:286960, data.cap:1048576, data.ind:577218 wrap.byt:286960, wrap.cap:1048576, wrap.ind:577218,

// broker OOM, hang out and restart
2024-06-04T11:42:47,462+0800 [BookKeeperClientWorker-OrderedExecutor-7-0] DEBUG org.apache.bookkeeper.proto.checksum.DigestManager - lId:6877095, eId:103, toSend.byt:295125. head.byt:36, head.cap:36, head.ind:0  data.byt:295089, data.cap:2097152, data.ind:862513 wrap.byt:295089, wrap.

2024-06-04T11:43:58,125+0800 [main] INFO  org.apache.pulsar.broker.PulsarService - Starting Pulsar Broker service

ip1:3181 (bookie1)

// finally write to entry-327
2024-06-04T11:42:47,464+0800 [bookie-io-1-6] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@326, lac:315, rBytes:280374, isRec:false, ctx:/broker-ip:54828, isAct:true, md5:14a10e2af87517e18ae7af4fb0837b26
2024-06-04T11:42:48,110+0800 [bookie-io-1-6] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@327, lac:315, rBytes:276621, isRec:false, ctx:/broker-ip:54828, isAct:true, md5:4041e2e6d9db97cc87879210fed093bb

ip2:3188(bookie2)

// finally write into entry-332
2024-06-04T11:42:47,388+0800 [bookie-io-1-7] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@326, lac:315, rBytes:280374, isRec:false, ctx:/broker-ip:46648, isAct:true, md5:14a10e2af87517e18ae7af4fb0837b26
2024-06-04T11:42:47,390+0800 [bookie-io-1-7] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@327, lac:315, rBytes:276621, isRec:false, ctx:/broker-ip:46648, isAct:true, md5:4f8e094d002ca5f0c56166fd9648f679

2024-06-04T11:42:48,086+0800 [bookie-io-1-7] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@332, lac:315, rBytes:288707, isRec:false, ctx:/broker-ip:46648, isAct:true, md5:3fb182464e4e3ba2d286b8cd62349eab

ip3:3185(bookie3)

// finally write into entry-336
2024-06-04T11:42:47,350+0800 [bookie-io-1-4] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@326, lac:315, rBytes:280374, isRec:false, ctx:/broker-ip:40636, isAct:true, md5:14a10e2af87517e18ae7af4fb0837b26
2024-06-04T11:42:47,351+0800 [bookie-io-1-4] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@327, lac:315, rBytes:276621, isRec:false, ctx:/broker-ip:40636, isAct:true, md5:4f8e094d002ca5f0c56166fd9648f679

2024-06-04T11:42:47,403+0800 [bookie-io-1-4] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@332, lac:315, rBytes:288707, isRec:false, ctx:/broker-ip:40636, isAct:true, md5:0ba0cfe0a90b0b9185bb98cf0a62fef1
2024-06-04T11:42:48,978+0800 [bookie-io-1-4] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 6877081@336, lac:315, rBytes:286996, isRec:false, ctx:/broker-ip:40636, isAct:true, md5:ca496cb654fb638e9d883b3a58bef66f

From the above log, we find that

  • entry'md5 is correct in broker. But between "broker encoder -> netty channel -> bookie", the entry's md5 become error.
  • this case may occur when broker OOM and hang out and restart .

Anything else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

TakaHiR07 avatar Jun 14 '24 07:06 TakaHiR07