pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[Bug] Bookkeeper returns mac mismatch while reading

Open KannarFr opened this issue 2 years ago • 19 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Version

2.11.3 to 3.1.0

Minimal reproduce step

N/A

What did you expect to see?

No error

What did you see instead?

After bumped I got a lot of exceptions as:

2023-10-23T15:01:29,964+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Mac mismatch while reading L16702 E0 from bookie: yo-bookkeeper-c3-n3
2023-10-23T15:01:29,965+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L16702 E0-E0, Sent to [yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n3, yo-bookkeeper-c3-n7], Heard from [yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n3, yo-bookkeeper-c3-n7] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (-1, rc = null)
2023-10-23T15:01:29,966+0000 [main] ERROR org.apache.bookkeeper.client.BookKeeperAdmin - Error reading entry 0 from ledger 16702
org.apache.bookkeeper.client.BKException$BKDigestMatchException: Entry digest does not match
        at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.client.SyncCallbackUtils$SyncReadCallback.readComplete(SyncCallbackUtils.java:229) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.client.LedgerHandle$4.onFailure(LedgerHandle.java:818) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.94.Final.jar:4.1.94.Final]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Entry digest does not match -  ledger=103622248 - operation=Failed to read entry - entry=0
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:711)
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$Functions.lambda$getLedgerEntry$0(BookkeeperSchemaStorage.java:655)
        at org.apache.bookkeeper.client.LedgerHandle$4.onFailure(LedgerHandle.java:818)
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38)
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:833)

Anything else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

KannarFr avatar Oct 23 '23 15:10 KannarFr

I'm rollbacking brokers (only) to 2.11.

EDIT: It fixed the issue. So there is an issue with rendering on the client side on v2.11+.

KannarFr avatar Oct 23 '23 15:10 KannarFr

A bit more context

2023-10-23T13:35:47,556+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L103556923 E0-E0, Sent to [yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n4], Heard from [yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n4] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (-1, rc = null)
2023-10-23T13:35:47,556+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/192.168.2.5:48818][persistent://user_421c40fa-009e-433c-a93b-fc77551a2406/logs/__change_events-partition-0][multiTopicsReader-02b7f937a6] Failed to create consumer: consumerId=2104, Entry digest does not match -  ledger=103556923 - operation=Failed to read entry - entry=0

Where Read of ledger entry failed: L**103556923** E0-E0 really looks like trying to read an empty ledger.

$ bookkeeper shell ledgermetadata -ledgerid 103556923
2023-10-23T21:28:51,808+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=5, writeQuorumSize=3, ackQuorumSize=2, state=CLOSED, length=8336, lastEntryId=0, digestType=CRC32C, password=base64:, ensembles={0=[yo-bookkeeper-c3-n4, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n9, yo-bookkeeper-c3-n15]}, customMetadata={component=base64:c2NoZW1h, application=base64:cHVsc2Fy, pulsar/schemaId=base64:dXNlcl80MjFjNDBmYS0wMDllLTQzM2MtYTkzYi1mYzc3NTUxYTI0MDYvbG9ncy9fX2NoYW5nZV9ldmVudHM=}}

KannarFr avatar Oct 23 '23 21:10 KannarFr

We hit the same errors on Pulsar v3.0.1

miton18 avatar Oct 26 '23 09:10 miton18

We hit the same errors on Pulsar v3.0.1

@miton18 Do you have any way to reproduce this issue?

hangc0276 avatar Oct 26 '23 09:10 hangc0276

We hit the same errors on Pulsar v3.0.1

@miton18 Do you have any way to reproduce this issue?

sadly, no

It happens when upgrading brokers to v3.x.x

miton18 avatar Oct 26 '23 09:10 miton18

We hit the same errors on Pulsar v3.0.1

@miton18 Do you have any way to reproduce this issue?

sadly, no

It happens when upgrading brokers to v3.x.x

Only upgrade broker, doesn't upgrade the bookie, right?

hangc0276 avatar Oct 26 '23 09:10 hangc0276

bookies has been upgraded to v3.1.0 without issues

miton18 avatar Oct 26 '23 09:10 miton18

ok, trying to read a ledger from bookkeeper itself also throw the same error...

023-10-26T09:40:26,062+0000 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=5, writeQuorumSize=3, ackQuorumSize=3, state=CLOSED, length=8336, lastEntryId=0, digestType=CRC32C, password=base64:, ensembles={0=[yo-bookkeeper-c3-n12, yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n3]}, customMetadata={component=base64:c2NoZW1h, application=base64:cHVsc2Fy, pulsar/schemaId=base64:b3JnYV8yMWQ3ZjZkMy04Mzk3LTQ3ODItOWE2NS05MWVjYWMxYTQ1OTUvbG9ncy9fX2NoYW5nZV9ldmVudHM=}}

miton18 avatar Oct 26 '23 09:10 miton18

Looks fixed in 3.1.1.

KannarFr avatar Oct 30 '23 22:10 KannarFr

There are some reports on Pulsar Slack that this issue happens when bookkeeperUseV2WireProtocol=false or TLS is enabled for Bookkeeper. @hangc0276 I wonder what the possible bug fix has been if this is already fixed?

lhotari avatar Nov 09 '23 11:11 lhotari

Looks fixed in 3.1.1.

@KannarFr Have you checked with 3.0.XX version. Basically Its working, if we set bookkeeperUseV2WireProtocol=true in broker configmap.

Debashish-Mallick avatar Nov 14 '23 07:11 Debashish-Mallick

No, I didn't but should I use v2 (legacy) wire proto? I mean this can fix the issue, but this is a jump in the past, nope? :D

KannarFr avatar Nov 14 '23 12:11 KannarFr

Looks fixed in 3.1.1.

@KannarFr Have you checked with 3.0.XX version. Basically Its working, if we set bookkeeperUseV2WireProtocol=true in broker configmap.

@lhotari Could you please help me to understand , are there any impact If we make bookkeeperUseV2WireProtocol=true ?

Debashish-Mallick avatar Nov 20 '23 06:11 Debashish-Mallick

Could you please help me to understand , are there any impact If we make bookkeeperUseV2WireProtocol=true

@Debashish-Mallick bookkeeperUseV2WireProtocol=true is the recommended default with Pulsar:

https://github.com/apache/pulsar/blob/9b643c899cbc99b2ead131aed45388c99615a8e7/conf/broker.conf#L973-L974

However, when using bookkeeperUseV2WireProtocol=true, I believe that TLS is not supported. See https://github.com/apache/bookkeeper/pull/2300 and https://github.com/apache/pulsar-helm-chart/blob/8cb3c18377e547e5de4ca19e63138ca1387aef59/charts/pulsar/templates/broker-configmap.yaml#L131-L147 for more information.

I am not exactly sure about the impact since I haven't tested this configuration myself. My concern is that TLS wouldn't be used when the intention is to use TLS between brokers and bookies.

However, it's possible that https://github.com/apache/bookkeeper/pull/2300 change added support for TLS for v2, but it's hard to know the full context of that change. One possibility is that it's completely valid and recommended to use bookkeeperUseV2WireProtocol=true, also with TLS after the BK PR 2300 fix that was made a long time ago.

lhotari avatar Nov 20 '23 10:11 lhotari

bookkeeperUseV2WireProtocol=false is mandatory to have a working SQL setup https://pulsar.apache.org/docs/3.1.x/sql-deployment-configurations/#get-the-last-message-in-a-topic

miton18 avatar Nov 20 '23 13:11 miton18

https://github.com/apache/bookkeeper/pull/4140 has figured out the issue. It's a regression from 3.0.

codelipenghui avatar Dec 02 '23 01:12 codelipenghui

@hangc0276 @poorbarcode Do I understand correctly that https://github.com/apache/bookkeeper/pull/4140 will be released in bookkeeper version 4.16.4 ? What is the schedule for the Bookkeeper 4.16.4 release with the fix? I also asked at https://lists.apache.org/thread/tcy93c8vrb475j64tdotlklxblrcm728 .

lhotari avatar Jan 17 '24 19:01 lhotari

Bookkeeper 4.16.4 has been upgraded in #21983 . Closing this issue since that should contain the fix.

lhotari avatar Jan 29 '24 20:01 lhotari

Reopening until latest concerns about other checksum related bugs in BK are resolved.

lhotari avatar Jan 29 '24 23:01 lhotari

Pending fix in Bookkeeper will be included in 4.16.5 release. https://github.com/apache/bookkeeper/pull/4196

lhotari avatar Mar 29 '24 11:03 lhotari

We are using 4.16.3 and encountering this error with 1-2 events occurring randomly on a daily basis, and we have a workaround to skip the specific position and resend the data to recover it. May I know if you are aware of any other workarounds available before the 4.16.5 release?

pqab avatar Apr 10 '24 06:04 pqab

We are using 4.16.3 and encountering this error with 1-2 events occurring randomly on a daily basis, and we have a workaround to skip the specific position and resend the data to recover it. May I know if you are aware of any other workarounds available before the 4.16.5 release?

@pqab I'm not aware of a workaround. There's also a concern that there are some other problems (#22103 with some linked BK issue) that cause similar symptoms. BK 4.16.5 has been released and there's PR to upgrade in Pulsar, #22484. This will be included in Pulsar releases 3.0.5 and 3.2.3 . ETA for the release is within a few weeks. In the meantime, it's possible to build custom builds of Pulsar.

lhotari avatar Apr 11 '24 14:04 lhotari

I believe that this issue is a duplicate of #22601. I have made a fix to bookkeeper which fixes the issue: https://github.com/apache/bookkeeper/pull/4404

lhotari avatar May 28 '24 23:05 lhotari