pulsar
pulsar copied to clipboard
[Bug] Bookkeeper returns mac mismatch while reading
Search before asking
- [X] I searched in the issues and found nothing similar.
Version
2.11.3 to 3.1.0
Minimal reproduce step
N/A
What did you expect to see?
No error
What did you see instead?
After bumped I got a lot of exceptions as:
2023-10-23T15:01:29,964+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] INFO org.apache.bookkeeper.client.PendingReadOp - Mac mismatch while reading L16702 E0 from bookie: yo-bookkeeper-c3-n3
2023-10-23T15:01:29,965+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L16702 E0-E0, Sent to [yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n3, yo-bookkeeper-c3-n7], Heard from [yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n3, yo-bookkeeper-c3-n7] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (-1, rc = null)
2023-10-23T15:01:29,966+0000 [main] ERROR org.apache.bookkeeper.client.BookKeeperAdmin - Error reading entry 0 from ledger 16702
org.apache.bookkeeper.client.BKException$BKDigestMatchException: Entry digest does not match
at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
at org.apache.bookkeeper.client.SyncCallbackUtils$SyncReadCallback.readComplete(SyncCallbackUtils.java:229) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
at org.apache.bookkeeper.client.LedgerHandle$4.onFailure(LedgerHandle.java:818) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.94.Final.jar:4.1.94.Final]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Entry digest does not match - ledger=103622248 - operation=Failed to read entry - entry=0
at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:711)
at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$Functions.lambda$getLedgerEntry$0(BookkeeperSchemaStorage.java:655)
at org.apache.bookkeeper.client.LedgerHandle$4.onFailure(LedgerHandle.java:818)
at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38)
at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
Anything else?
No response
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
I'm rollbacking brokers (only) to 2.11.
EDIT: It fixed the issue. So there is an issue with rendering on the client side on v2.11+.
A bit more context
2023-10-23T13:35:47,556+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L103556923 E0-E0, Sent to [yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n4], Heard from [yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n4] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (-1, rc = null)
2023-10-23T13:35:47,556+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] WARN org.apache.pulsar.broker.service.ServerCnx - [/192.168.2.5:48818][persistent://user_421c40fa-009e-433c-a93b-fc77551a2406/logs/__change_events-partition-0][multiTopicsReader-02b7f937a6] Failed to create consumer: consumerId=2104, Entry digest does not match - ledger=103556923 - operation=Failed to read entry - entry=0
Where Read of ledger entry failed: L**103556923** E0-E0 really looks like trying to read an empty ledger.
$ bookkeeper shell ledgermetadata -ledgerid 103556923
2023-10-23T21:28:51,808+0000 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=5, writeQuorumSize=3, ackQuorumSize=2, state=CLOSED, length=8336, lastEntryId=0, digestType=CRC32C, password=base64:, ensembles={0=[yo-bookkeeper-c3-n4, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n9, yo-bookkeeper-c3-n15]}, customMetadata={component=base64:c2NoZW1h, application=base64:cHVsc2Fy, pulsar/schemaId=base64:dXNlcl80MjFjNDBmYS0wMDllLTQzM2MtYTkzYi1mYzc3NTUxYTI0MDYvbG9ncy9fX2NoYW5nZV9ldmVudHM=}}
We hit the same errors on Pulsar v3.0.1
We hit the same errors on Pulsar v3.0.1
@miton18 Do you have any way to reproduce this issue?
We hit the same errors on Pulsar v3.0.1
@miton18 Do you have any way to reproduce this issue?
sadly, no
It happens when upgrading brokers to v3.x.x
We hit the same errors on Pulsar v3.0.1
@miton18 Do you have any way to reproduce this issue?
sadly, no
It happens when upgrading brokers to v3.x.x
Only upgrade broker, doesn't upgrade the bookie, right?
bookies has been upgraded to v3.1.0 without issues
ok, trying to read a ledger from bookkeeper itself also throw the same error...
023-10-26T09:40:26,062+0000 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=5, writeQuorumSize=3, ackQuorumSize=3, state=CLOSED, length=8336, lastEntryId=0, digestType=CRC32C, password=base64:, ensembles={0=[yo-bookkeeper-c3-n12, yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n3]}, customMetadata={component=base64:c2NoZW1h, application=base64:cHVsc2Fy, pulsar/schemaId=base64:b3JnYV8yMWQ3ZjZkMy04Mzk3LTQ3ODItOWE2NS05MWVjYWMxYTQ1OTUvbG9ncy9fX2NoYW5nZV9ldmVudHM=}}
Looks fixed in 3.1.1.
There are some reports on Pulsar Slack that this issue happens when bookkeeperUseV2WireProtocol=false or TLS is enabled for Bookkeeper. @hangc0276 I wonder what the possible bug fix has been if this is already fixed?
Looks fixed in 3.1.1.
@KannarFr Have you checked with 3.0.XX version. Basically Its working, if we set bookkeeperUseV2WireProtocol=true in broker configmap.
No, I didn't but should I use v2 (legacy) wire proto? I mean this can fix the issue, but this is a jump in the past, nope? :D
Looks fixed in 3.1.1.
@KannarFr Have you checked with 3.0.XX version. Basically Its working, if we set bookkeeperUseV2WireProtocol=true in broker configmap.
@lhotari Could you please help me to understand , are there any impact If we make bookkeeperUseV2WireProtocol=true ?
Could you please help me to understand , are there any impact If we make bookkeeperUseV2WireProtocol=true
@Debashish-Mallick bookkeeperUseV2WireProtocol=true is the recommended default with Pulsar:
https://github.com/apache/pulsar/blob/9b643c899cbc99b2ead131aed45388c99615a8e7/conf/broker.conf#L973-L974
However, when using bookkeeperUseV2WireProtocol=true, I believe that TLS is not supported. See https://github.com/apache/bookkeeper/pull/2300 and https://github.com/apache/pulsar-helm-chart/blob/8cb3c18377e547e5de4ca19e63138ca1387aef59/charts/pulsar/templates/broker-configmap.yaml#L131-L147 for more information.
I am not exactly sure about the impact since I haven't tested this configuration myself. My concern is that TLS wouldn't be used when the intention is to use TLS between brokers and bookies.
However, it's possible that https://github.com/apache/bookkeeper/pull/2300 change added support for TLS for v2, but it's hard to know the full context of that change.
One possibility is that it's completely valid and recommended to use bookkeeperUseV2WireProtocol=true, also with TLS after the BK PR 2300 fix that was made a long time ago.
bookkeeperUseV2WireProtocol=false
is mandatory to have a working SQL setup
https://pulsar.apache.org/docs/3.1.x/sql-deployment-configurations/#get-the-last-message-in-a-topic
https://github.com/apache/bookkeeper/pull/4140 has figured out the issue. It's a regression from 3.0.
@hangc0276 @poorbarcode Do I understand correctly that https://github.com/apache/bookkeeper/pull/4140 will be released in bookkeeper version 4.16.4 ? What is the schedule for the Bookkeeper 4.16.4 release with the fix? I also asked at https://lists.apache.org/thread/tcy93c8vrb475j64tdotlklxblrcm728 .
Bookkeeper 4.16.4 has been upgraded in #21983 . Closing this issue since that should contain the fix.
Reopening until latest concerns about other checksum related bugs in BK are resolved.
Pending fix in Bookkeeper will be included in 4.16.5 release. https://github.com/apache/bookkeeper/pull/4196
We are using 4.16.3 and encountering this error with 1-2 events occurring randomly on a daily basis, and we have a workaround to skip the specific position and resend the data to recover it. May I know if you are aware of any other workarounds available before the 4.16.5 release?
We are using 4.16.3 and encountering this error with 1-2 events occurring randomly on a daily basis, and we have a workaround to skip the specific position and resend the data to recover it. May I know if you are aware of any other workarounds available before the 4.16.5 release?
@pqab I'm not aware of a workaround. There's also a concern that there are some other problems (#22103 with some linked BK issue) that cause similar symptoms. BK 4.16.5 has been released and there's PR to upgrade in Pulsar, #22484. This will be included in Pulsar releases 3.0.5 and 3.2.3 . ETA for the release is within a few weeks. In the meantime, it's possible to build custom builds of Pulsar.
I believe that this issue is a duplicate of #22601. I have made a fix to bookkeeper which fixes the issue: https://github.com/apache/bookkeeper/pull/4404