[Bug] parseMessageMetadata error when broker entry metadata is enabled under high load
Search before asking
- [X] I searched in the issues and found nothing similar.
Read release policy
- [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
- 3.2.2
- 3.1.2
Minimal reproduce step
Publish events at about 6k QPS and 100 Mbit/s with BatcherBuilder.KEY_BASED batching, sending from highly concurrent/parallel producer processes. It happens only with a near-real-time consumer (almost zero backlog).
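A rough sketch of that kind of producer setup, for illustration only (service URL, topic, payload, and the single loop standing in for ~6k msg/s are placeholders; the real setup runs across multiple processes):
import java.util.UUID;

import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class KeyBasedLoadProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")                    // placeholder
                .build();
        Producer<byte[]> producer = client.newProducer(Schema.BYTES)
                .topic("persistent://my-tenant/my-namespace/my-topic-1")  // placeholder
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)                 // key-based batching
                .create();

        byte[] payload = new byte[3 * 1024];                              // most payloads are a few KB
        for (int i = 0; i < 1_000_000; i++) {                             // several such loops run in parallel
            producer.newMessage()
                    .key(UUID.randomUUID().toString())                    // effectively random keys
                    .value(payload)
                    .sendAsync();
        }
        producer.flush();
        producer.close();
        client.close();
    }
}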
What did you expect to see?
No lost events.
What did you see instead?
An error is logged in the broker: Failed to peek sticky key from the message metadata
It looks like a thread-safety issue because it happens randomly: out of 1M events it only happens a few times, but the consumer loses those few events.
Anything else?
The error is similar to https://github.com/apache/pulsar/issues/10967, but I think it's a different issue.
The data in BookKeeper is correct: I can download the entry from BookKeeper and parse it successfully, or consume the same event a few minutes later without problems. But all subscriptions hit the same error on the same event when consuming in real time (zero backlog).
I have traced the source code. It happens in PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer -> Commands.peekAndCopyMessageMetadata.
I also printed the ByteBuf contents, and I can clearly see the data isn't the same as what is stored in BookKeeper.
In a normal event, the hex dump usually starts with 010e (magicCrc32c):
0000000 010e 9529 5fbc 0000 0a03 3a0a 6e69 7267
In one of our error events, the ByteBuf has about 48 bytes of strange data, then continues with normal data:
0000000 0000 a610 0000 0000 0200 7239 0000 0000 <== from here
0000020 0200 1339 0000 0000 ea17 a8b0 8b8e fa5e
0000040 2af0 2675 f645 1623 d17e dc34 526d ef44 <=== until here is garbage
0000060 010e 9529 5fbc 0000 0a03 3a0a 6e69 7267 <== from here is normal data
This is just one example; sometimes the first few bytes are correct and the corruption appears a few bytes later.
I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.
Does anyone have any idea what could be causing this issue, and any suggestions on which library or class may have potential issues? Additionally, any suggestions on how to debug this issue or if I need to print any specific information to help identify the root cause would be appreciated. Thank you.
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
@semistone Just wondering if this could be related to https://github.com/apache/bookkeeper/pull/4196? There are also other recent ByteBuf retain/release fixes such as https://github.com/apache/pulsar/pull/22393 . In BookKeeper, https://github.com/apache/bookkeeper/pull/4289 is pending release and https://github.com/apache/bookkeeper/pull/4293 is pending review.
We are still trying to compare what is different between our producer and the perf tool, and will follow up once we have a conclusion.
@lhotari We ran many tests. The current broker setting is
maxMessageSize=5242880
and the producer settings (small batch message count and large max bytes):
batchingMaxMessages: 500
batchingMaxBytes: 3145728
batchingMaxPublishDelayMicros: 500
Payload: 98% < 3 KB, 2% between 10-20 KB.
With these settings the error shows up and publish throughput isn't good.
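For reference, roughly how those settings map onto the Java producer builder (sketch only; client creation and topic name are placeholders, imports as in the earlier sketch):
// Sketch: the batching settings above expressed on a ProducerBuilder
// (client, Schema, BatcherBuilder, TimeUnit as in the earlier sketch).
Producer<byte[]> producer = client.newProducer(Schema.BYTES)
        .topic("persistent://my-tenant/my-namespace/my-topic-1")   // placeholder
        .batcherBuilder(BatcherBuilder.KEY_BASED)
        .batchingMaxMessages(500)
        .batchingMaxBytes(3145728)                                  // 3 MiB
        .batchingMaxPublishDelay(500, TimeUnit.MICROSECONDS)
        .create();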
but if we change to
batchingMaxMessages: 1000
batchingMaxBytes: 3145728
batchingMaxPublishDelayMicros: 1000
and filter out all data bigger than 15 KB, then the error disappears.
So we decided to create one batching producer for data < 15000 bytes and one chunking producer for data >= 15000 bytes. That worked, and performance is also better than in the previous test.
We still don't know why, but at least we have a workaround now; a rough sketch of the routing is below.
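A minimal sketch of that routing, assuming both producers are created elsewhere (the 15000-byte threshold is the one described above; class and method names are placeholders):
// Sketch of the workaround routing (Pulsar client Producer as in the earlier sketch).
class PayloadRouter {
    static final int THRESHOLD_BYTES = 15_000; // threshold described above

    // Small payloads go to the batching producer; large payloads go to a producer
    // created with enableBatching(false).enableChunking(true).
    static void publish(String key, byte[] payload,
                        Producer<byte[]> batchProducer, Producer<byte[]> chunkProducer) {
        Producer<byte[]> target =
                payload.length < THRESHOLD_BYTES ? batchProducer : chunkProducer;
        target.newMessage().key(key).value(payload).sendAsync();
    }
}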
I don't know which batch producer configuration could fix these errors; if you have any suggestions, we will still try them.
We also publish from multi-threaded programs. It seems it's not directly related to load but to payload size, though if the publish rate is low it's probably harder to reproduce.
We also tried to reproduce it with the perf tool, but it didn't always happen.
thanks
I tried upgrading to BookKeeper 4.17.0 but still have the same issue :(
[pulsar@cockroach308 lib]$ ls |grep bookkeeper
org.apache.bookkeeper-bookkeeper-benchmark-4.17.0.jar
org.apache.bookkeeper-bookkeeper-common-4.17.0.jar
org.apache.bookkeeper-bookkeeper-common-allocator-4.17.0.jar
org.apache.bookkeeper-bookkeeper-perf-4.17.0.jar
org.apache.bookkeeper-bookkeeper-proto-4.17.0.jar
org.apache.bookkeeper-bookkeeper-server-4.17.0.jar
org.apache.bookkeeper-bookkeeper-slogger-api-4.17.0.jar
org.apache.bookkeeper-bookkeeper-slogger-slf4j-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-framework-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-ledger-4.17.0.jar
org.apache.bookkeeper-circe-checksum-4.17.0.jar
org.apache.bookkeeper-cpu-affinity-4.17.0.jar
org.apache.bookkeeper.http-http-server-4.17.0.jar
org.apache.bookkeeper.http-vertx-http-server-4.17.0.jar
org.apache.bookkeeper-native-io-4.17.0.jar
org.apache.bookkeeper-statelib-4.17.0.jar
org.apache.bookkeeper.stats-bookkeeper-stats-api-4.17.0.jar
org.apache.bookkeeper.stats-codahale-metrics-provider-4.17.0.jar
org.apache.bookkeeper.stats-otel-metrics-provider-4.17.0.jar
org.apache.bookkeeper.stats-prometheus-metrics-provider-4.17.0.jar
org.apache.bookkeeper-stream-storage-cli-4.17.0.jar
org.apache.bookkeeper-stream-storage-java-client-4.17.0.jar
org.apache.bookkeeper-stream-storage-server-4.17.0.jar
org.apache.bookkeeper-stream-storage-service-api-4.17.0.jar
org.apache.bookkeeper-stream-storage-service-impl-4.17.0.jar
org.apache.bookkeeper.tests-stream-storage-tests-common-4.17.0.jar
org.apache.pulsar-pulsar-package-bookkeeper-storage-3.2.2.jar
> We also tried to reproduce it with the perf tool, but it didn't always happen.
@semistone Please share a way how to reproduce it. It's not a problem if it's not always consistent. Fixing this issue will be a lot easier if there's at least some way to reproduce.
> I tried upgrading to BookKeeper 4.17.0 but still have the same issue :(
@semistone Thanks for testing this.
> @semistone Please share a way how to reproduce it. It's not a problem if it's not always consistent. Fixing this issue will be a lot easier if there's at least some way to reproduce.
I will try to reproduce it with the perf tool.
@semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with dispatcherDispatchMessagesInSubscriptionThread=false?
https://github.com/apache/pulsar/blob/80d46758e89b088688d521aa8ae401bfb00c98b2/conf/broker.conf#L435-L436
It impacts this code: https://github.com/apache/pulsar/blob/188355b2df08cafd9402e75baf1164ba4b44a052/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java#L659-L681
I could almost reproduce it with the perf tool when a small fraction of payloads are > 30 KB and the rest are 3 KB. The error happens when messageKeyGenerationMode=random; without messageKeyGenerationMode the error disappears.
I guess that in batch mode the payload size has some restriction.
let me confirm again tomorrow to make sure I didn't make any stupid mistake during my test.
Hi @lhotari, I updated the perf tool in https://github.com/semistone/pulsar/tree/debug_ssues_22601
It only includes one commit, which modifies PerformanceProducer.java to add big payloads (-bp 5 means 5 percent big payloads) and BatcherBuilder.KEY_BASED (-kb).
My consumer command is:
bin/pulsar-perf consume persistent://my-tenant/my-namespace/my-topic-1 --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/superuser.cer","tlsKeyFile":"conf/superuser.key.pem"}' -n 10 -sp Latest -ss angus_test --batch-index-ack -st Key_Shared
and my producer command is:
bin/pulsar-perf produce persistent://my-tenant/my-namespace/my-topic-1 -r 6000 -kb -s 2000 -bp 5 -bm 1000 -b 1 -mk random --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/superuser.cer","tlsKeyFile":"conf/superuser.key.pem"}'
The error happens when the batch builder is KEY_BASED with random event keys and a few big payloads (in my environment, 3% big payloads could reproduce it and 10% will crash the producer).
In my test I use a normal payload of 2 KB and a big payload of 20 KB. If I remove any of the above conditions, the error is either reduced or disappears.
When it happens there is a WARN message in pulsar-broker.log:
2024-05-09T01:12:35,246+0000 [pulsar-io-3-31] WARN org.apache.pulsar.broker.service.ServerCnx - [/100.96.184.253:39710] Got exception java.lang.IllegalArgumentException: Invalid unknonwn tag type: 6
or
2024-05-09T01:12:35,260+0000 [broker-topic-workers-OrderedExecutor-15-0] ERROR org.apache.pulsar.common.protocol.Commands - [persistent://budas/budas-preprod-internal/bud_stream_input-partition-1] [angus_test] Failed to peek sticky key from the message metadata
java.lang.IllegalArgumentException: Invalid unknonwn tag type: 4
Unfortunately I can't reproduce it in Docker; I guess the Docker standalone is different from my Pulsar cluster. My Pulsar cluster has an almost default config, but with TLS auth for broker/bookkeeper/zookeeper.
Please help check it. If you have any problems reproducing this issue in your environment, I will try to simplify my Pulsar cluster to reproduce it.
Thanks
> @semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with dispatcherDispatchMessagesInSubscriptionThread=false?
I tested, still the same
> I tested, still the same
@semistone Thanks for testing. That tells that it's not related to switching the thread in https://github.com/apache/pulsar/blob/188355b2df08cafd9402e75baf1164ba4b44a052/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java#L659-L681 .
@lhotari I am checking when that ByteBuf goes wrong. In OpAddEntry.java I verify the data when the object is constructed and save the original data, and during run() I compare and print the ByteBuf:
if (ml.hasActiveCursors()) {
    // Avoid caching entries if no cursor has been created
    EntryImpl entry = EntryImpl.create(ledgerId, entryId, data);
    // ======= debug: log when the buffer no longer starts with a valid checksum.
    // dataString and orgData were captured when this OpAddEntry was constructed.
    if (!Commands.hasChecksum(data)) {
        log.warn("no checksum in OpAddEntry1 origin {} data {}, now {} data {}",
                dataString, new String(Base64.encodeBase64(orgData)),
                data.toString(), new String(Base64.encodeBase64(entry.getData())));
    }
It shows:
2024-05-17T07:29:16,476+0000 [BookKeeperClientWorker-OrderedExecutor-12-0] WARN org.apache.bookkeeper.mledger.impl.OpAddEntry - no checksum in OpAddEntry1
origin PooledSlicedByteBuf(ridx: 13, widx: 2066, cap: 2066/2066, unwrapped: PooledUnsafeDirectByteBuf(ridx: 24211, widx: 24211, cap: 32768))
data DgEJYVBqAAAAKwoMaHlicmlkLTQ0MC0wEI8cGPTz0av4MTILLTEyMDM1NzAyOTNI0A+IAQAxMjM0NTY3ODkxMDExMTIxMzE0MTUxNjE3MTgxOTIwMjEyMjIzMjQyNTI2MjcyODI5MzAzMTMyMzMzNDM1MzYzNzM4Mzk0MDQxNDI0MzQ0NDU0NjQ3NDg0OTUwNTE1MjUzNTQ1NTU2NTc1ODU5NjA2MTYyNjM2NDY1NjY2NzY4Njk3MDcxNzI3Mzc0NzU3Njc3Nzg3OTgwO... skip
,now PooledSlicedByteBuf(ridx: 13, widx: 2066, cap: 2066/2066, unwrapped: PooledUnsafeDirectByteBuf(ridx: 24211, widx: 24211, cap: 32768))
data MzY5NDY5NTY5NjY5NzY5ODY5OTcwMDcwMTcwMjcw0av4MTILLTEyMDM1NzAyOTNI0A+IAQAxMjM0NTY3ODkxMDExMTIxMzE0MTUxNjE3MTgxOTIwMjEyMjIzMjQyNTI2MjcyODI5MzAzMTMyMzMzNDM1MzYzNzM4Mzk0MDQxNDI0MzQ0NDU0NjQ3NDg0OTUwNTE1MjUzNTQ1NTU2NTc1ODU5NjA2MTYyNjM2NDY1NjY2NzY4Njk3MDcxNzI3Mzc0NzU3Njc3Nzg3OT
... skip
The ByteBuf object hasn't changed, but the data inside it has. It seems like the first few bytes (about 20) get overwritten while the rest of the data is still OK.
It's a PooledSlicedByteBuf wrapping a PooledUnsafeDirectByteBuf.
Do you have any idea how to find out who changed the data inside the ByteBuf?
I also tested again: if the published payload is always 20 KB it doesn't happen; it only happens when the normal payload is 2 KB but some data is bigger than 16 KB (which sounds like the Netty receive buffer size, but I also tried increasing that) at about 1000 QPS.
I also checked the data again; the wrong data looks exactly as if it was overwritten by the tail of the previous payload.
And it's not related to batch mode; it also happened with batch mode disabled.
> Do you have any idea how to find out who changed the data inside the ByteBuf?
There's some thought about this in #22110. That's a WIP PR and it might not make sense in the end. Since then, I have found a pattern used in Netty to detect issues. I'll look that up.
> Unfortunately I can't reproduce it in Docker; I guess the Docker standalone is different from my Pulsar cluster. My Pulsar cluster has an almost default config, but with TLS auth for broker/bookkeeper/zookeeper.
@semistone do you also use Pulsar Proxy?
> @semistone do you also use Pulsar Proxy?
@lhotari no, I didn't
> The error happens when messageKeyGenerationMode=random; without messageKeyGenerationMode the error disappears.
This is a useful detail. When messageKeyGenerationMode is random and BatcherBuilder.KEY_BASED is used, each batch will be of size 1. This could hint that the problem is related to the https://github.com/apache/pulsar/pull/16605 changes, main difference here: https://github.com/apache/pulsar/blob/6b2938223cf45a9298f9d40ab6ae108bea9a5a6d/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BatchMessageContainerImpl.java#L252-L286 .
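To illustrate the point (this is a conceptual sketch, not the actual BatchMessageContainerImpl logic): key-based batching groups pending messages by key, so unique random keys leave every group with exactly one message.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: grouping by key means that with unique random keys
// every "batch" ends up containing a single message.
class KeyGroupingSketch {
    record Pending(String key, byte[] payload) {}

    static Map<String, List<Pending>> groupByKey(List<Pending> pending) {
        Map<String, List<Pending>> batches = new HashMap<>();
        for (Pending p : pending) {
            batches.computeIfAbsent(p.key(), k -> new ArrayList<>()).add(p);
        }
        return batches; // with random UUID keys, each inner list has size 1
    }
}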
@semistone would it be possible to share your broker.conf customizations? That could help reproduce the issue. I noticed --batch-index-ack in the pulsar-perf command line. I assume that you have at least acknowledgmentAtBatchIndexLevelEnabled=true in broker.conf?
@lhotari
> acknowledgmentAtBatchIndexLevelEnabled
Yes, I enabled it, and I disabled batch mode (-bd) in the producer after finding that the issue doesn't seem related to batch mode.
After debugging, I found that if I add this in OpAddEntry.java:
op.data = data.asReadOnly(); // make it read-only
then the issue seems to disappear, but I'm not sure whether there are any side effects. I also don't know who could be touching that ByteBuf; OpAddEntry has getData and setData methods, but I don't see anyone calling them.
Here is our broker.conf with all passwords removed: broker.conf.zip
I also added debugging in PulsarDecoder.channelRead to print the ByteBuf object id and compare it with the ByteBuf in OpAddEntry.
I don't see the same ByteBuf object being reused between OpAddEntry.createNoRetainBuffer and OpAddEntry.run.
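Roughly what that debug logging looks like, as a sketch (the real patch may differ; logging the identity hash is just one way to track a specific buffer instance):
// Sketch: inside PulsarDecoder.channelRead, log an identity for the incoming buffer
// so it can later be correlated with the buffer seen in OpAddEntry.
ByteBuf buffer = (ByteBuf) msg;
log.info("channelRead buf identity={} refCnt={}",
        System.identityHashCode(buffer), buffer.refCnt());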
> then the issue seems to disappear, but I'm not sure whether there are any side effects. I also don't know who could be touching that ByteBuf; OpAddEntry has getData and setData methods, but I don't see anyone calling them.
Interesting detail. Does the problem also go away with op.data = data.duplicate(); ?
> data.duplicate();
I tested it and it also seems to work.
I will repeat the success/failure tests later to confirm it.
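For context, both asReadOnly() and duplicate() create a derived buffer over the same memory with their own reader/writer indices; the read-only view additionally rejects writes made through it. A small standalone Netty sketch (not Pulsar code) of that behaviour:
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class DerivedBufferDemo {
    public static void main(String[] args) {
        ByteBuf original = Unpooled.buffer(16).writeBytes(new byte[]{1, 2, 3, 4});

        ByteBuf dup = original.duplicate();   // shared memory, independent indices
        ByteBuf ro = original.asReadOnly();   // shared memory, writes through this view fail

        // A write through the original is visible through both derived views,
        // so neither call by itself protects the bytes from another writer.
        original.setByte(0, 9);
        System.out.println(dup.getByte(0));   // 9
        System.out.println(ro.getByte(0));    // 9

        // A write attempt through the read-only view fails fast instead:
        try {
            ro.setByte(0, 7);
        } catch (Exception e) {
            System.out.println(e.getClass().getSimpleName()); // ReadOnlyBufferException
        }
        original.release();
    }
}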
I've been trying to reproduce the issue on a local microk8s cluster by deploying Pulsar with the Apache Pulsar Helm chart using this values file: https://github.com/lhotari/pulsar-playground/blob/master/test-env/issue22601.yaml and these scripts: https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601 . I haven't been able to reproduce it yet.
How many brokers and bookies do you have in the cluster where it reproduces?
I have 6 bookies in 3 different data centers, and I left only one broker running for debugging, but I had tested with only 1 bookie before.
It runs on physical servers with CentOS Linux release 7.9.2009.
> I've been trying to reproduce the issue on a local microk8s cluster by deploying Pulsar with the Apache Pulsar Helm chart using this values file: https://github.com/lhotari/pulsar-playground/blob/master/test-env/issue22601.yaml and these scripts: https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601 . I haven't been able to reproduce it yet.
I added the big payload option; you need to use -bp 2 to reproduce it.
> -bp 2 to reproduce it.
I was using -bp 5 before, updated that to -bp 2. https://github.com/lhotari/pulsar-playground/commit/63035e9c4ebf656efe12bfcea859743e8ffb8a8c
What gets logged when the issue reproduces?
do you happen to run with debug logging level when the issue reproduces? (just wondering if debug logging code like https://github.com/apache/pulsar/blob/a66ff17b31a01fd9ab151188e9afc9d2de8c141f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1799-L1801 has side effects, as it seems to have in https://github.com/apache/pulsar/blob/a66ff17b31a01fd9ab151188e9afc9d2de8c141f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1859-L1872)
> I was using -bp 5 before, updated that to -bp 2. lhotari/pulsar-playground@63035e9
> What gets logged when the issue reproduces?
2024-05-21T08:31:49,202+0000 [broker-topic-workers-OrderedExecutor-5-0] ERROR org.apache.pulsar.common.protocol.Commands - [persistent://my-tenant/my-namespace/my-topic-1] [angus_test] Failed to peek sticky key from the message metadata
java.lang.IllegalStateException: Some required fields are missing