
[Bug] parseMessageMetadata error when broker entry metadata is enabled under high load

Open semistone opened this issue 1 year ago • 49 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

  • 3.2.2
  • 3.1.2

Minimal reproduce step

Publish events at about 6k QPS and 100 Mbit/s with broker entry metadata enabled and the producer in BatcherBuilder.KEY_BASED mode, sending messages from highly concurrent/parallel producer processes. It happens only with a near-real-time consumer (almost zero backlog).
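
For illustration, a rough sketch of a producer publishing in this pattern (service URL, topic, rates and payload sizes are placeholders; this is not our exact producer code):

import org.apache.pulsar.client.api.*;

import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

public class ReproProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder service URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://my-tenant/my-namespace/my-topic-1")
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED) // key-based batching, as described above
                .create();

        byte[] smallPayload = new byte[2 * 1024];  // most messages are small
        byte[] bigPayload = new byte[20 * 1024];   // occasional large message

        // Random keys plus asynchronous sends approximate the concurrent publish pattern.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] payload = ThreadLocalRandom.current().nextInt(100) < 2 ? bigPayload : smallPayload;
            producer.newMessage()
                    .key(UUID.randomUUID().toString())
                    .value(payload)
                    .sendAsync();
        }
        producer.flush();
        client.close();
    }
}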

What did you expect to see?

No lost events.

What did you see instead?

Error logs in the broker showing Failed to peek sticky key from the message metadata.

It looks like a thread-safety issue because it happens randomly. In 1M events it only happens a few times, but the consumer loses a few events.

Anything else?

The error is similar to https://github.com/apache/pulsar/issues/10967, but I think it's a different issue.

The data in BookKeeper is correct. I can download the event from BookKeeper and parse it successfully, or consume the same event a few minutes later and it is consumed successfully. But all subscriptions get the same error on the same event when consuming in real time (zero backlog).

I have traced the source code. It happens in PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer -> Commands.peekAndCopyMessageMetadata.

I also printed the ByteBuf contents, and I can clearly see the data is not the same as what is stored in BookKeeper.

In a normal event, the hex dump usually starts with 010e (magicCrc32c):

0000000      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267

In one of our error events, the ByteBuf has about 48 bytes of strange data, then continues with the normal data:

0000000      0000    a610    0000    0000    0200    7239    0000    0000 <== from here
0000020      0200    1339    0000    0000    ea17    a8b0    8b8e    fa5e
0000040      2af0    2675    f645    1623    d17e    dc34    526d    ef44 <=== until here is garbage
0000060      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267 <== from here the data is normal again

This is just one example; sometimes the first few bytes are correct and the corruption appears a few bytes later.
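
As a quick sanity check when reading dumps like these: a healthy entry should start with the 0x0e01 checksum magic at the current reader index (it shows up as 010e above because od prints 16-bit words in little-endian byte order). A minimal helper along those lines (the class name is made up; it mirrors the check that Commands.hasChecksum performs, if I read that code correctly):

import io.netty.buffer.ByteBuf;

final class EntryHexCheck {
    // Magic written in front of checksummed entries (same value Pulsar's Commands uses).
    private static final short MAGIC_CRC32C = 0x0e01;

    // True if the buffer, positioned at the start of an entry, begins with the checksum magic.
    static boolean startsWithChecksumMagic(ByteBuf entry) {
        return entry.readableBytes() >= 2
                && entry.getShort(entry.readerIndex()) == MAGIC_CRC32C;
    }
}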

I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.

Does anyone have any idea what could be causing this issue, and any suggestions on which library or class may have potential issues? Additionally, any suggestions on how to debug this issue or if I need to print any specific information to help identify the root cause would be appreciated. Thank you.

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

semistone avatar Apr 26 '24 09:04 semistone

@semistone Just wondering if this could be related to https://github.com/apache/bookkeeper/pull/4196? There are also other recent ByteBuf retain/release fixes, such as https://github.com/apache/pulsar/pull/22393 . In BookKeeper, https://github.com/apache/bookkeeper/pull/4289 is pending release and https://github.com/apache/bookkeeper/pull/4293 is pending review.

lhotari avatar Apr 26 '24 11:04 lhotari

We are still trying to compare what is different between our producer and the perf tool; we will give feedback once we have a conclusion.

semistone avatar May 01 '24 05:05 semistone

@lhotari we did many tests. The current broker setting is

maxMessageSize=5242880

and the producer settings (small batch message count and big max bytes) are

batchingMaxMessages: 500
batchingMaxBytes: 3145728
batchingMaxPublishDelayMicros: 500

Payload: 98% < 3K bytes, 2% between 10-20K bytes.

With these settings the error shows up and publish throughput isn't good.

But if we change to

batchingMaxMessages: 1000
batchingMaxBytes: 3145728
batchingMaxPublishDelayMicros: 1000

and filter out all data bigger than 15K bytes, then the error disappears.

So we decided to create one batch producer that publishes data < 15000 bytes and one chunking producer that publishes data >= 15000 bytes. That works, and performance is also better than in the previous test.
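
Roughly, the split looks like this (a sketch with placeholder names, not our exact production code; note that the client requires batching to be disabled on the chunking producer):

import org.apache.pulsar.client.api.*;

import java.util.concurrent.TimeUnit;

public class SplitProducers {
    private static final int THRESHOLD = 15_000;

    private final Producer<byte[]> batchProducer;
    private final Producer<byte[]> chunkProducer;

    SplitProducers(PulsarClient client, String topic) throws PulsarClientException {
        // Small payloads: key-based batching with the settings that worked for us.
        batchProducer = client.newProducer()
                .topic(topic)
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .batchingMaxMessages(1000)
                .batchingMaxBytes(3145728)
                .batchingMaxPublishDelay(1000, TimeUnit.MICROSECONDS)
                .create();

        // Large payloads: no batching, chunking enabled instead.
        chunkProducer = client.newProducer()
                .topic(topic)
                .enableBatching(false)
                .enableChunking(true)
                .create();
    }

    void send(String key, byte[] payload) {
        Producer<byte[]> p = payload.length < THRESHOLD ? batchProducer : chunkProducer;
        p.newMessage().key(key).value(payload).sendAsync();
    }
}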

We still don't know why, but at least we have a workaround now.

I don't know which batch producer configuration could fix these errors. If you have any suggestions, we will try them.

We also publish from multi-threaded programs. It seems the issue is not directly related to load but to payload size, though maybe a low publish rate just makes it harder to reproduce.

We also tried to reproduce it with the perf tool, but it didn't always happen.

thanks

semistone avatar May 02 '24 07:05 semistone

I tried to upgrade to bookkeeper 4.17.0 but still have the same issue :(

[pulsar@cockroach308 lib]$ ls |grep bookkeeper
org.apache.bookkeeper-bookkeeper-benchmark-4.17.0.jar
org.apache.bookkeeper-bookkeeper-common-4.17.0.jar
org.apache.bookkeeper-bookkeeper-common-allocator-4.17.0.jar
org.apache.bookkeeper-bookkeeper-perf-4.17.0.jar
org.apache.bookkeeper-bookkeeper-proto-4.17.0.jar
org.apache.bookkeeper-bookkeeper-server-4.17.0.jar
org.apache.bookkeeper-bookkeeper-slogger-api-4.17.0.jar
org.apache.bookkeeper-bookkeeper-slogger-slf4j-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-framework-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-ledger-4.17.0.jar
org.apache.bookkeeper-circe-checksum-4.17.0.jar
org.apache.bookkeeper-cpu-affinity-4.17.0.jar
org.apache.bookkeeper.http-http-server-4.17.0.jar
org.apache.bookkeeper.http-vertx-http-server-4.17.0.jar
org.apache.bookkeeper-native-io-4.17.0.jar
org.apache.bookkeeper-statelib-4.17.0.jar
org.apache.bookkeeper.stats-bookkeeper-stats-api-4.17.0.jar
org.apache.bookkeeper.stats-codahale-metrics-provider-4.17.0.jar
org.apache.bookkeeper.stats-otel-metrics-provider-4.17.0.jar
org.apache.bookkeeper.stats-prometheus-metrics-provider-4.17.0.jar
org.apache.bookkeeper-stream-storage-cli-4.17.0.jar
org.apache.bookkeeper-stream-storage-java-client-4.17.0.jar
org.apache.bookkeeper-stream-storage-server-4.17.0.jar
org.apache.bookkeeper-stream-storage-service-api-4.17.0.jar
org.apache.bookkeeper-stream-storage-service-impl-4.17.0.jar
org.apache.bookkeeper.tests-stream-storage-tests-common-4.17.0.jar
org.apache.pulsar-pulsar-package-bookkeeper-storage-3.2.2.jar

semistone avatar May 08 '24 06:05 semistone

We also tried to reproduce it with the perf tool, but it didn't always happen.

@semistone Please share a way to reproduce it. It's not a problem if it's not always consistent. Fixing this issue will be a lot easier if there's at least some way to reproduce it.

lhotari avatar May 08 '24 06:05 lhotari

I tried to upgrade to bookkeeper 4.17.0 but still have the same issue :(

@semistone Thanks for testing this.

lhotari avatar May 08 '24 06:05 lhotari

We also tried to reproduce it with the perf tool, but it didn't always happen.

@semistone Please share a way to reproduce it. It's not a problem if it's not always consistent. Fixing this issue will be a lot easier if there's at least some way to reproduce it.

I will try to reproduce it with the perf tool.

semistone avatar May 08 '24 07:05 semistone

@semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with dispatcherDispatchMessagesInSubscriptionThread=false? https://github.com/apache/pulsar/blob/80d46758e89b088688d521aa8ae401bfb00c98b2/conf/broker.conf#L435-L436

It impacts this code: https://github.com/apache/pulsar/blob/188355b2df08cafd9402e75baf1164ba4b44a052/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java#L659-L681

lhotari avatar May 08 '24 08:05 lhotari

I could almost reproduce it with the perf tool when a very small fraction of payloads are > 30K bytes and the rest are 3K bytes. The error happens when messageKeyGenerationMode=random; without messageKeyGenerationMode, the error disappears.

I guess the payload size has some restriction in batch mode.

Let me confirm again tomorrow to make sure I didn't make any stupid mistake during my test.

semistone avatar May 08 '24 13:05 semistone

Hi @lhotari, I updated the perf tool in https://github.com/semistone/pulsar/tree/debug_ssues_22601

It only includes one commit, which modifies PerformanceProducer.java to support big payloads (-bp 5 means 5 percent big payloads) and BatcherBuilder.KEY_BASED (-kb).

My consumer command is

bin/pulsar-perf  consume persistent://my-tenant/my-namespace/my-topic-1   --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/superuser.cer","tlsKeyFile":"conf/superuser.key.pem"}' -n 10 -sp Latest -ss angus_test --batch-index-ack   -st Key_Shared 

and my producer command is

bin/pulsar-perf produce persistent://my-tenant/my-namespace/my-topic-1 -r 6000 -kb -s 2000 -bp 5  -bm 1000  -b 1 -mk random  --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/superuser.cer","tlsKeyFile":"conf/superuser.key.pem"}' 

The error happens when:

the batch builder is KEY_BASED with random event keys and a small fraction of big payloads (in my environment, 3% big payloads reproduces it; 10% crashes the producer).

In my test I used 2K-byte normal payloads and 20K-byte big payloads. If I remove any of the above conditions, the error is either reduced or disappears.

When it happens, messages like these appear in pulsar-broker.log:

2024-05-09T01:12:35,246+0000 [pulsar-io-3-31] WARN  org.apache.pulsar.broker.service.ServerCnx - [/100.96.184.253:39710] Got exception java.lang.IllegalArgumentException: Invalid unknonwn tag type: 6
or 
2024-05-09T01:12:35,260+0000 [broker-topic-workers-OrderedExecutor-15-0] ERROR org.apache.pulsar.common.protocol.Commands - [persistent://budas/budas-preprod-internal/bud_stream_input-partition-1] [angus_test] Failed to peek sticky key from the message metadata
java.lang.IllegalArgumentException: Invalid unknonwn tag type: 4

Unfortunately I can't reproduce it in Docker; I guess Docker standalone is different from my Pulsar cluster. My Pulsar cluster has an almost default config, but with TLS auth for broker/bookkeeper/zookeeper.

Please help check it. If you have any problem reproducing this issue in your environment, I will try to simplify my Pulsar cluster to reproduce it.

Thanks

semistone avatar May 09 '24 05:05 semistone

@semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with dispatcherDispatchMessagesInSubscriptionThread=false?

I tested, still the same

semistone avatar May 09 '24 08:05 semistone

@semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with dispatcherDispatchMessagesInSubscriptionThread=false?

I tested, still the same

@semistone Thanks for testing. That tells that it's not related to switching the thread in https://github.com/apache/pulsar/blob/188355b2df08cafd9402e75baf1164ba4b44a052/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java#L659-L681 .

lhotari avatar May 09 '24 08:05 lhotari

@lhotari I am checking when the ByteBuf goes wrong. In OpAddEntry.java I verify the data when constructing the object and save a copy of the original data,

and during run() I compare and print the ByteBuf:

        if (ml.hasActiveCursors()) {
            // Avoid caching entries if no cursor has been created
            EntryImpl entry = EntryImpl.create(ledgerId, entryId, data);
            // ======= log when the data has gone wrong (orgData holds a copy taken at construction time)
            if (!Commands.hasChecksum(data)) {
                log.warn("no checksum in OpAddEntry1 origin {} data  {}, now {} data {}", dataString,
                        new String(Base64.encodeBase64(orgData)),
                        data.toString(), new String(Base64.encodeBase64(entry.getData())));
            }
            // ... rest of run() unchanged

it show

2024-05-17T07:29:16,476+0000 [BookKeeperClientWorker-OrderedExecutor-12-0] WARN  org.apache.bookkeeper.mledger.impl.OpAddEntry - no checksum in OpAddEntry1 
origin PooledSlicedByteBuf(ridx: 13, widx: 2066, cap: 2066/2066, unwrapped: PooledUnsafeDirectByteBuf(ridx: 24211, widx: 24211, cap: 32768)) 

data  DgEJYVBqAAAAKwoMaHlicmlkLTQ0MC0wEI8cGPTz0av4MTILLTEyMDM1NzAyOTNI0A+IAQAxMjM0NTY3ODkxMDExMTIxMzE0MTUxNjE3MTgxOTIwMjEyMjIzMjQyNTI2MjcyODI5MzAzMTMyMzMzNDM1MzYzNzM4Mzk0MDQxNDI0MzQ0NDU0NjQ3NDg0OTUwNTE1MjUzNTQ1NTU2NTc1ODU5NjA2MTYyNjM2NDY1NjY2NzY4Njk3MDcxNzI3Mzc0NzU3Njc3Nzg3OTgwO... skip

,now PooledSlicedByteBuf(ridx: 13, widx: 2066, cap: 2066/2066, unwrapped: PooledUnsafeDirectByteBuf(ridx: 24211, widx: 24211, cap: 32768)) 

data MzY5NDY5NTY5NjY5NzY5ODY5OTcwMDcwMTcwMjcw0av4MTILLTEyMDM1NzAyOTNI0A+IAQAxMjM0NTY3ODkxMDExMTIxMzE0MTUxNjE3MTgxOTIwMjEyMjIzMjQyNTI2MjcyODI5MzAzMTMyMzMzNDM1MzYzNzM4Mzk0MDQxNDI0MzQ0NDU0NjQ3NDg0OTUwNTE1MjUzNTQ1NTU2NTc1ODU5NjA2MTYyNjM2NDY1NjY2NzY4Njk3MDcxNzI3Mzc0NzU3Njc3Nzg3OT
... skip 

The ByteBuf object hasn't changed, but the data inside it has changed. It seems like the first ~20 bytes get overwritten while the rest of the data is still OK.

The buffers involved are PooledSlicedByteBuf and PooledUnsafeDirectByteBuf.

Do you have any idea how to find out who changes the data inside the ByteBuf?
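
(What I'm doing is essentially snapshotting a prefix of the buffer when the OpAddEntry is constructed and re-checking it later; a cleaned-up sketch of that idea is below, with made-up names. It only narrows down when the memory gets overwritten, not who writes to it. Running with Netty's leak detector, e.g. -Dio.netty.leakDetection.level=paranoid, might also help surface unbalanced retain/release calls, which often underlie this kind of corruption.)

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;

import java.util.Arrays;

// Debug-only helper: copy a prefix of the buffer content at construction time,
// then verify it again just before the buffer is used on the read/dispatch path.
final class BufSnapshot {
    private final int readerIndex;
    private final byte[] prefix;

    BufSnapshot(ByteBuf buf, int prefixLen) {
        this.readerIndex = buf.readerIndex();
        this.prefix = ByteBufUtil.getBytes(buf, readerIndex, Math.min(prefixLen, buf.readableBytes()));
    }

    // Returns false if the first bytes of the buffer no longer match the snapshot.
    boolean stillMatches(ByteBuf buf) {
        if (buf.capacity() < readerIndex + prefix.length) {
            return false; // buffer shrank or was reused with a different layout
        }
        byte[] now = ByteBufUtil.getBytes(buf, readerIndex, prefix.length);
        return Arrays.equals(prefix, now);
    }
}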

semistone avatar May 17 '24 07:05 semistone

I also tested again: if the published payload is always 20K, it doesn't happen. It only happens when the normal payload is 2K but some data is bigger than 16K (which sounds like the Netty receive buffer size, but I also tried increasing that), at about 1000 QPS.

I also checked the data again; the wrong data looks exactly as if it had been overwritten by the tail of the previous payload.

And it's not related to batch mode; it also happened with batch mode disabled.

semistone avatar May 17 '24 08:05 semistone

Do you have any idea how to find out who changes the data inside the ByteBuf?

There's some thought about this in #22110. That's a WIP PR and it might not make sense in the end. Since then, I have found a pattern used in Netty to detect issues. I'll look that up.

lhotari avatar May 17 '24 09:05 lhotari

Unfortunately I can't reproduce it in Docker; I guess Docker standalone is different from my Pulsar cluster. My Pulsar cluster has an almost default config, but with TLS auth for broker/bookkeeper/zookeeper.

@semistone do you also use Pulsar Proxy?

lhotari avatar May 18 '24 11:05 lhotari

Unfortunately I can't reproduce it in Docker; I guess Docker standalone is different from my Pulsar cluster. My Pulsar cluster has an almost default config, but with TLS auth for broker/bookkeeper/zookeeper.

@semistone do you also use Pulsar Proxy?

@lhotari no, I didn't

semistone avatar May 18 '24 11:05 semistone

The error happens when messageKeyGenerationMode=random; without messageKeyGenerationMode, the error disappears.

This is a useful detail. When messageKeyGenerationMode is random and BatcherBuilder.KEY_BASED is used, each batch will be of size 1. This could hint that the problem is related to the https://github.com/apache/pulsar/pull/16605 changes, main difference here: https://github.com/apache/pulsar/blob/6b2938223cf45a9298f9d40ab6ae108bea9a5a6d/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BatchMessageContainerImpl.java#L252-L286 .

lhotari avatar May 19 '24 13:05 lhotari

@semistone would it be possible to share your broker.conf customizations? That could help reproduce the issue. I noticed --batch-index-ack in the pulsar-perf command line. I assume that you have at least acknowledgmentAtBatchIndexLevelEnabled=true in broker.conf?

lhotari avatar May 21 '24 06:05 lhotari

@lhotari

acknowledgmentAtBatchIndexLevelEnabled

Yes, I enabled it, and I disabled batch mode (-bd) in the producer after I found it doesn't seem related to batch mode.

And after debugging, I found that if I add this in OpAddEntry.java:



        op.data = data.asReadOnly(); // << make it read-only

then that issue seems to disappear, but I'm not sure whether there are any side effects. And I don't know who could touch that ByteBuf; it has OpAddEntry.getData and OpAddEntry.setData methods, but I don't see anyone touching it.

Here is our broker.conf with all passwords removed: broker.conf.zip

semistone avatar May 21 '24 07:05 semistone

I also added debugging in PulsarDecoder.channelRead to print the ByteBuf object ID and compare it with the ByteBuf in OpAddEntry.

I don't see the same ByteBuf object being reused between OpAddEntry.createNoRetainBuffer and OpAddEntry.run.
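
For what it's worth, comparing only the wrapper object's identity can be misleading: two different ByteBuf wrappers may still be backed by the same pooled memory region. A rough sketch of the kind of fingerprint that could be logged in both places (bufFingerprint is just a made-up helper name):

import io.netty.buffer.ByteBuf;

final class BufDebug {
    // Identity of the wrapper, identity of the innermost buffer, and the native
    // memory address (when available) - the last one shows whether the same
    // pooled memory region is involved even if the wrapper objects differ.
    static String bufFingerprint(ByteBuf buf) {
        ByteBuf unwrapped = buf;
        while (unwrapped.unwrap() != null) {
            unwrapped = unwrapped.unwrap();
        }
        long address = unwrapped.hasMemoryAddress() ? unwrapped.memoryAddress() : -1L;
        return "wrapper=" + System.identityHashCode(buf)
                + " unwrapped=" + System.identityHashCode(unwrapped)
                + " memoryAddress=" + address
                + " ridx=" + buf.readerIndex() + " widx=" + buf.writerIndex();
    }
}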

semistone avatar May 21 '24 07:05 semistone

then that issue seems to disappear, but I'm not sure whether there are any side effects. And I don't know who could touch that ByteBuf; it has OpAddEntry.getData and OpAddEntry.setData methods, but I don't see anyone touching it.

Interesting detail. Does the problem also go away with op.data = data.duplicate(); ?
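
For context, a quick standalone illustration of the difference between the two calls (assuming Netty 4.x; the class name is made up): duplicate() shares the underlying memory but gets independent reader/writer indexes, while asReadOnly() additionally rejects writes made through that view (the memory can still be changed through other references).

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class DuplicateVsReadOnly {
    public static void main(String[] args) {
        ByteBuf original = Unpooled.buffer(16).writeBytes(new byte[]{1, 2, 3, 4});

        // duplicate(): same underlying memory, independent indexes.
        ByteBuf dup = original.duplicate();
        dup.setByte(0, 42);
        System.out.println(original.getByte(0)); // prints 42 - the write is visible through 'original'

        // asReadOnly(): same memory, but writes through this view fail fast.
        ByteBuf ro = original.asReadOnly();
        try {
            ro.setByte(0, 7);
        } catch (UnsupportedOperationException e) {
            // Netty rejects the write (ReadOnlyBufferException is a subclass of this).
            System.out.println("write rejected: " + e);
        }
    }
}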

lhotari avatar May 21 '24 07:05 lhotari

data.duplicate();

I tested it and it also seems to work.

I will repeat the success/failure tests later to confirm it.

semistone avatar May 21 '24 07:05 semistone

I've been trying to reproduce the issue with local microk8s cluster by deploying Pulsar with Apache Pulsar Helm chart using this values file: https://github.com/lhotari/pulsar-playground/blob/master/test-env/issue22601.yaml and these scripts: https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601 . I haven't yet been able to reproduce.

lhotari avatar May 21 '24 07:05 lhotari

How many brokers and bookies do you have in the cluster where it reproduces?

lhotari avatar May 21 '24 07:05 lhotari

I have 6 bookies in 3 different data centers, and I left only one broker running for debugging, but I also tested with only 1 bookie before.

It runs on physical servers with CentOS Linux release 7.9.2009.

semistone avatar May 21 '24 08:05 semistone

I've been trying to reproduce the issue with local microk8s cluster by deploying Pulsar with Apache Pulsar Helm chart using this values file: https://github.com/lhotari/pulsar-playground/blob/master/test-env/issue22601.yaml and these scripts: https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601 . I haven't yet been able to reproduce.

I added the big-payload option; you need to use -bp 2 to reproduce it.

semistone avatar May 21 '24 08:05 semistone

-bp 2 to reproduce it.

I was using -bp 5 before, updated that to -bp 2. https://github.com/lhotari/pulsar-playground/commit/63035e9c4ebf656efe12bfcea859743e8ffb8a8c

What gets logged when the issue reproduces?

lhotari avatar May 21 '24 08:05 lhotari

do you happen to run with debug logging level when the issue reproduces? (just wondering if debug logging code like https://github.com/apache/pulsar/blob/a66ff17b31a01fd9ab151188e9afc9d2de8c141f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1799-L1801 has side effects, as it seems to have in https://github.com/apache/pulsar/blob/a66ff17b31a01fd9ab151188e9afc9d2de8c141f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1859-L1872)

lhotari avatar May 21 '24 08:05 lhotari

-bp 2 to reproduce it.

I was using -bp 5 before, updated that to -bp 2. lhotari/pulsar-playground@63035e9

What gets logged when the issue reproduces?

2024-05-21T08:31:49,202+0000 [broker-topic-workers-OrderedExecutor-5-0] ERROR org.apache.pulsar.common.protocol.Commands - [persistent://my-tenant/my-namespace/my-topic-1] [angus_test] Failed to peek sticky key from the message metadata
java.lang.IllegalStateException: Some required fields are missing

semistone avatar May 21 '24 08:05 semistone