[Bug] Deadlock in broker service while initializing bkClient

Open Meet0861 opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

2.10.6

Minimal reproduce step

Not able to reproduce. But it's happening intermittently in our running clusters (mostly observed after rollouts) since upgrading from 2.9.3 to 2.10.6.

What did you expect to see?

An exception should be thrown with a valid reason, if there is one, and the thread should be released.

What did you see instead?

Threads get blocked and produce/consume requests time out. Also, the faulty broker stopped serving anything and all of its bundles were unloaded to another broker.

Exception at Client side: `WARN 8 --- [-client-io-18-4] o.a.p.client.impl.ConnectionHandler : [persistent://tenant/namespace/topic-partition-34] [tenant/namespace] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: Connection already closed

2024-04-22T10:29:31.898+05:30 WARN 8 --- [-client-io-18-4] o.a.p.client.impl.ConnectionHandler : [persistent://tenant/namespace/topic-partition-34] [tenant/namespace] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException: Connection already closed -- Will try again in 57.264 s`

Anything else?

We have analysed the thread dumps and found a possible deadlock. [thread dump] Here, we can see that thread metadata-store-10-1 is waiting for 2098, and 2098 is held by pulsar-io-4-7. pulsar-io-4-7 is not releasing 2098 because it is waiting for d898. So where is d898 stuck? The holder of d898 is stuck in BookieRackAffinityMapping.setConf(), waiting for a completable future.
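The shape of this stall can be sketched in a few lines of plain Java. This is a simplified two-party model and not Pulsar code: the class `MetadataThreadStallDemo`, the `lock` object, the `rackInfo` future and the thread name are all hypothetical stand-ins, and it assumes the awaited future can only be completed by the (single) metadata store thread. The caller holds a lock and waits on the future; the metadata thread is busy with a callback that needs the same lock, so the task that would complete the future never gets to run. A timeout is used here only so the demo terminates; in the reported case the wait has no way out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class MetadataThreadStallDemo {

    // Returns true if the future completed within the timeout, false if the wait timed out.
    static boolean futureCompletesInTime(long timeoutMillis) {
        // Hypothetical stand-in for a single-threaded metadata store executor.
        ExecutorService metadataStore =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "metadata-store-demo"));
        Object lock = new Object();
        CompletableFuture<String> rackInfo = new CompletableFuture<>();
        try {
            synchronized (lock) {
                // Task 1: a callback running on the metadata thread that needs the lock we hold.
                metadataStore.execute(() -> {
                    synchronized (lock) {
                        // would do some work here
                    }
                });
                // Task 2: the completion we are waiting for, queued behind task 1.
                metadataStore.execute(() -> rackInfo.complete("rack-1"));
                try {
                    // Blocking wait while still holding the lock: task 1 cannot finish,
                    // so task 2 never runs and the future never completes.
                    rackInfo.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return true;
                } catch (TimeoutException e) {
                    return false;
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
        } finally {
            // The lock is released by now, so the queued tasks can drain and the thread exits.
            metadataStore.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed in time: " + futureCompletesInTime(300));
    }
}
```

Running `main` reports that the future did not complete in time, because the only thread able to complete it is parked on the lock held by the waiter.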

Can this be related to https://github.com/apache/pulsar/pull/20944 ?

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

Meet0861 • May 13 '24 05:05

Thanks for the issue report.

Is this similar to #20148 which is fixed by #21096 ?

In a Slack thread I made these comments some time ago:

The deadlock issue might be caused by https://github.com/apache/pulsar/pull/18672 .

It must be a different problem. The thread dump was very useful. The line numbers seemed to match 2.10.6. One possible way to solve the problem would be to change thenAccept on this line https://github.com/apache/pulsar/blob/c1d8630b13e782935def3c4b12b59ae9aa8e5541/pul[…]c/main/java/org/apache/pulsar/broker/service/BrokerService.java to thenAcceptAsync. That would prevent the metadata store thread from getting blocked and, in effect, deadlocked.
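For readers less familiar with the difference, here is a minimal sketch in plain Java (not Pulsar code; the class `CallbackThreadDemo` and the thread names `metadata-store` and `callback-pool` are made up for illustration). A callback attached with `thenAccept` to a future that is not yet complete runs on whichever thread later completes that future, while `thenAcceptAsync` with an explicit executor hands the callback to that executor instead, so the completing thread is free again immediately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CallbackThreadDemo {

    // Returns the name of the thread that executed the callback.
    static String callbackThread(boolean async) {
        // Hypothetical stand-ins for the metadata store thread and a separate callback executor.
        ExecutorService metadataStore =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "metadata-store"));
        ExecutorService callbacks =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "callback-pool"));
        try {
            CompletableFuture<String> source = new CompletableFuture<>();
            CompletableFuture<String> ranOn = new CompletableFuture<>();
            // Register the callback BEFORE the source future is completed.
            if (async) {
                source.thenAcceptAsync(
                        v -> ranOn.complete(Thread.currentThread().getName()), callbacks);
            } else {
                source.thenAccept(v -> ranOn.complete(Thread.currentThread().getName()));
            }
            // The "metadata store" thread completes the future.
            metadataStore.execute(() -> source.complete("value"));
            return ranOn.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            metadataStore.shutdownNow();
            callbacks.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("thenAccept ran on:      " + callbackThread(false));
        System.out.println("thenAcceptAsync ran on: " + callbackThread(true));
    }
}
```

In this sketch the `thenAccept` callback runs on `metadata-store` and the `thenAcceptAsync` callback runs on `callback-pool`. If the callback blocks (for example on a lock), only the first variant ties up the thread that other completions depend on.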

It seems that the same bug is also in the master branch, so it will be useful to report it.

lhotari • May 13 '24 06:05

@lhotari #21096 seems to have already been cherry-picked into 2.10.

Meet0861 • May 13 '24 13:05

This will be fixed by #22846 and #22853

lhotari • Jun 05 '24 18:06

@lhotari It seems that the fix https://github.com/apache/pulsar/pull/22846 has already been cherry-picked into branch-3.0 but https://github.com/apache/pulsar/pull/22853 has not, and both are tagged with release/3.0.6. Are there any plans to cherry-pick #22853 into branch-3.0 soon? When is the 3.0.6 release planned?

Meet0861 • Jun 19 '24 10:06

> @lhotari seems like this fix #22846 is already cherry-picked in branch-3.0 but #22853 is not and both are tagged with release/3.0.6. Any plans to cherry-pick the #22853 in branch-3.0 soon? when is 3.0.6 release planned?

@Meet0861 I completed cherry-picking #22853 to branch-3.0. The Pulsar 3.0.6 release is planned for after BookKeeper 4.16.6 becomes available, possibly within the next few weeks.

lhotari • Jun 19 '24 11:06