Broker direct memory leak due to BKNotEnoughBookiesException
Describe the bug
The benchmark test is for the max throughput scenario with 1GB/s message writing and 1GB/s message reading.
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,714+0000 [BookKeeperClientWorker-OrderedExecutor-20-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=223240,loopId=5cca03e4) Exception updating
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,714+0000 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN org.apache.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:223240, change-id:0000000001)][attempt:1] Exception changing ensemble
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,714+0000 [BookKeeperClientWorker-OrderedExecutor-20-0] ERROR org.apache.bookkeeper.client.LedgerHandle - Closing ledger 223240 due to NotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,714+0000 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=233784,loopId=0147ac3a) Exception updating
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,714+0000 [BookKeeperClientWorker-OrderedExecutor-12-0] WARN org.apache.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:233784, change-id:0000000001)][attempt:1] Exception changing ensemble
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,714+0000 [BookKeeperClientWorker-OrderedExecutor-12-0] ERROR org.apache.bookkeeper.client.LedgerHandle - Closing ledger 233784 due to NotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,752+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=238581,loopId=3062cb0d) Exception updating
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,752+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN org.apache.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:238581, change-id:0000000001)][attempt:1] Exception changing ensemble
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,752+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] ERROR org.apache.bookkeeper.client.LedgerHandle - Closing ledger 238581 due to NotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,777+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=259791,loopId=15db4235) Exception updating
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
Aug 08 17:44:23 ip-10-0-0-158.us-west-2.compute.internal pulsar[25809]: 2022-08-08T17:44:23,777+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] WARN org.apache.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:259791, change-id:0000000001)][attempt:1] Exception changing ensemble

The second broker does not have the above error logs.
Additional context
Built from branch-2.11 with BookKeeper 4.15.1.
@codelipenghui Hi, I am also verifying the same scenario. If one of the three bookies has a large network delay or responds slowly, direct memory keeps rising until the broker eventually hits an OOM. Enabling io.netty.leakDetectionLevel reports a large number of memory leaks. Adding back pressure delays the OOM, but it does not fundamentally solve the problem. Is there a known cause for this issue? Is there a solution?
We are working on this issue and will leave a comment here when there are updates.
The issue had no activity for 30 days, mark with Stale label.
Do you have any updates on this OOM issue? I have been tracking similar issues recently, and I am thinking we probably need to improve Pulsar's back pressure implementation. Specifically, if the bookie cannot handle the broker's load, then the broker should provide back pressure to producers/consumers so that the broker works without running out of memory.
One side effect of reducing overall throughput to protect broker memory is that load balancing could become harder: the broker might not look like it is under pressure even though, in a sense, it is, and other brokers might be able to handle higher throughput, depending on the BookKeeper topology.
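To make the back pressure idea above concrete, here is a minimal sketch of a byte budget for writes that have been accepted from producers but not yet acknowledged by the bookies. PendingWriteLimiter and its methods are hypothetical names made up for illustration; this is not an existing Pulsar or BookKeeper class, and it does not show how the broker handles this today.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical limiter (not a Pulsar or BookKeeper API). It caps the number of
 * bytes that are in flight between "received from a producer" and
 * "write completed or failed on the bookie side".
 */
public class PendingWriteLimiter {
    private final long maxPendingBytes;
    private final AtomicLong pendingBytes = new AtomicLong();

    public PendingWriteLimiter(long maxPendingBytes) {
        if (maxPendingBytes <= 0) {
            throw new IllegalArgumentException("maxPendingBytes must be positive");
        }
        this.maxPendingBytes = maxPendingBytes;
    }

    /**
     * Tries to reserve room for one write. A false return is the back pressure
     * signal: the caller should stop accepting data from the producer for now.
     */
    public boolean tryReserve(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must not be negative");
        }
        while (true) {
            long current = pendingBytes.get();
            long next = current + bytes;
            if (next > maxPendingBytes) {
                return false; // budget exhausted, do not buffer more
            }
            if (pendingBytes.compareAndSet(current, next)) {
                return true;
            }
            // Another thread changed the counter; retry with the fresh value.
        }
    }

    /**
     * Returns budget. This must be called on both success and failure of the
     * write, otherwise failed writes would permanently consume the budget.
     */
    public void release(long bytes) {
        pendingBytes.addAndGet(-bytes);
    }

    public long pendingBytes() {
        return pendingBytes.get();
    }
}
```

In a broker, a false return from tryReserve would presumably translate into pausing reads on the producer connection (for example by turning off Netty auto-read on that channel) until release frees enough budget. The sketch only bounds how much data is buffered. It would not fix a genuine buffer leak on an error path such as the BKNotEnoughBookiesException case reported here.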