pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[Bug] NPE throwed when recovering ledger: Cannot read field "distributionSchedule" because "this.lh" is null

Open zhaohaidao opened this issue 3 years ago • 8 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Version

Pulsar 2.11

Minimal reproduce step

I don't know how to reproduce it

What did you expect to see?

Recover ledger successfully

What did you see instead?

9333 Oct 23 16:55:56 test_host pulsar[123890]: 2022-10-23T16:55:56,889+0800 [BookKeeperClientWorker-OrderedExecutor-1-0] WARN org.apache.pulsar.broker.service.BrokerSe 9333 rvice - Failed to create topic persistent://benchmark/ns-test/test-test-reusing-topic500-0000-partition-426 9334 Oct 23 16:55:56 test_host pulsar[123890]: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger 9335 Oct 23 16:55:56 test_host pulsar[123890]: 2022-10-23T16:55:56,889+0800 [main-EventThread] INFO org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [10.225.50. 9335 43:3181, 10.224.145.38:3181, 10.225.105.4:3181] for ledger: 24474 9336 Oct 23 16:55:56 test_host pulsar[123890]: 2022-10-23T16:55:56,889+0800 [BookKeeperClientWorker-OrderedExecutor-26-0] ERROR org.apache.bookkeeper.common.util.SafeRu 9336 nnable - Unexpected throwable caught 9337 Oct 23 16:55:56 test_host pulsar[123890]: java.lang.NullPointerException: Cannot read field "distributionSchedule" because "this.lh" is null 9338 Oct 23 16:55:56 test_host pulsar[123890]: at org.apache.bookkeeper.client.PendingAddOp.safeRun(PendingAddOp.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.1 9338 6.1-SNAPSHOT.jar:4.16.1-SNAPSHOT] 9339 Oct 23 16:55:56 test_host pulsar[123890]: at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.1 9339 6.1-SNAPSHOT.jar:4.16.1-SNAPSHOT] 9340 Oct 23 16:55:56 test_host pulsar[123890]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?] 9341 Oct 23 16:55:56 test_host pulsar[123890]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?] 9342 Oct 23 16:55:56 test_host pulsar[123890]: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.77.F 9342 inal.jar:4.1.77.Final] 9343 Oct 23 16:55:56 test_host pulsar[123890]: at java.lang.Thread.run(Thread.java:833) ~[?:?]

Anything else?

No response

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

zhaohaidao avatar Oct 23 '22 14:10 zhaohaidao

this should be a bk client issue @hangc0276 PTAL

tjiuming avatar Oct 28 '22 03:10 tjiuming

@zhaohaidao I think the issue is not for Pulsar 2.11 right? I noticed the bookkeeper version should be a version that builds on the bookkeeper master branch. I just want to make sure it's not a regression for the 2.11.0 release.

codelipenghui avatar Oct 28 '22 03:10 codelipenghui

From the exception stack, it says the lh is null, but we have already checked it is not null. One possible reason is that the lh has been released, but we need to figure out how it happens. https://github.com/apache/bookkeeper/blob/ef31c7a37415c063b2f485f63198ec28e8f9f0fa/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L247-L281

Would you please try to confirm

  1. Whether branch-4.15 has the same problem.
  2. Please try to add more log here and reproduce it to get more information.

hangc0276 avatar Oct 28 '22 05:10 hangc0276

java.lang.NullPointerException

Yes. it is a bookkeeper client issue, and the bkclient built on the master branch. Sorry for submitting to the wrong place.

zhaohaidao avatar Oct 28 '22 09:10 zhaohaidao

I am not sure if this issue is related to #3321, which try to fix ledger handle leak issues

@hangc0276 @codelipenghui

zhaohaidao avatar Oct 28 '22 09:10 zhaohaidao

Is there any error logs? If it is caused by that PR, there should have LedgerClosed or Interrupt errors I think

zymap avatar Oct 28 '22 12:10 zymap

The error log context is as followed. Not sure if it can help. @zymap

11338 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] WARN org.apache.bookkeeper.client.RackawareEn 11338 semblePlacementPolicyImpl - Failed to choose a bookie: excluded [Bookie:10.225.50.43:3181, Bookie:10.225.105.4:3181, Bookie:10.224.145.38:3181], fallback to choose bookie 11338 randomly from the cluster. 11339 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] WARN org.apache.bookkeeper.client.RackawareEn 11339 semblePlacementPolicyImpl - Failed to find 1 bookies : excludeBookies [Bookie:10.225.50.43:3181, Bookie:10.225.105.4:3181, Bookie:10.224.145.38:3181], allBookies [<Bookie: 11339 10.225.105.4:3181>, Bookie:10.225.50.43:3181, Bookie:10.224.145.38:3181]. 11340 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] WARN org.apache.bookkeeper.client.RackawareEn 11340 semblePlacementPolicyImpl - Failed to find 1 bookies : excludeBookies [Bookie:10.225.50.43:3181, Bookie:10.225.105.4:3181, Bookie:10.224.145.38:3181], allBookies [<Bookie: 11340 10.224.145.38:3181>, Bookie:10.225.105.4:3181, Bookie:10.225.50.43:3181]. 11341 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] WARN org.apache.bookkeeper.client.RackawareEn 11341 semblePlacementPolicyImpl - Failed to choose a bookie: excluded [Bookie:10.225.50.43:3181, Bookie:10.225.105.4:3181, Bookie:10.224.145.38:3181], fallback to choose bookie 11341 randomly from the cluster. 11342 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] WARN org.apache.bookkeeper.client.RackawareEn 11342 semblePlacementPolicyImpl - Failed to find 1 bookies : excludeBookies [Bookie:10.225.50.43:3181, Bookie:10.225.105.4:3181, Bookie:10.224.145.38:3181], allBookies [<Bookie: 11342 10.225.105.4:3181>, Bookie:10.224.145.38:3181, Bookie:10.225.50.43:3181]. 11343 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] ERROR org.apache.bookkeeper.client.ReadOnlyLed 11343 gerHandle - Could not get additional bookie to remake ensemble, closing ledger: 3116 11344 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] ERROR org.apache.bookkeeper.client.PendingAddO 11344 p - Write of ledger entry to quorum failed: L3116 E1032 11345 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,055+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] ERROR org.apache.bookkeeper.client.LedgerRecov 11345 eryOp - Failure NotEnoughBookiesException: Not enough non-faulty bookies available while writing entry: 1033 while recovering ledger: 3116 11346 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,056+0800 [BookKeeperClientWorker-OrderedExecutor-13-0] ERROR org.apache.bookkeeper.mledger.impl.Manag 11346 edLedgerImpl - [benchmark/ns-test/persistent/test-test-reusing-topic500-0000-partition-155] Failed to open ledger 3116: Error while recovering ledger 11347 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,056+0800 [BookKeeperClientWorker-OrderedExecutor-13-0] ERROR org.apache.bookkeeper.mledger.impl.Manag 11347 edLedgerFactoryImpl - [benchmark/ns-test/persistent/test-test-reusing-topic500-0000-partition-155] Failed to initialize managed ledger: Error while recovering ledger 11348 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,056+0800 [BookKeeperClientWorker-OrderedExecutor-13-0] INFO org.apache.bookkeeper.mledger.impl.Manag 11348 edLedgerImpl - [benchmark/ns-test/persistent/test-test-reusing-topic500-0000-partition-155] Closing managed ledger 11349 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,056+0800 [BookKeeperClientWorker-OrderedExecutor-13-0] WARN org.apache.pulsar.broker.service.BrokerS 11349 ervice - Failed to create topic persistent://benchmark/ns-test/test-test-reusing-topic500-0000-partition-155 11350 Oct 27 18:02:42 test_host pulsar[166925]: org.apache.bookkeeper.mledger.ManagedLedgerException: Error while recovering ledger 11351 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,056+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] ERROR org.apache.bookkeeper.common.util.SafeRu 11351 nnable - Unexpected throwable caught 11352 Oct 27 18:02:42 test_host pulsar[166925]: java.lang.NullPointerException: Cannot read field "distributionSchedule" because "this.lh" is null 11353 Oct 27 18:02:42 test_host pulsar[166925]: at org.apache.bookkeeper.client.PendingAddOp.safeRun(PendingAddOp.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.1 11353 6.1-SNAPSHOT.jar:4.16.1-SNAPSHOT] 11354 Oct 27 18:02:42 test_host pulsar[166925]: at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.1 11354 6.1-SNAPSHOT.jar:4.16.1-SNAPSHOT] 11355 Oct 27 18:02:42 test_host pulsar[166925]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?] 11356 Oct 27 18:02:42 test_host pulsar[166925]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?] 11357 Oct 27 18:02:42 test_host pulsar[166925]: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.77.F 11357 inal.jar:4.1.77.Final] 11358 Oct 27 18:02:42 test_host pulsar[166925]: at java.lang.Thread.run(Thread.java:833) ~[?:?] 11359 Oct 27 18:02:42 test_host pulsar[166925]: 2022-10-27T18:02:42,056+0800 [BookKeeperClientWorker-OrderedExecutor-22-0] WARN org.apache.bookkeeper.client.LedgerRecov 11359 eryOp - Successfully read entry 1033 for ledger 3116, but readDone is already false

zhaohaidao avatar Oct 28 '22 13:10 zhaohaidao

Does this error still happen? I am not sure about it is related to #3321. In 3321, the recycling happened at the beginning of add entry operation. If it is recycled, the pendingAddOp won't be executed. Because you are using the master code, could you please take the following experiment:

  • Change the code base to branch-4.15 to test if the latest code of 4.15 works well. Currently, the pulsar still using 4.15.x code.
  • Rever the #3321. in the master to check if it can reproduce the same error.

@zhaohaidao

zymap avatar Oct 31 '22 03:10 zymap

Does this error still happen? I am not sure about it is related to #3321. In 3321, the recycling happened at the beginning of add entry operation. If it is recycled, the pendingAddOp won't be executed. Because you are using the master code, could you please take the following experiment:

  • Change the code base to branch-4.15 to test if the latest code of 4.15 works well. Currently, the pulsar still using 4.15.x code.
  • Rever the #3321. in the master to check if it can reproduce the same error.

@zhaohaidao

Sorry for replying late. Through analysis, It can be sure this issue is nothing to do with #3321. And branch 4.15 has the same problem as you expected. Issue analysis can be seen in: https://github.com/apache/bookkeeper/issues/3618

zhaohaidao avatar Nov 07 '22 10:11 zhaohaidao

The issue had no activity for 30 days, mark with Stale label.

github-actions[bot] avatar Dec 08 '22 02:12 github-actions[bot]