Issue 3070: Fix bug where checkAllLedgers gets stuck when read throttling is enabled
Motivation
The read throttling feature in Auditor#checkAllLedgers was added by #2973, but when it was enabled, checkAllLedgers got stuck.
The cause is that, as mentioned in #3070, all the BookKeeperClientWorker-OrderedExecutor threads wait for the semaphore to be released, so no thread is left to execute EntryExistsCallback#readEntryComplete, where the semaphore is released.
Changes
Add a new thread pool to the LedgerChecker class. The BookKeeperClientWorker-OrderedExecutor threads delegate subsequent processing (including waiting for the semaphore to be released) to threads in this pool as soon as they release the semaphore.
This prevents the BookKeeperClientWorker-OrderedExecutor threads from getting stuck waiting for the semaphore to be released.
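A minimal sketch of the idea, with illustrative names (ThrottledChecker, checkerExecutor, and the permit limit are not the actual LedgerChecker code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Illustration of the approach: the client worker thread only releases the permit,
// then hands any follow-up work (which may acquire the semaphore again) to a
// separate pool owned by the checker.
class ThrottledChecker {
    private final Semaphore throttle = new Semaphore(10); // illustrative limit
    private final ExecutorService checkerExecutor = Executors.newSingleThreadExecutor();

    void readEntryComplete(int rc, Object ctx) {
        throttle.release(); // runs on the BookKeeperClientWorker thread
        // Delegate everything that might block on the semaphore:
        checkerExecutor.execute(() -> continueProcessing(rc, ctx));
    }

    void continueProcessing(int rc, Object ctx) {
        // May call throttle.acquireUninterruptibly() for follow-up reads
        // without ever blocking a client worker thread.
    }
}
```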
Master Issue: #3070
@massakam is it possible to avoid the extra executor?
I think the problem is that checkFragments/verifyLedgerFragment tries to acquire a permit from a callback of the bookieClient.readEntry call, but there is already a permit request before that call. I think we can simply add a flag so that the checkFragments/verifyLedgerFragment call chain does not request/release an extra permit in this case.
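A hypothetical sketch of that flag (all names are illustrative, not the actual LedgerChecker API): only the outermost call in the chain takes a permit, and nested calls are told to skip it.

```java
import java.util.concurrent.Semaphore;

class FlaggedChecker {
    private final Semaphore permits = new Semaphore(10); // illustrative limit

    void checkFragment(long fragmentId, boolean acquirePermit) {
        if (acquirePermit) {
            permits.acquireUninterruptibly(); // only the outermost call takes a permit
        }
        // ...issue reads; a callback that needs to re-check recurses with the
        // flag cleared, so it never requests a second permit:
        // checkFragment(nextFragmentId, false);
        // permits.release() then happens exactly once, when the chain completes.
    }
}
```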
@dlg99 Changed the implementation to achieve throttling using RateLimiter instead of Semaphore. What do you think?
I think this is simpler than adding a new flag and requesting/releasing a permit according to its value.
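A sketch of the rate-limited version, assuming Guava's RateLimiter and an illustrative rate: unlike a Semaphore, there is nothing to release when a read completes, so a missed release can no longer wedge the checker.

```java
import com.google.common.util.concurrent.RateLimiter;

class RateLimitedReader {
    private final RateLimiter readRateLimiter = RateLimiter.create(100.0); // ~100 reads/sec

    void submitRead(Runnable asyncRead) {
        readRateLimiter.acquire(); // waits for the next free slot, then always returns
        asyncRead.run();           // fire the async read; no permit to give back later
    }
}
```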
+1
@dlg99 PTAL
@massakam I think that Throttle is not a good approach here.
The requests-in-progress (RIPs) approach behaves very similarly when everything works, but it is easier to tune and self-adjusts to the remote system's dynamically changing capabilities, while a throttle is impossible to tune for that. Imagine you configured 20 RIPs and one request normally takes 100ms. Now you have 200 RPS (requests per second). The remote system (bookie) slows down and now takes 500ms per request: you drop to 40 RPS without overloading the bookie, without changing config, etc. The bookie speeds up and can process a request in 10ms: the rate dynamically goes up to 2000 RPS.
Now let's use a throttle set to 200 requests per second. In the normal case everything looks the same as above. But when the bookie slows down, the throttle keeps letting through 200 RPS even though the bookie can only handle 40, so requests pile up and eventually result in an OOM on the bookie or client side. The bookie speeds up and can process a request in 10ms: the rate stays at 200 RPS.
I'd prefer to track down where the semaphore is not released and fix that.
A throttle is simple and does not require a release etc., but it has its drawbacks, and it was not used there for a reason.
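A sketch of the RIPs pattern being defended here (names and the limit are illustrative): with N permits and per-request latency L, throughput self-adjusts to roughly N / L, which is exactly the 200 → 40 → 2000 RPS behavior described above.

```java
import java.util.concurrent.Semaphore;

class InFlightLimiter {
    private final Semaphore inFlight = new Semaphore(20); // 20 RIPs, as in the example

    void send(AsyncCall call) throws InterruptedException {
        inFlight.acquire();              // blocks while 20 requests are outstanding
        call.execute(inFlight::release); // the permit returns only when the bookie replies
    }

    interface AsyncCall {
        void execute(Runnable onComplete);
    }
}
```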
@dlg99 Hmm... you have a point there.
However, I tried to fix it in the way you suggested at first, but gave up because the code became complicated and there was a high risk of missing the release of an acquired permit. In that approach, if an additional read is required in EntryExistsCallback#readEntryComplete(), the permit must not be released immediately, but only after the subsequent read completes.
I think it would be better to move to throttling with a rate limiter rather than keep the RIPs approach and leave open the possibility of getting stuck.
Another option I have is to revert to the first approach with the extra executor.
PTAL
ping @dlg99 @hangc0276
@massakam When a user configures inFlightReadEntryNumInLedgerChecker instead of readEntryRateInLedgerChecker, it will still lead to checkAllLedgers getting stuck, right?
+1 for finding the root cause of the semaphore not being released and fixing that. /cc @massakam
@hangc0276 Reverted to the first approach, which introduces an extra executor into LedgerChecker. However, because I changed it to use daemon threads instead of user threads, the executor shuts down automatically, and we don't even need to add a shutdown method to LedgerChecker.
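A sketch of that daemon-thread setup (the thread name and pool size are illustrative): daemon threads don't keep the JVM alive, so the executor needs no explicit shutdown.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LedgerCheckerSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "LedgerCheckerWorker");
        t.setDaemon(true); // JVM can exit even if this executor is never shut down
        return t;
    });
}
```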
I can't think of any other good solution to fix this bug...
PTAL
I think the main issue is the #2973 implementation. You should not add a blocking call to throttle unless you have a sync API.
The right fix for #2973 is to check the permit, and if the process doesn't get a permit, schedule a task to retry and check the permit again rather than blocking the thread.
So we should have added tryAcquire() instead of a blocking acquire(). With that fix we won't have thread issues, and we also won't require the extra threads.
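A minimal sketch of that non-blocking pattern (the scheduler and the retry delay are illustrative assumptions):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class NonBlockingThrottle {
    private final Semaphore permits = new Semaphore(10); // illustrative limit
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void submit(Runnable task) {
        if (permits.tryAcquire()) {
            task.run(); // the task must release the permit when its async work completes
        } else {
            // No permit available: re-check later instead of blocking the worker thread.
            scheduler.schedule(() -> submit(task), 10, TimeUnit.MILLISECONDS);
        }
    }
}
```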
Fix old workflow, please see #3455 for details.
It should have already been fixed.
Great, thanks.
@hangc0276 Please have a look and re-review this PR, thanks.
@StevenLuMT I think you are probably referring to a past commit.
Yes, I have approved this PR; you can hide my old comments @massakam.
@massakam @hangc0276 @eolivelli I spent a bit of time trying to understand the root cause, and I think I have an idea of what's going on.
This gist https://gist.github.com/dlg99/505849e1010a20c6d439ecd53f500a85 is more of a demo of what's going on than an actual fix (after all, it breaks testShouldGetTwoFrgamentsIfTwoBookiesFailedInSameEnsemble - might be a test issue).
So the problem is that the semaphore is acquired in loops (and in recursive calls), mostly in loops like
for (int i = 0; i < writeSet.size(); i++)
and
for (Long entryID: entriesToBeVerified)
e.g. if writeSet is larger than the number of semaphore permits, the loop gets stuck, because the callback calls checkFragments(), which calls verifyLedgerFragment() (a recursive call), and then we get a deadlock.
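A contrived, self-contained repro of that shape (the single-threaded worker and one-permit limit are illustrative stand-ins for the ordered executor and the checker's semaphore):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class CheckerDeadlockDemo {
    public static void main(String[] args) {
        Semaphore permits = new Semaphore(1);
        ExecutorService worker = Executors.newSingleThreadExecutor();

        worker.execute(() -> {
            for (int i = 0; i < 2; i++) {         // "writeSet" larger than permit count
                permits.acquireUninterruptibly(); // second iteration blocks the worker
                worker.execute(permits::release); // the release is queued behind the
            }                                     // blocked loop and never runs -> stuck
        });
    }
}
```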
@massakam I didn't check whether we can guess the max number of entriesToBeVerified, so my workaround was to take a single permit per ReadManyEntriesCallback no matter how many entries it reads. For the writeSet case's callback we can do a similar thing by counting down the number of processed reads (sketched below). This is the simplest solution I could think of, but it limits segments in progress rather than BK reads in progress.
Another approach could be to get rid of the recursion and use a queue instead; that would be more work.
It is possible you can come up with a better solution keeping the root cause in mind.
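A sketch of that counting-down idea (all names are illustrative): one permit covers the whole batch, and an AtomicInteger releases it when the last read completes.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BatchPermit {
    private final Semaphore permits = new Semaphore(10); // illustrative limit

    void readMany(int entryCount, EntryReader reader) {
        permits.acquireUninterruptibly();             // one permit for the whole batch
        AtomicInteger pending = new AtomicInteger(entryCount);
        for (int i = 0; i < entryCount; i++) {
            reader.read(i, () -> {
                if (pending.decrementAndGet() == 0) { // last outstanding read finished
                    permits.release();
                }
            });
        }
    }

    interface EntryReader {
        void read(long entryId, Runnable onComplete);
    }
}
```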
@dlg99
I have a question about this gist. Why is releasePermit() added on line 79 of LedgerChecker.java?
https://gist.github.com/dlg99/505849e1010a20c6d439ecd53f500a85#file-pr3214-diff-L79
EntryExistsCallback#readEntryComplete also calls releasePermit(), so it seems that one extra permit is released.
https://github.com/apache/bookkeeper/blob/7004d994938e914561d8343a96d9d75ce98b2837/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java#L321
And if releasePermit() on line 75 of LedgerChecker.java is removed, TestLedgerChecker times out.