Issue 3070: Fix bug where checkAllLedgers gets stuck when read throttling is enabled
Motivation
The read throttling feature in Auditor#checkAllLedgers was added by #2973, but when it was enabled, checkAllLedgers got stuck.
The cause is that, as mentioned in #3070, all the BookKeeperClientWorker-OrderedExecutor threads wait for the semaphore to be released, so no thread is left to execute EntryExistsCallback#readEntryComplete, where the semaphore is released.
Changes
Add a new thread pool to the LedgerChecker class. The BookKeeperClientWorker-OrderedExecutor threads delegate subsequent processing (including waiting for the semaphore to be released) to threads in this pool as soon as they release the semaphore.
This prevents the BookKeeperClientWorker-OrderedExecutor threads from getting stuck waiting for the semaphore to be released.
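A minimal sketch of the idea, with illustrative names (ThrottledChecker, checkerExecutor, and the permit limit are not the actual LedgerChecker code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Illustration of the approach: the client worker thread only releases the permit,
// then hands any follow-up work (which may acquire the semaphore again) to a
// separate pool owned by the checker.
class ThrottledChecker {
    private final Semaphore throttle = new Semaphore(10); // illustrative limit
    private final ExecutorService checkerExecutor = Executors.newSingleThreadExecutor();

    void readEntryComplete(int rc, Object ctx) {
        throttle.release(); // runs on the BookKeeperClientWorker thread
        // Delegate everything that might block on the semaphore:
        checkerExecutor.execute(() -> continueProcessing(rc, ctx));
    }

    void continueProcessing(int rc, Object ctx) {
        // May call throttle.acquireUninterruptibly() for follow-up reads
        // without ever blocking a client worker thread.
    }
}
```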
Master Issue: #3070
@massakam is it possible to avoid the extra executor?
I think the problem is that checkFragments/verifyLedgerFragment tries to acquire a permit from a callback of the bookieClient.readEntry call, but there is already a permit request before that call. I think we can simply add a flag so that the checkFragments/verifyLedgerFragment call chain does not request/release an extra permit in this case.
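A hypothetical sketch of that flag (all names are illustrative, not the actual LedgerChecker API): only the outermost call in the chain takes a permit, and nested calls are told to skip it.

```java
import java.util.concurrent.Semaphore;

class FlaggedChecker {
    private final Semaphore permits = new Semaphore(10); // illustrative limit

    void checkFragment(long fragmentId, boolean acquirePermit) {
        if (acquirePermit) {
            permits.acquireUninterruptibly(); // only the outermost call takes a permit
        }
        // ...issue reads; a callback that needs to re-check recurses with the
        // flag cleared, so it never requests a second permit:
        // checkFragment(nextFragmentId, false);
        // permits.release() then happens exactly once, when the chain completes.
    }
}
```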
@dlg99 Changed the implementation to achieve throttling using RateLimiter instead of Semaphore. What do you think?
I think this is simpler than adding a new flag and requesting/releasing a permit according to its value.
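A sketch of the rate-limited version, assuming Guava's RateLimiter and an illustrative rate: unlike a Semaphore, there is nothing to release when a read completes, so a missed release can no longer wedge the checker.

```java
import com.google.common.util.concurrent.RateLimiter;

class RateLimitedReader {
    private final RateLimiter readRateLimiter = RateLimiter.create(100.0); // ~100 reads/sec

    void submitRead(Runnable asyncRead) {
        readRateLimiter.acquire(); // waits for the next free slot, then always returns
        asyncRead.run();           // fire the async read; no permit to give back later
    }
}
```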
+1
@dlg99 PTAL
@massakam I think that Throttle is not a good approach here.
The requests-in-progress (RIPs) approach behaves very similarly when everything works, but it is easier to tune and self-adjusts to the remote system's dynamically changing capabilities, while a throttle is impossible to tune for that. Imagine you configured 20 RIPs and one request normally takes 100ms. Now you have 200 RPS (requests per second). The remote system (bookie) slows down and now takes 500ms per request: you drop to 40 RPS without overloading the bookie, without changing config, etc. The bookie speeds up and can process a request in 10ms: the rate dynamically goes up to 2000 RPS.
Now let's use a throttle set to 200 requests per second. In the normal case everything looks the same as above. But when the bookie slows down, the throttle keeps letting through 200 RPS even though the bookie can only handle 40, so requests pile up and eventually result in an OOM on the bookie or client side. The bookie speeds up and can process a request in 10ms: the rate stays at 200 RPS.
I'd prefer to track down where the semaphore is not released and fix that.
A throttle is simple and does not require a release etc., but it has its drawbacks, and it was not used there for a reason.
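A sketch of the RIPs pattern being defended here (names and the limit are illustrative): with N permits and per-request latency L, throughput self-adjusts to roughly N / L, which is exactly the 200 → 40 → 2000 RPS behavior described above.

```java
import java.util.concurrent.Semaphore;

class InFlightLimiter {
    private final Semaphore inFlight = new Semaphore(20); // 20 RIPs, as in the example

    void send(AsyncCall call) throws InterruptedException {
        inFlight.acquire();              // blocks while 20 requests are outstanding
        call.execute(inFlight::release); // the permit returns only when the bookie replies
    }

    interface AsyncCall {
        void execute(Runnable onComplete);
    }
}
```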
@dlg99 Hmm... you have a point there.
However, I tried to fix it in the way you suggested at first, but gave up because the code became complicated and there was a high risk of missing the release of an acquired permit. In that approach, if an additional read is required in EntryExistsCallback#readEntryComplete(), the permit must not be released immediately, but only after the subsequent read completes.
I think it would be better to move to throttling with a rate limiter rather than keep the RIPs approach and leave open the possibility of getting stuck.
Another option I have is to revert to the first approach with the extra executor.
PTAL
ping @dlg99 @hangc0276
@massakam When a user configures inFlightReadEntryNumInLedgerChecker instead of readEntryRateInLedgerChecker, it will still lead to checkAllLedgers getting stuck, right?
+1 for finding the root cause of the semaphore not being released and fixing that. /cc @massakam
@hangc0276 Reverted to the first approach, which introduces an extra executor into LedgerChecker. However, because I changed it to use daemon threads instead of user threads, the executor shuts down automatically, and we don't even need to add a shutdown method to LedgerChecker.
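A sketch of that daemon-thread setup (the thread name and pool size are illustrative): daemon threads don't keep the JVM alive, so the executor needs no explicit shutdown.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LedgerCheckerSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "LedgerCheckerWorker");
        t.setDaemon(true); // JVM can exit even if this executor is never shut down
        return t;
    });
}
```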
I can't think of any other good solution to fix this bug...
PTAL
I think the main issue is the #2973 implementation. You should not add a blocking call to throttle unless you have a sync API.
The right fix for #2973 is to check the permit, and if the process doesn't get a permit, schedule a task to retry and check the permit again rather than blocking the thread.
So we should have added tryAcquire() instead of a blocking acquire(). With that fix we won't have thread issues, and we also won't require the extra threads.
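A minimal sketch of that non-blocking pattern (the scheduler and the retry delay are illustrative assumptions):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class NonBlockingThrottle {
    private final Semaphore permits = new Semaphore(10); // illustrative limit
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void submit(Runnable task) {
        if (permits.tryAcquire()) {
            task.run(); // the task must release the permit when its async work completes
        } else {
            // No permit available: re-check later instead of blocking the worker thread.
            scheduler.schedule(() -> submit(task), 10, TimeUnit.MILLISECONDS);
        }
    }
}
```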
Fix old workflow, please see #3455 for details.
It should have already been fixed.
Great, thanks.
@hangc0276 Please have a look and re-review this PR, thanks.
@StevenLuMT I think you are probably referring to a past commit.
Yes, I have approved this PR; you can hide my old comments @massakam.
@massakam @hangc0276 @eolivelli I spent a bit of time trying to understand the root cause, and I think I have an idea of what's going on.
This gist https://gist.github.com/dlg99/505849e1010a20c6d439ecd53f500a85 is more of a demo of what's going on than an actual fix (after all, it breaks testShouldGetTwoFrgamentsIfTwoBookiesFailedInSameEnsemble - might be a test issue).
So the problem is that the semaphore is acquired in loops (and in recursive calls), mostly in loops like
for (int i = 0; i < writeSet.size(); i++)
and
for (Long entryID: entriesToBeVerified)
e.g. if writeSet is larger than the number of semaphore permits, the loop gets stuck, because the callback calls checkFragments(), which calls verifyLedgerFragment() (a recursive call), and then we get a deadlock.
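A contrived, self-contained repro of that shape (the single-threaded worker and one-permit limit are illustrative stand-ins for the ordered executor and the checker's semaphore):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class CheckerDeadlockDemo {
    public static void main(String[] args) {
        Semaphore permits = new Semaphore(1);
        ExecutorService worker = Executors.newSingleThreadExecutor();

        worker.execute(() -> {
            for (int i = 0; i < 2; i++) {         // "writeSet" larger than permit count
                permits.acquireUninterruptibly(); // second iteration blocks the worker
                worker.execute(permits::release); // the release is queued behind the
            }                                     // blocked loop and never runs -> stuck
        });
    }
}
```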
@massakam I didn't check whether we can guess the max number of entriesToBeVerified, so my workaround was to take a single permit per ReadManyEntriesCallback no matter how many entries it reads. For the writeSet case's callback we can do a similar thing by counting down the number of processed reads (sketched below). This is the simplest solution I could think of, but it limits segments in progress rather than BK reads in progress.
Another approach could be to get rid of the recursion and use a queue instead; that would be more work.
It is possible you can come up with a better solution keeping the root cause in mind.
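A sketch of that counting-down idea (all names are illustrative): one permit covers the whole batch, and an AtomicInteger releases it when the last read completes.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BatchPermit {
    private final Semaphore permits = new Semaphore(10); // illustrative limit

    void readMany(int entryCount, EntryReader reader) {
        permits.acquireUninterruptibly();             // one permit for the whole batch
        AtomicInteger pending = new AtomicInteger(entryCount);
        for (int i = 0; i < entryCount; i++) {
            reader.read(i, () -> {
                if (pending.decrementAndGet() == 0) { // last outstanding read finished
                    permits.release();
                }
            });
        }
    }

    interface EntryReader {
        void read(long entryId, Runnable onComplete);
    }
}
```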
@dlg99
I have a question about this gist. Why is releasePermit() added on line 79 of LedgerChecker.java?
https://gist.github.com/dlg99/505849e1010a20c6d439ecd53f500a85#file-pr3214-diff-L79
EntryExistsCallback#readEntryComplete also calls releasePermit(), so it seems that one extra permit is released.
https://github.com/apache/bookkeeper/blob/7004d994938e914561d8343a96d9d75ce98b2837/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerChecker.java#L321
And if releasePermit() on line 75 of LedgerChecker.java is removed, TestLedgerChecker times out.