Fix deadlock in InboundLedgers and NetworkOPs
High Level Overview of Change
Fixes a deadlock bug in 2.2.2 release.
Context of Change
Functions acquireAsync and NetworkOPsImp::recvValidation contains lock.lock() which will deadlock if an exception is thrown while the (non-recursive) mutex is owned by the calling thread, that is before lock.unlock()
In 2.2.2 release we switched some operations from synchronous to asynchronous, guarded by mutexes, but failed to account for exception safety of the critical section, which may result in a deadlock if an exception is thrown before lock.unlock() inside the try block. It is extremely unlikely that a deadlock will actually occur in practice, since the possible exceptions in this section of code are scarce.
The solution adopted in this PR is to move ScopedUnlock from LedgerMaster.cpp to basics/scope.h (adjusting casing to match other utilities in this file, to scope_unlock) and to use this type to create a RAII unlock, removing the problematic lock.lock().
An alternative solution might be to abstract away the "pending operation" with the corresponding mutex, and then use it where appropriate, but I wanted this change to generalise an existing lower-level utility (i.e. scoped unlock) rather than write a new one. The "pending operation" approach could be further generalised into structured asynchronous operations, which would place it in a much larger PR.
Type of Change
- [x] Bug fix (non-breaking change which fixes an issue)
Codecov Report
Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.
Project coverage is 77.9%. Comparing base (
c5c0e70) to head (b19505a). Report is 2 commits behind head on develop.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/xrpld/app/misc/NetworkOPs.cpp | 0.0% | 2 Missing :warning: |
| src/xrpld/app/ledger/detail/LedgerMaster.cpp | 75.0% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## develop #5124 +/- ##
=========================================
+ Coverage 77.7% 77.9% +0.2%
=========================================
Files 779 782 +3
Lines 66015 66614 +599
Branches 8156 8140 -16
=========================================
+ Hits 51261 51887 +626
+ Misses 14754 14727 -27
| Files with missing lines | Coverage Δ | |
|---|---|---|
| include/xrpl/basics/scope.h | 100.0% <100.0%> (ø) |
|
| src/xrpld/app/ledger/detail/InboundLedgers.cpp | 38.6% <100.0%> (ø) |
|
| src/xrpld/app/ledger/detail/LedgerMaster.cpp | 43.9% <75.0%> (-0.4%) |
:arrow_down: |
| src/xrpld/app/misc/NetworkOPs.cpp | 70.1% <0.0%> (+<0.1%) |
:arrow_up: |
Is it just me, or do these changes not actually change any behavior?
In
InboundLedgers::acquireAsync(), we have this sequence before:
- 142:
unlock().- 143: call
acquire(). If it does not throw, continue to line 157. If it does throw, one of the exception handlers is entered (no exception can skip both handlers), neither throws (they just log warnings), and execution continues to line 157. Either way, we end up at line 157.- 157:
lock().
The change in behaviour is if we have an exception before unlock() at the line 142, e.g. inside insert(). If this happens, then the lock() outside of the try-catch block will deadlock (because the mutex is already owned, by the very same thread).
Now, I prefer a RAII type like
scope_unlock, but I think the only difference in behavior it produces here is that the call tolock()is moved before the exception handler is entered (if one is entered). But that is not a material difference. Is that right?Very similar story for
NetworkOPsImp::recvValidation.
Yes, same applies - there's an insert() before unlock() and if it throws, then we will deadlock.
The RAII unlock_scope fixes both possible deadlocks, because it removes the problematic lock() below the try-catch blocks.
This may be redundant with the
CanProcessclass in #5126. I'm biased, but I prefer my solution because it hides all the locking in a single class. I'm open to being convinced otherwise, though.
you are right that your PR solves this problem, however given the number of other issues that your PR solves, and the associated code churn, it might be a little until it is tested and approved. In the meantime, if we get this merged and, some time after, your PR merged as well, we will eventually end up with unlock_scope moved from LedgerMaster.cpp to scope.h which is in itself (I think) an improvement.