refactor: acquireAsync will dispatch a job, not be dispatched
High Level Overview of Change

This change does a few different things:

- Adds a new RAII class, `CanProcess`, to use where needed for job deduplication.
- Adds and improves logging across a variety of classes. Uses the new `to_short_string(base_uint)` function to help disambiguate some messages about peers and ledgers.
- Exits `InboundLedger::onTimer` without sending new requests if the complete ledger is available. (Duh.)
- Changes the interface and usage of `InboundLedgers::acquireAsync` so that it dispatches a job rather than being called from a dispatched job. This will reduce the number of jobs spun up only to immediately exit.
Context of Change
This is the safe subset of changes from #5301, which has known problems, plus the change to dispatch a job from `acquireAsync`.
Type of Change
- [X] Refactor (non-breaking change that only restructures code)
- [X] Performance (increase or change in throughput and/or latency)
- [X] Tests (you added tests for code that already exists, or your new feature includes tests in this PR)
API Impact
None
Codecov Report
:x: Patch coverage is 58.58586% with 41 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 79.5%. Comparing base (92281a4) to head (c77a048).
Additional details and impacted files
```
@@           Coverage Diff            @@
##           develop   #5327    +/-  ##
=======================================
  Coverage     79.5%   79.5%
=======================================
  Files          816     817      +1
  Lines        72180   72209     +29
  Branches      8295    8273     -22
=======================================
+ Hits         57364   57394     +30
+ Misses       14816   14815      -1
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| include/xrpl/basics/CanProcess.h | 100.0% <100.0%> (ø) | |
| include/xrpl/protocol/LedgerHeader.h | 100.0% <ø> (ø) | |
| src/xrpld/app/consensus/RCLValidations.cpp | 72.9% <100.0%> (-1.6%) | :arrow_down: |
| src/xrpld/app/ledger/InboundLedgers.h | 100.0% <ø> (ø) | |
| src/xrpld/app/ledger/detail/TimeoutCounter.cpp | 88.4% <100.0%> (-0.3%) | :arrow_down: |
| src/xrpld/app/ledger/detail/TimeoutCounter.h | 100.0% <ø> (ø) | |
| src/xrpld/app/misc/HashRouter.h | 100.0% <ø> (ø) | |
| src/xrpld/app/misc/NetworkOPs.h | 100.0% <ø> (ø) | |
| src/xrpld/app/consensus/RCLConsensus.cpp | 62.1% <0.0%> (+0.7%) | :arrow_up: |
| src/xrpld/app/ledger/detail/InboundLedger.cpp | 36.7% <25.0%> (-0.2%) | :arrow_down: |
| ... and 3 more | | |
Unfortunately, this is not ready for prime time. I ran a simple test over the weekend, running this branch for some time, and running develop for some time.

What the data shows is:

- This branch ran for about 14 hours. In that time, it went out of sync 239 times (`239 NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions`), which is about 17 de-syncs per hour.
- `develop` ran for about 48 hours. In that time, it never dropped all the way down to "connected", and it went to "syncing" 43 times. At least one of those had to be during initial sync, so call it 42. Either way, that's < 1 de-sync per hour.
Clearly, something is misbehaving, so I introduced a problem somewhere. I may be able to take some time to find out where, but it's not my priority right now, so it may be a while.
```
$ for f in acquireAsync.txt upstream.txt ; do echo -e "\n$f" ; head -1 $f ; tail -1 $f ; loganalysis $f; done

acquireAsync.txt
2025-Apr-25 22:41:05.433965087 UTC Application:NFO Process starting: rippled-2.4.0+7d8411985cc62926834ac06198422cb101feca0e.DEBUG, Instance Cookie: 17903171231208261479
2025-Apr-26 12:32:38.881102671 UTC NetworkOPs:NFO STATE->full - endConsensus: check full
      4 appeared; following new file
      1 Application:NFO Process starting: rippled-#+7d8411985cc62926834ac06198422cb101feca0e.DEBUG, Instance Cookie: #
      4 become inaccessible: No such file or directory
      1 NetworkOPs:NFO STATE->connected - Heartbeat: sufficient peers
    291 NetworkOPs:NFO STATE->full - endConsensus: check full
      1 NetworkOPs:NFO STATE->syncing - Heartbeat: check connected
    296 NetworkOPs:NFO STATE->tracking - endConsensus: check tracking
      1 NetworkOPs:WRN STATE->connected - Heartbeat: check syncing
    239 NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions
      2 NetworkOPs:WRN STATE->syncing - check LCL: not on consensus ledger
     25 NetworkOPs:WRN STATE->syncing - consensusViewChange
     29 NetworkOPs:WRN STATE->syncing - updateOperatingMode: no positions

upstream.txt
2025-Apr-26 14:11:56.660217534 UTC Application:NFO Process starting: rippled-2.4.0+fa1e25abef7bf12bde87ba23528346853cef88b0.DEBUG, Instance Cookie: 7137182022843094303
2025-Apr-28 14:21:47.931018040 UTC NetworkOPs:NFO STATE->full
     16 appeared; following new file
      1 Application:NFO Process starting: rippled-#+fa1e25abef7bf12bde87ba23528346853cef88b0.DEBUG, Instance Cookie: #
     16 become inaccessible: No such file or directory
      1 NetworkOPs:NFO STATE->connected
     40 NetworkOPs:NFO STATE->full
     43 NetworkOPs:NFO STATE->syncing
     43 NetworkOPs:NFO STATE->tracking
```
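`loganalysis` above is a local script, so for anyone reproducing this comparison, here is a rough stand-in built from standard tools. It is an assumption about what the script does (extract the `NetworkOPs` state-transition messages and count distinct ones); the `sample.log` file and its contents are invented for demonstration.

```shell
# Build a tiny sample log (hypothetical data for demonstration only).
printf '%s\n' \
  'NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions' \
  'NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions' \
  'NetworkOPs:NFO STATE->full - endConsensus: check full' > sample.log

# Extract each state-transition message and count distinct occurrences,
# most frequent first -- similar in spirit to the loganalysis output above.
grep -o 'NetworkOPs:[A-Z]* STATE->.*' sample.log | sort | uniq -c | sort -rn
# counts: 2 for the WRN "no positions" line, 1 for the NFO "check full" line
```

Against a real debug log, piping through `sed 's/[0-9]\{1,\}/#/g'` first would also normalize run-specific numbers (cookies, build hashes) into `#`, matching the placeholder style seen in the counts above.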
This is ready for review. I accidentally ran the original code in the test mentioned in my previous comment. When I ran it again over the last few days, it had more de-syncs, but not unreasonably so: about 40 in the last 24 hours. Maybe still enough to be a problem, but not the disaster I observed earlier.
```
     38 NetworkOPs:NFO STATE->full - endConsensus: check full
     39 NetworkOPs:NFO STATE->tracking - endConsensus: check tracking
      4 NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions
      4 NetworkOPs:WRN STATE->syncing - check LCL: not on consensus ledger
     27 NetworkOPs:WRN STATE->syncing - consensusViewChange
      4 NetworkOPs:WRN STATE->syncing - updateOperatingMode: no positions
```