
refactor: acquireAsync will dispatch a job, not be dispatched

Open · ximinez opened this pull request 10 months ago · 3 comments

High Level Overview of Change

This PR does a few different things:

  1. Adds a new RAII class CanProcess to use where needed for job deduplication.
  2. Adds and improves logging across a variety of classes. Uses the new to_short_string(base_uint) function to help disambiguate some messages about peers and ledgers.
  3. Exits InboundLedger::onTimer without sending new requests if the complete ledger is available. (Duh.)
  4. Changes the interface and usage of InboundLedgers::acquireAsync so that it dispatches a job, and is not called from a dispatched job. This will reduce the number of jobs spun up only to immediately exit.

Context of Change

This is the safe subset of changes from #5301 (which has problems), plus the change to make acquireAsync dispatch a job.

Type of Change

  • [X] Refactor (non-breaking change that only restructures code)
  • [X] Performance (increase or change in throughput and/or latency)
  • [X] Tests (you added tests for code that already exists, or your new feature includes tests in this PR)

API Impact

None

— ximinez, Feb 28 '25 18:02

Codecov Report

:x: Patch coverage is 58.58586% with 41 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 79.5%. Comparing base (92281a4) to head (c77a048).

Files with missing lines Patch % Lines
src/xrpld/app/misc/NetworkOPs.cpp 21.4% 22 Missing :warning:
src/xrpld/app/ledger/detail/InboundLedgers.cpp 54.2% 11 Missing :warning:
src/xrpld/app/ledger/detail/InboundLedger.cpp 25.0% 3 Missing :warning:
src/xrpld/app/ledger/detail/LedgerMaster.cpp 0.0% 3 Missing :warning:
src/xrpld/app/consensus/RCLConsensus.cpp 0.0% 2 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #5327   +/-   ##
=======================================
  Coverage     79.5%   79.5%           
=======================================
  Files          816     817    +1     
  Lines        72180   72209   +29     
  Branches      8295    8273   -22     
=======================================
+ Hits         57364   57394   +30     
+ Misses       14816   14815    -1     
Files with missing lines Coverage Δ
include/xrpl/basics/CanProcess.h 100.0% <100.0%> (ø)
include/xrpl/protocol/LedgerHeader.h 100.0% <ø> (ø)
src/xrpld/app/consensus/RCLValidations.cpp 72.9% <100.0%> (-1.6%) :arrow_down:
src/xrpld/app/ledger/InboundLedgers.h 100.0% <ø> (ø)
src/xrpld/app/ledger/detail/TimeoutCounter.cpp 88.4% <100.0%> (-0.3%) :arrow_down:
src/xrpld/app/ledger/detail/TimeoutCounter.h 100.0% <ø> (ø)
src/xrpld/app/misc/HashRouter.h 100.0% <ø> (ø)
src/xrpld/app/misc/NetworkOPs.h 100.0% <ø> (ø)
src/xrpld/app/consensus/RCLConsensus.cpp 62.1% <0.0%> (+0.7%) :arrow_up:
src/xrpld/app/ledger/detail/InboundLedger.cpp 36.7% <25.0%> (-0.2%) :arrow_down:
... and 3 more

... and 2 files with indirect coverage changes



— codecov[bot], Feb 28 '25 20:02

Unfortunately, this is not ready for prime time. I ran a simple test over the weekend, running this branch for some time, and running develop for some time.

What the data shows is:

  • This branch ran for about 14 hours. In that time, it went out of sync 239 times. (239 NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions), which is about 17 de-syncs per hour.
  • develop ran for about 48 hours. In that time, it never dropped all the way down to "connected", and it went to "syncing" 43 times. At least one of those had to be during initial sync, so call it 42. Either way, that's < 1 de-sync per hour.

Clearly, something is misbehaving, so I introduced a problem somewhere. I may be able to take some time to find out where, but it's not my priority right now, so it may be a while.

$ for f in acquireAsync.txt upstream.txt ; do echo -e "\n$f" ; head -1 $f ; tail -1 $f ; loganalysis $f; done

acquireAsync.txt
2025-Apr-25 22:41:05.433965087 UTC Application:NFO Process starting: rippled-2.4.0+7d8411985cc62926834ac06198422cb101feca0e.DEBUG, Instance Cookie: 17903171231208261479
2025-Apr-26 12:32:38.881102671 UTC NetworkOPs:NFO STATE->full - endConsensus: check full
      4 appeared;  following new file
      1 Application:NFO Process starting: rippled-#+7d8411985cc62926834ac06198422cb101feca0e.DEBUG, Instance Cookie: #
      4 become inaccessible: No such file or directory
      1 NetworkOPs:NFO STATE->connected - Heartbeat: sufficient peers
    291 NetworkOPs:NFO STATE->full - endConsensus: check full
      1 NetworkOPs:NFO STATE->syncing - Heartbeat: check connected
    296 NetworkOPs:NFO STATE->tracking - endConsensus: check tracking
      1 NetworkOPs:WRN STATE->connected - Heartbeat: check syncing
    239 NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions
      2 NetworkOPs:WRN STATE->syncing - check LCL: not on consensus ledger
     25 NetworkOPs:WRN STATE->syncing - consensusViewChange
     29 NetworkOPs:WRN STATE->syncing - updateOperatingMode: no positions

upstream.txt
2025-Apr-26 14:11:56.660217534 UTC Application:NFO Process starting: rippled-2.4.0+fa1e25abef7bf12bde87ba23528346853cef88b0.DEBUG, Instance Cookie: 7137182022843094303
2025-Apr-28 14:21:47.931018040 UTC NetworkOPs:NFO STATE->full
     16 appeared;  following new file
      1 Application:NFO Process starting: rippled-#+fa1e25abef7bf12bde87ba23528346853cef88b0.DEBUG, Instance Cookie: #
     16 become inaccessible: No such file or directory
      1 NetworkOPs:NFO STATE->connected
     40 NetworkOPs:NFO STATE->full
     43 NetworkOPs:NFO STATE->syncing
     43 NetworkOPs:NFO STATE->tracking

— ximinez, Apr 28 '25 15:04

This is ready for review. I accidentally ran the original code in the test mentioned in my previous comment. When I ran it again over the last few days, it had more de-syncs than develop, but not unreasonably many: about 40 in the last 24 hours. Maybe still enough to be a problem, but not the disaster I observed earlier.

     38 NetworkOPs:NFO STATE->full - endConsensus: check full
     39 NetworkOPs:NFO STATE->tracking - endConsensus: check tracking
      4 NetworkOPs:WRN STATE->connected - updateOperatingMode: no positions
      4 NetworkOPs:WRN STATE->syncing - check LCL: not on consensus ledger
     27 NetworkOPs:WRN STATE->syncing - consensusViewChange
      4 NetworkOPs:WRN STATE->syncing - updateOperatingMode: no positions

— ximinez, May 01 '25 17:05