rippled icon indicating copy to clipboard operation
rippled copied to clipboard

Pause online delete if there any any gaps in recent ledger history

Open ximinez opened this issue 6 months ago • 3 comments

High Level Overview of Change

Adds another check to the online deletion process to prevent ledger gaps from persisting if the node goes out of sync.

Context of Change

Some nodes struggle to stay in sync when online delete is running due to the added I/O burden. This has proven difficult to avoid without upgrading hardware. If the process takes a particularly long time, which it can with large datasets, the node can end up with many gaps in the recent ledger history. Those gaps will not be filled until the node is idle enough for low-priority historical ledger requests to run.

Note that this is a trade-off. The online deletion may take longer to complete, but the node will stay more up-to-date.

Type of Change

  • [X] New feature (non-breaking change which adds functionality)

ximinez avatar Jul 02 '25 21:07 ximinez

Codecov Report

:x: Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review. :white_check_mark: Project coverage is 79.1%. Comparing base (41c1be2) to head (950434b).

Files with missing lines Patch % Lines
src/xrpld/app/ledger/detail/LedgerMaster.cpp 90.9% 1 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #5531   +/-   ##
=======================================
  Coverage     79.1%   79.1%           
=======================================
  Files          836     836           
  Lines        71245   71278   +33     
  Branches      8324    8311   -13     
=======================================
+ Hits         56358   56393   +35     
+ Misses       14887   14885    -2     
Files with missing lines Coverage Δ
src/xrpld/app/ledger/LedgerMaster.h 73.3% <ø> (ø)
src/xrpld/app/misc/SHAMapStoreImp.cpp 79.5% <100.0%> (+4.1%) :arrow_up:
src/xrpld/app/misc/SHAMapStoreImp.h 95.8% <ø> (ø)
src/xrpld/app/ledger/detail/LedgerMaster.cpp 42.1% <90.9%> (+0.5%) :arrow_up:

... and 4 files with indirect coverage changes

Impacted file tree graph

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jul 02 '25 23:07 codecov[bot]

I've been running this code locally off and on, and one thing I notice is LedgerMaster's validated ledger index is updated before the corresponding ledger is fully updated and available system-wide (i.e. in complete_ledgers). Since the last validated ledger index is used in the gap detection logic as the end of the expected available range, it sees a gap at the end the range and pauses. For example, gap detection is looking for 100-105, but complete ledgers is 50-104.

I think this is ok, because that processing includes writing the ledger to disk, and is extremely important. So it makes sense to wait for it.

I was concerned that this would cause rotation to stall indefinitely, and while it does pause a lot, it doesn't pause on every ledger, so it does make progress and eventually finish. I am, however, changing the default recoveryWaitTime_ from 5 seconds to 1. That's still a lot of time, the check in between takes almost no time, and it will give rotation a lot more time to run in between ledgers.

ximinez avatar Jul 10 '25 16:07 ximinez

Not sure how, but this branch got some changes from #5644 appended to it. I forced push to remove them and re-merge from develop.

ximinez avatar Aug 25 '25 18:08 ximinez