Periodically pause copying ledger nodes during online_delete

Open ximinez opened this issue 1 year ago • 4 comments

High Level Overview of Change

Mitigates disk write contention while old ledgers are being deleted, and specifically while the full ledger is being copied to the "new" node store.
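The idea can be pictured with a minimal sketch. This is not the code from the PR: `Node`, `copyWithBackOff`, and the `pauseEvery` interval are hypothetical names chosen for illustration, and the real change lives in SHAMapStoreImp. The sketch only shows the general technique of sleeping periodically inside a bulk copy loop so that other writers get a turn at the disk.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a ledger node; not a rippled type.
struct Node
{
    int id;
};

// Copies every node through `write`, sleeping for `backOff` after each
// `pauseEvery` nodes (assumed interval). Returns the number of pauses taken.
inline std::size_t
copyWithBackOff(
    std::vector<Node> const& nodes,
    std::function<void(Node const&)> const& write,
    std::size_t pauseEvery,
    std::chrono::milliseconds backOff)
{
    std::size_t pauses = 0;
    std::size_t copied = 0;
    for (auto const& node : nodes)
    {
        write(node);
        ++copied;
        // Yield the disk periodically instead of writing at full speed.
        if (pauseEvery != 0 && copied % pauseEvery == 0)
        {
            std::this_thread::sleep_for(backOff);
            ++pauses;
        }
    }
    return pauses;
}
```

A larger `backOff` or a smaller `pauseEvery` leaves more room for other writes at the price of a slower copy.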

Context of Change

PR #4503 (reverted by #4882) attempted to improve rippled performance by writing batches to NuDB asynchronously. However, it had an unintended side effect: when online_delete writes the entire ledger to disk, the buffer tends to fill up, which blocks new ledgers from being persisted.

Type of Change

  • [X] Bug fix (non-breaking change which fixes an issue)
  • [X] Performance (increase or change in throughput and/or latency)

Before / After

Cost: This change will cause online_delete to take significantly longer to copy the full ledger from the old node store to the new one, which is the last significant step in the process.

Benefit: rippled should be much less likely to desync, and should put less load on the disk during online_delete.
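To get a rough feel for the cost side, the added time can be estimated as the number of pauses multiplied by the back-off duration. The sketch below assumes the copy pauses once every `pauseEvery` nodes; that interval, the function name, and the node count in the usage note are illustrative assumptions, not values taken from the PR.

```cpp
#include <chrono>
#include <cstdint>

// Estimates the extra wall-clock time introduced by the pauses alone,
// assuming one pause of `backOff` per `pauseEvery` copied nodes.
inline std::chrono::milliseconds
addedCopyTime(
    std::uint64_t nodeCount,
    std::uint64_t pauseEvery,
    std::chrono::milliseconds backOff)
{
    if (pauseEvery == 0)
        return std::chrono::milliseconds{0};
    auto const pauses =
        static_cast<std::chrono::milliseconds::rep>(nodeCount / pauseEvery);
    return backOff * pauses;
}
```

For example, with a hypothetical 1,000,000 nodes, a pause every 1,000 nodes, and a 100 ms back-off, the pauses alone would add 1,000 × 100 ms = 100 seconds to the copy.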

online_delete is still a demanding process, so this won't be a panacea, but it should be a significant improvement. (The size of the improvement is still to be measured.)

Performance details

  1. This is an improvement to existing functionality, which could be considered a bug fix.
  2. The change impacts node store writes. Specifically, it should reduce contention between online_delete and writing new ledgers.
  3. The impact should be measured in a couple of different ways.
    1. rippled should struggle less to stay synced, and to perform its other functions, during online_delete.
    2. rippled should put less strain/demand on the disk during online_delete.
  4. This change affects concurrent processing, in the sense that multiple threads are writing to the node store, especially during online_delete.

Note that back_off_milliseconds is configurable, defaulting to 100. Node operators can de-prioritize online_delete operations more by increasing this value to whatever they are comfortable with.
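As an illustration of the default described above, the following sketch reads `back_off_milliseconds` from a set of key/value settings and falls back to 100 when the key is absent. The `readBackOff` helper and the use of a plain `std::map` are assumptions made for the example; rippled's own configuration classes are not shown here.

```cpp
#include <chrono>
#include <map>
#include <string>

// Returns the configured back-off, or 100 ms if the key is not set.
// `section` is a hypothetical stand-in for a parsed config section.
inline std::chrono::milliseconds
readBackOff(std::map<std::string, std::string> const& section)
{
    auto const it = section.find("back_off_milliseconds");
    if (it == section.end())
        return std::chrono::milliseconds{100};
    // std::stoll throws if the value is not a valid integer.
    return std::chrono::milliseconds{std::stoll(it->second)};
}
```

An operator who wants online_delete to yield more often would simply set a larger value for the key.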

ximinez avatar Jan 31 '24 21:01 ximinez

Internal tracker: https://ripplelabs.atlassian.net/browse/RPFC-107

ximinez avatar Jan 31 '24 21:01 ximinez

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 74.9%. Comparing base (c19a88f) to head (5d85323).

Files                                   Patch %   Lines
src/xrpld/app/misc/SHAMapStoreImp.cpp   25.0%     3 Missing :warning:

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #4907   +/-   ##
=======================================
  Coverage     74.8%   74.9%           
=======================================
  Files          768     768           
  Lines        63134   63138    +4     
  Branches      8867    8850   -17     
=======================================
+ Hits         47244   47265   +21     
+ Misses       15890   15873   -17     

Files                                   Coverage Δ
src/xrpld/app/misc/SHAMapStoreImp.h     96.3% <100.0%> (ø)
src/xrpld/app/misc/SHAMapStoreImp.cpp   74.5% <25.0%> (+<0.1%) :arrow_up:

... and 5 files with indirect coverage changes

codecov-commenter avatar Jan 31 '24 22:01 codecov-commenter

Note: As of May 28, 2024 - perf testing is still in progress.

intelliot avatar May 31 '24 18:05 intelliot