rippled icon indicating copy to clipboard operation
rippled copied to clipboard

Reduce peer message traffic for ledger data

Open ximinez opened this issue 1 year ago • 1 comments

High Level Overview of Change

Several changes to help reduce message traffic and improve logging and visibility.

  • Suppress duplicate TMGetLedger and TMLedgerData messages, reducing the overhead of processing those messages.
  • Reduce the number of those messages sent to peers.
  • Improve logging related to those messages, as well as a few other areas.
  • Introduces a new protocol version to gate a new feature on the TMLedgerData message, which allows multiple identical requests to be replied to with one message.

These changes are organized into several commits which are organized logically separating each functional operation. They can be merged as-is, or squashed.

Context of Change

Analysis of the issue that led to #5115 identified heavy TMGetLedger request and TMLedgerData response traffic between nodes leading up to the syncing incidents. It was later determined that those messages were more a symptom of the problem, and not the root cause. However, leading up to identification of the root cause, these changes were being implemented to cut down on those messages, detect duplicates, etc. That reduction in unnecessary traffic is still valuable, so it's being included here.

Type of Change

  • [X] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [X] Performance (increase or change in throughput and/or latency)

API Impact

None.

Test Plan

  1. Verify that nodes are at least as reliable at staying in sync as they are now.
  2. Verify that the indicated messages are sent less frequently by upgraded nodes.
  3. Verify that peers that both have this change send fewer ledger messages to each other.

Future Tasks

I am still working on a follow-up to #4764 that makes use of these changes and other improvements to reduce the number of requests initiated by a given node in the first place.

ximinez avatar Sep 11 '24 13:09 ximinez

Codecov Report

Attention: Patch coverage is 24.34988% with 320 lines in your changes missing coverage. Please review.

Project coverage is 77.9%. Comparing base (7c9d652) to head (5438161). Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
src/xrpld/overlay/detail/PeerImp.cpp 0.0% 210 Missing :warning:
src/xrpld/overlay/detail/ProtocolMessage.h 0.0% 32 Missing :warning:
src/xrpld/app/misc/NetworkOPs.cpp 21.4% 22 Missing :warning:
src/xrpld/overlay/detail/PeerSet.cpp 20.8% 19 Missing :warning:
src/xrpld/app/ledger/detail/InboundLedgers.cpp 67.9% 17 Missing :warning:
src/xrpld/app/ledger/detail/InboundLedger.cpp 54.5% 5 Missing :warning:
src/xrpld/app/ledger/InboundLedger.h 60.0% 4 Missing :warning:
src/xrpld/app/misc/HashRouter.cpp 60.0% 4 Missing :warning:
src/xrpld/app/ledger/detail/LedgerMaster.cpp 0.0% 3 Missing :warning:
src/xrpld/app/misc/HashRouter.h 77.8% 2 Missing :warning:
... and 2 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #5126     +/-   ##
=========================================
- Coverage     78.2%   77.9%   -0.2%     
=========================================
  Files          790     791      +1     
  Lines        67653   67975    +322     
  Branches      8161    8321    +160     
=========================================
+ Hits         52879   52968     +89     
- Misses       14774   15007    +233     
Files with missing lines Coverage Δ
include/xrpl/basics/base_uint.h 96.9% <100.0%> (+<0.1%) :arrow_up:
include/xrpl/protocol/LedgerHeader.h 100.0% <ø> (ø)
src/xrpld/app/ledger/detail/TimeoutCounter.cpp 88.4% <100.0%> (+1.3%) :arrow_up:
src/xrpld/app/ledger/detail/TimeoutCounter.h 100.0% <ø> (ø)
src/xrpld/app/misc/NetworkOPs.h 100.0% <ø> (ø)
src/xrpld/overlay/Peer.h 100.0% <ø> (ø)
src/xrpld/overlay/detail/PeerImp.h 12.8% <ø> (ø)
src/xrpld/overlay/detail/ProtocolVersion.cpp 86.4% <ø> (ø)
include/xrpl/basics/CanProcess.h 95.5% <95.5%> (ø)
src/xrpld/app/consensus/RCLConsensus.cpp 65.4% <0.0%> (ø)
... and 10 more

... and 6 files with indirect coverage changes

Impacted file tree graph

codecov[bot] avatar Sep 11 '24 14:09 codecov[bot]

@Bronek Any outstanding issues for @ximinez to address?

bthomee avatar Feb 10 '25 17:02 bthomee

@Bronek Any outstanding issues for @ximinez to address?

Yes, just added some. The loop and callback inside critical section can be probably resolved with an explanation e.g. a comment in code. The assert ones are trivial to fix.

Bronek avatar Feb 11 '25 18:02 Bronek

Commit message:

Reduce duplicate peer traffic for ledger data (#5126)

- Drop duplicate outgoing TMGetLedger messages per peer
  - Allow a retry after 30s in case of peer or network congestion.
  - Addresses RIPD-1870
  - (Changes levelization. That is not desirable, and will need to be fixed.)
- Drop duplicate incoming TMGetLedger messages per peer
  - Allow a retry after 15s in case of peer or network congestion.
  - The requestCookie is ignored when computing the hash, thus increasing
    the chances of detecting duplicate messages.
  - With duplicate messages, keep track of the different requestCookies
    (or lack of cookie). When work is finally done for a given request,
    send the response to all the peers that are waiting on the request,
    sending one message per peer, including all the cookies and
    a "directResponse" flag indicating the data is intended for the
    sender, too.
  - Addresses RIPD-1871
- Drop duplicate incoming TMLedgerData messages
  - Addresses RIPD-1869
- Improve logging related to ledger acquisition
- Class "CanProcess" to keep track of processing of distinct items

---------

Co-authored-by: Valentin Balaschenko <[email protected]>

ximinez avatar Feb 13 '25 23:02 ximinez