rippled icon indicating copy to clipboard operation
rippled copied to clipboard

Improve transaction relay logic

Open ximinez opened this issue 1 year ago • 4 comments
trafficstars

High Level Overview of Change

This PR, if merged, will improve transaction relay logic around a few edge cases.

(I'll write a single commit message later, but before this PR is squashed and merged.)

Context of Change

A few months ago, while examining some of the issues around the 2.0.0 release, and auditing transaction relay code, I identified a few areas with potential for improvement.

Type of Change

  • [X] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [X] Tests (you added tests for code that already exists, or your new feature included in this PR)

Before / After

This PR is divided into four mostly independent changes.

  1. "Decrease shouldRelay limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point the HashRouter entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).
  2. "Give a transaction more chances to be retried." Will put a transaction into LedgerMaster's held transactions if the transaction gets a ter, tel, or tef result. Old behavior was just ter.
    • Additionally, to prevent a transaction from being repeatedly held indefinitely, it must meet some extra conditions. (Documented in a comment in the code.)
  3. "Pop all transactions with sequential sequences, or tickets." When a transaction is processed successfully, currently, one held transaction for the same account (if any) will be popped out of the held transactions list, and queued up for the next transaction batch. This change pops all transactions for the account, but only if they have sequential sequences (for non-ticket transactions) or use a ticket. This issue was identified from interactions with @mtrippled's #4504, which was merged, but unfortunately reverted later by #4852. When the batches were spaced out, it could potentially take a very long time for a large number of held transactions for an account to get processed through. However, whether batched or not, this change will help get held transactions cleared out, particularly if a missing earlier transaction is what held them up.
  4. "Process held transactions through existing NetworkOPs batching." In the current processing, at the end of each consensus round, all held transactions are directly applied to the open ledger, then the held list is reset. This bypasses all of the logic in NetworkOPs::apply which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.

ximinez avatar Apr 10 '24 23:04 ximinez

Codecov Report

Attention: Patch coverage is 87.60331% with 15 lines in your changes missing coverage. Please review.

Project coverage is 78.1%. Comparing base (3502df2) to head (0c0362f). Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
src/xrpld/app/misc/NetworkOPs.cpp 82.6% 15 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #4985   +/-   ##
=======================================
  Coverage     78.1%   78.1%           
=======================================
  Files          795     795           
  Lines        68582   68663   +81     
  Branches      8278    8283    +5     
=======================================
+ Hits         53574   53649   +75     
- Misses       15008   15014    +6     
Files with missing lines Coverage Δ
src/xrpld/app/ledger/LocalTxs.h 100.0% <ø> (ø)
src/xrpld/app/ledger/detail/LedgerMaster.cpp 43.9% <100.0%> (-0.1%) :arrow_down:
src/xrpld/app/ledger/detail/LocalTxs.cpp 100.0% <100.0%> (ø)
src/xrpld/app/main/Application.cpp 69.3% <100.0%> (ø)
src/xrpld/app/misc/CanonicalTXSet.cpp 100.0% <100.0%> (ø)
src/xrpld/app/misc/HashRouter.cpp 100.0% <100.0%> (ø)
src/xrpld/app/misc/HashRouter.h 100.0% <100.0%> (ø)
src/xrpld/app/misc/NetworkOPs.h 100.0% <ø> (ø)
src/xrpld/app/misc/Transaction.h 100.0% <ø> (ø)
src/xrpld/app/misc/NetworkOPs.cpp 69.6% <82.6%> (+0.4%) :arrow_up:

... and 5 files with indirect coverage changes

Impacted file tree graph

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Apr 11 '24 23:04 codecov-commenter

would you like to open the request to have @sophiax851 take a look at this?

intelliot avatar Jul 08 '24 17:07 intelliot

would you like to open the request to have @sophiax851 take a look at this?

Done: https://ripplelabs.atlassian.net/browse/RPFC-173

ximinez avatar Jul 09 '24 22:07 ximinez

@ximinez I'm reading the changes in this PR and trying to think if our release payment load testing case would come across the processing path modified by this PR. Based on the highlevel overview:

  1. The 5mins to 30sec change. I think this change is definitely valid, but the transactions in our test case wouldn't be held that long unless something configured very wrong
  2. "Give a transaction more chances to be retried." This one won't be invoked either in our load, hopefully we have integration test case to cover the testing in functionality
  3. "Pop all transactions with sequential sequences, or tickets." To properly test this, we need to somehow simulate a case where a lot of transactions were held for the same account but somehow there's missing sequence(s) among them. This would never happen if our CH node is doing its job, right? Maybe I'm missing something?
  4. "Process held transactions through existing NetworkOPs batching." I think the only way that we could possibly test this one is to send transactions load directly to the validators? In normal case, transactions being reapplied in the validators were previously received via broadcasting from other nodes, right? Do we want to add additional load to the network if we broadcast these transactions again?

sophiax851 avatar Jul 11 '24 17:07 sophiax851

All the changes here are an improvement. There's one open question "does this need amendment ?" and it can be most likely addressed by a PR comment here (assuming the answer is "no", which I feel is likely).

No. 😀 An amendment is not needed for these changes. I am running some "due diligence" on my local node right now to make sure there are no obvious negative side effects like de-syncs.

ximinez avatar Apr 28 '25 16:04 ximinez