rippled
Improve transaction relay logic
High Level Overview of Change
This PR, if merged, will improve transaction relay logic around a few edge cases.
(I'll write a single commit message later, but before this PR is squashed and merged.)
Context of Change
A few months ago, while examining some of the issues around the 2.0.0 release, and auditing transaction relay code, I identified a few areas with potential for improvement.
Type of Change
- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [X] Tests (you added tests for code that already exists, or your new feature included in this PR)
Before / After
This PR is divided into four mostly independent changes.
- "Decrease
shouldRelaylimit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point theHashRouterentry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently). - "Give a transaction more chances to be retried." Will put a transaction into
LedgerMaster's held transactions if the transaction gets ater,tel, ortefresult. Old behavior was justter.- Additionally, to prevent a transaction from being repeatedly held indefinitely, it must meet some extra conditions. (Documented in a comment in the code.)
- "Pop all transactions with sequential sequences, or tickets." When a transaction is processed successfully, currently, one held transaction for the same account (if any) will be popped out of the held transactions list, and queued up for the next transaction batch. This change pops all transactions for the account, but only if they have sequential sequences (for non-ticket transactions) or use a ticket. This issue was identified from interactions with @mtrippled's #4504, which was merged, but unfortunately reverted later by #4852. When the batches were spaced out, it could potentially take a very long time for a large number of held transactions for an account to get processed through. However, whether batched or not, this change will help get held transactions cleared out, particularly if a missing earlier transaction is what held them up.
- "Process held transactions through existing NetworkOPs batching." In the current processing, at the end of each consensus round, all held transactions are directly applied to the open ledger, then the held list is reset. This bypasses all of the logic in
NetworkOPs::applywhich, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.
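
To make the second item above concrete, here is a minimal, self-contained sketch of the broadened hold condition. It is illustrative only: the enum and the names `shouldHoldForRetry` and `passesHoldLimits` are placeholders rather than the PR's code; in rippled the result categories would come from the `TER` helpers (`isTerRetry`, `isTelLocal`, `isTefFailure`).

```cpp
// Self-contained illustration only, not the PR's code. In rippled the
// categories would come from the TER result-code helpers (isTerRetry,
// isTelLocal, isTefFailure in TER.h); they are stubbed as an enum here so
// the example stands alone.
enum class ResultCategory { tes, tec, tem, ter, tel, tef };

// Decide whether a transaction should go into the held list for retry.
// passesHoldLimits stands in for the extra conditions the PR documents to
// keep a transaction from being held repeatedly and indefinitely.
bool
shouldHoldForRetry(ResultCategory category, bool passesHoldLimits)
{
    switch (category)
    {
        case ResultCategory::ter:  // old behavior: only this case
        case ResultCategory::tel:  // new: local errors may also be retried
        case ResultCategory::tef:  // new: failures may also be retried
            return passesHoldLimits;
        default:
            return false;
    }
}
```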
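
The third item can be sketched the same way. This is a hypothetical stand-in, not the PR's implementation (which presumably works with `CanonicalTXSet` and the held transactions in `NetworkOPs`/`LedgerMaster`): it pops ticketed transactions plus the run of consecutively sequenced transactions starting right after the transaction that just applied, and leaves anything beyond a sequence gap held.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical representation of a held transaction; the real change works
// on rippled's transaction types, not this struct.
struct HeldTx
{
    std::uint32_t sequence = 0;  // 0 when the transaction uses a ticket
    bool usesTicket = false;
};

// Pop every held transaction for one account that is ready to be queued:
// ticketed transactions (which do not consume the account sequence), and
// non-ticket transactions whose sequences continue consecutively from the
// transaction that was just applied. A gap in the sequence numbers stops
// the run, so later transactions stay held.
std::vector<HeldTx>
popReadyHeldTxs(
    std::multimap<std::uint32_t, HeldTx>& heldForAccount,  // keyed by sequence
    std::uint32_t lastAppliedSeq)
{
    std::vector<HeldTx> ready;
    std::uint32_t nextSeq = lastAppliedSeq + 1;

    for (auto it = heldForAccount.begin(); it != heldForAccount.end();)
    {
        if (it->second.usesTicket)
        {
            ready.push_back(it->second);
            it = heldForAccount.erase(it);
        }
        else if (it->second.sequence == nextSeq)
        {
            ready.push_back(it->second);
            ++nextSeq;
            it = heldForAccount.erase(it);
        }
        else
        {
            ++it;  // out of order or past a gap: keep holding it
        }
    }
    return ready;
}
```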
Codecov Report
Attention: Patch coverage is 87.60331% with 15 lines in your changes missing coverage. Please review.
Project coverage is 78.1%. Comparing base (`3502df2`) to head (`0c0362f`). Report is 1 commit behind head on develop.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/xrpld/app/misc/NetworkOPs.cpp | 82.6% | 15 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## develop #4985 +/- ##
=======================================
Coverage 78.1% 78.1%
=======================================
Files 795 795
Lines 68582 68663 +81
Branches 8278 8283 +5
=======================================
+ Hits 53574 53649 +75
- Misses 15008 15014 +6
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/xrpld/app/ledger/LocalTxs.h | 100.0% <ø> (ø) | |
| src/xrpld/app/ledger/detail/LedgerMaster.cpp | 43.9% <100.0%> (-0.1%) | :arrow_down: |
| src/xrpld/app/ledger/detail/LocalTxs.cpp | 100.0% <100.0%> (ø) | |
| src/xrpld/app/main/Application.cpp | 69.3% <100.0%> (ø) | |
| src/xrpld/app/misc/CanonicalTXSet.cpp | 100.0% <100.0%> (ø) | |
| src/xrpld/app/misc/HashRouter.cpp | 100.0% <100.0%> (ø) | |
| src/xrpld/app/misc/HashRouter.h | 100.0% <100.0%> (ø) | |
| src/xrpld/app/misc/NetworkOPs.h | 100.0% <ø> (ø) | |
| src/xrpld/app/misc/Transaction.h | 100.0% <ø> (ø) | |
| src/xrpld/app/misc/NetworkOPs.cpp | 69.6% <82.6%> (+0.4%) | :arrow_up: |
Would you like to open the request to have @sophiax851 take a look at this?
Done: https://ripplelabs.atlassian.net/browse/RPFC-173
@ximinez I'm reading the changes in this PR and trying to think about whether our release payment load testing case would come across the processing paths modified by this PR. Based on the high-level overview:
- The 5-minute to 30-second change. I think this change is definitely valid, but the transactions in our test case wouldn't be held that long unless something is configured very wrong.
- "Give a transaction more chances to be retried." This one won't be invoked either in our load, hopefully we have integration test case to cover the testing in functionality
- "Pop all transactions with sequential sequences, or tickets." To properly test this, we need to somehow simulate a case where a lot of transactions were held for the same account but somehow there's missing sequence(s) among them. This would never happen if our CH node is doing its job, right? Maybe I'm missing something?
- "Process held transactions through existing NetworkOPs batching." I think the only way that we could possibly test this one is to send transactions load directly to the validators? In normal case, transactions being reapplied in the validators were previously received via broadcasting from other nodes, right? Do we want to add additional load to the network if we broadcast these transactions again?
All the changes here are an improvement. There's one open question, "does this need an amendment?", and it can most likely be addressed by a PR comment here (assuming the answer is "no", which I feel is likely).
No. 😀 An amendment is not needed for these changes. I am running some "due diligence" on my local node right now to make sure there are no obvious negative side effects like de-syncs.