Exclude partition reward accounts in account hash calculation
Problem
We can run the validator against mainnet with the new partitioned rewards code by enabling the --partitioned-epoch-rewards-force-enable-single-slot CLI argument.
Recently, my test node running with this CLI argument failed on an account hash mismatch. It turns out that the mismatch comes from the new partition reward account introduced at the epoch boundary, i.e. https://github.com/solana-labs/solana/pull/34624.
In order for this CLI argument to continue to work after https://github.com/solana-labs/solana/pull/34624, we need to ignore the epoch partition reward accounts in the account hash calculation.
Summary of Changes
Ignore partition reward accounts in the account hash calculation when we force-enable partitioned rewards.
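Conceptually, the exclusion is just a filter on which pubkeys feed the accounts hash. Below is a minimal sketch of the idea, for illustration only; the set of partition-data PDAs is represented by a plain `HashSet` here, not the actual accounts-db plumbing:

```rust
use std::collections::HashSet;
use solana_sdk::pubkey::Pubkey;

/// Sketch: decide whether an account participates in the accounts hash.
/// `forced_partition_data_accounts` stands in for however the node tracks the
/// epoch-rewards partition-data PDAs it created locally (see #34624).
fn include_in_accounts_hash(
    pubkey: &Pubkey,
    partitioned_rewards_force_enabled: bool,
    forced_partition_data_accounts: &HashSet<Pubkey>,
) -> bool {
    // Skip these accounts only when the rewards code is force-enabled ahead of
    // the feature gate; the rest of the cluster does not have them, so hashing
    // them would make this node's accounts hash diverge. Once the feature is
    // activated cluster-wide, every node has the accounts and they hash normally.
    !(partitioned_rewards_force_enabled && forced_partition_data_accounts.contains(pubkey))
}
```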
Fixes #
I have restarted my test node with this PR against mainnet.
Currently, we are at 10% of epoch 561.
Block height: 223259936
Slot: 242398091
Epoch: 561
Transaction Count: 262218377112
Epoch Slot Range: [242352000..242784000)
Epoch Completed Percent: 10.669%
Epoch Completed Slots: 46091/432000 (385909 remaining)
Epoch Completed Time: 5h 25m 31s/2days 2h 1m 9s (1day 20h 35m 38s remaining)
We will see if this fix works in 2 days.
Codecov Report
Attention: 5 lines in your changes are missing coverage. Please review.
Comparison is base (f9bfb60) 81.7% compared to head (242f79f) 81.7%. Report is 36 commits behind head on master.
:exclamation: Current head 242f79f differs from pull request most recent head 54918e0. Consider uploading reports for the commit 54918e0 to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #34809 +/- ##
========================================
Coverage 81.7% 81.7%
========================================
Files 826 825 -1
Lines 223371 223302 -69
========================================
+ Hits 182578 182629 +51
+ Misses 40793 40673 -120
Test node with this PR runs successfully across the epoch boundary!
Block height: 223643554
Slot: 242798190
Epoch: 562
Transaction Count: 262687685231
Epoch Slot Range: [242784000..243216000)
Epoch Completed Percent: 3.285%
Epoch Completed Slots: 14190/432000 (417810 remaining)
Epoch Completed Time: 1h 38m 43s/2days 1h 34m 38s (1day 23h 55m 55s remaining)
Does this bug impact the partitioned epoch rewards feature gate itself, or only the --partitioned-epoch-rewards-force-enable-single-slot CLI flag?
No, it doesn't. It only happens when --partitioned-epoch-rewards-force-enable-single-slot is used before the feature is activated.
Verify that this PR works
1. Running jw14 without the PR results in a crash at the epoch boundary.
Slot: 243418619
Epoch: 563
Transaction Count: 263412977100
Epoch Slot Range: [243216000..243648000)
Epoch Completed Percent: 46.903%
Epoch Completed Slots: 202619/432000 (229381 remaining)
Epoch Completed Time: 23h 20m 28s/2days 1h 27m 54s (1day 2h 7m 26s remaining)
thread 'solReplayStage' panicked at core/src/replay_stage.rs:1414:25:
We have tried to repair duplicate slot: 243216000 more than 10 times and are unable to freeze a block with bankhash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF, instead we have a block with bankhash Some(Fg8G6EjERZmTBSn8rKCdTRqmHVx6JE4BouyRHeQZYnMS). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting
stack backtrace:
0: rust_begin_unwind
at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
2: hashbrown::map::HashMap<K,V,S,A>::retain
3: solana_core::replay_stage::ReplayStage::dump_then_repair_correct_slots
4: solana_core::replay_stage::ReplayStage::new::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2024-01-21T15:48:42.256475777Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at core/src/replay_stage.rs:1414:25:
We have tried to repair duplicate slot: 243216000 more than 10 times and are unable to freeze a block with bankhash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF, instead we have a block with bankhash Some(Fg8G6EjERZmTBSn8rKCdTRqmHVx6JE4BouyRHeQZYnMS). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting" location="core/src/replay_stage.rs:1414:25" version="1.18.0 (src:00000000; feat:4046558620, client:SolanaLabs)"
This is the log from jw14 running without this PR at the most recent epoch boundary. At the epoch boundary slot 243216000, the node crashes because of the bank hash mismatch.
2. Running ledger-tool with the PR on jw14 completes successfully.
./cargo run --release --bin solana-ledger-tool -- --ledger ~/ledger verify --partitioned-epoch-rewards-force-enable-single-slot --halt-at-slot 243216000 2>&1 | tee verify.log
[2024-01-22T15:53:39.518831156Z INFO solana_runtime::bank] bank frozen: 243216000 hash: EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF accounts_delta: 7pNJmQoSte2DL18ZMgKfsz3mZYURgeFkTo7Et9GSoFWG signature_count: 0 last_blockhash: ES2RQg1vA1Vgpz1qmDgWB2KJyGC4jivHr2PRoStSh5Ra capitalization: 567729394157396052, stats: BankHashStats { num_updated_accounts: 829464, num_removed_accounts: 1, num_lamports_stored: 391291053032340887, total_data_len: 173784497, num_executable_accounts: 2 }
The bank hash EYc7sdGmvWAmFyGN5DWhYx6MR8KxFcqag1HoFArNdRDF matches the expected hash from the crash log above.
Furthermore, the log shows that the epoch_reward partition data account is created as expected!
[2024-01-22T15:53:30.882687996Z INFO solana_runtime::bank] create epoch_reward partition data account F6amLHBgM4h9AHz9BcyRsUymCbtMvd739Sd1WZKGdscP V0(
PartitionData {
num_partitions: 1,
parent_blockhash: 4ePEv3ksbuvAehYXFCHPsj3z5jRJ8MLC1bp8V8ma7aiS,
hasher_kind: Sip13,
},
)
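For reference, the fields in that log line map onto a small versioned struct. The sketch below simply mirrors what the log prints; the names and types are assumptions for illustration, not the exact solana-sdk definitions introduced in #34624:

```rust
use solana_sdk::hash::Hash;

// Illustrative mirror of the partition data shown in the log above.
#[derive(Debug)]
enum HasherKind {
    // SipHash 1-3, used to shuffle stake accounts into reward partitions.
    Sip13,
}

#[derive(Debug)]
struct PartitionData {
    // Number of reward partitions; 1 when rewards are forced into a single slot.
    num_partitions: usize,
    // Parent blockhash at the epoch boundary, used as the shuffle seed.
    parent_blockhash: Hash,
    // Which hash function partitions the stake accounts.
    hasher_kind: HasherKind,
}

// Versioned wrapper, matching the V0(...) in the log.
#[derive(Debug)]
enum EpochRewardsPartitionDataVersion {
    V0(PartitionData),
}
```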
Is this an issue because with the CLI flag we create the PDA account, thus producing a different bank hash because the rest of the cluster does not have this account?
Yeah. Exactly.
Great! Then yes, this idea looks safe to me. We already do the same thing for the sysvar accounts, so also ignoring another account makes sense.
Will capitalization still match with the bank? Does this sysvar have any lamports?
Part 2: when is the sysvar deleted? We had to work on the earlier CLI PR to make sure we deleted the sysvar that holds the rewards themselves (as lamports). Since the CLI arg does it all in one spot, the rewards sysvar gets created, drained, then deleted.
Good question. No, capitalization will mismatch; we need to fix that too.
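One possible shape of such a capitalization fix, purely as a sketch (the parameter names are hypothetical, not the PR's actual code): back the lamports of the locally force-created partition-data accounts out of capitalization before comparing against the cluster's expected value.

```rust
/// Hypothetical illustration: subtract locally force-created partition-data
/// lamports so capitalization stays comparable with the rest of the cluster.
fn capitalization_for_comparison(
    bank_capitalization: u64,
    partitioned_rewards_force_enabled: bool,
    forced_partition_data_lamports: u64,
) -> u64 {
    if partitioned_rewards_force_enabled {
        bank_capitalization.saturating_sub(forced_partition_data_lamports)
    } else {
        bank_capitalization
    }
}
```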
The new accounts are not sysvar accounts. One account is created for each epoch, and they won't be deleted; they stay forever.
Also, turning the CLI flag on/off affects the epoch account hash too.
I have been running a node that toggles this flag between epochs, and it seems to result in epoch_account_hash mismatches. We have to fix the epoch account hash too, i.e. exclude the PDA accounts there as well.
https://github.com/solana-labs/solana/issues/34876 has more details.
I restarted jw14 with the new fixes for the lamport (capitalization) and epoch account hash issues. Let's see if this works.
Block height: 224307951
Slot: 243486306
Epoch: 563
Transaction Count: 263486182012
Epoch Slot Range: [243216000..243648000)
Epoch Completed Percent: 62.571%
Epoch Completed Slots: 270306/432000 (161694 remaining)
Epoch Completed Time: 1day 7h 6m 6s/2days 1h 39m 5s (18h 32m 59s remaining)
We will know at epoch 564.75; we are now at epoch 563.62, about 1.13 epochs away.
I'm holding off on review until the churn is done and CI is green. Feel free to give me a nudge when it's at that point.
My test node passed the epoch account hash check for epoch 564. The fix works!