sudo reboot mb validator recovery fails
Problem
1.17.20, 10-12 Feb 2024.
I was curious about a mb validator's recovery ability, so I used a spare non-voting mb validator to see if it could recover from an abrupt sudo reboot.
I ran default validator settings, so incremental snapshots happen every minute and full-snapshots every 3 hours.
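For reference, the "every minute" and "every 3 hours" figures roughly match slot-based intervals converted to wall-clock time. A small sketch of that arithmetic, assuming a ~400 ms slot time and defaults of 100 slots (incremental) and 25,000 slots (full) on 1.17.x. These three numbers are assumptions, not taken from this issue, so check them against your own version:

```python
# Rough conversion from snapshot intervals (in slots) to wall-clock time.
# ASSUMPTIONS (not from this issue): ~400 ms per slot, and default intervals
# of 100 slots (incremental) and 25_000 slots (full) on 1.17.x.
SLOT_SECONDS = 0.4
INCREMENTAL_INTERVAL_SLOTS = 100
FULL_INTERVAL_SLOTS = 25_000


def interval_seconds(slots: int, slot_seconds: float = SLOT_SECONDS) -> float:
    """Approximate wall-clock seconds covered by `slots` slots."""
    return slots * slot_seconds


incremental_s = interval_seconds(INCREMENTAL_INTERVAL_SLOTS)
full_s = interval_seconds(FULL_INTERVAL_SLOTS)

print(f"incremental snapshot: every ~{incremental_s:.0f} s")
print(f"full snapshot: every ~{full_s / 3600:.1f} h")
```

Under those assumptions the incremental interval comes out a bit under a minute and the full interval a bit under 3 hours, which is consistent with what was observed here.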
I tried 2 different validator startup scripts:
Script A: included --use-snapshot-archives-at-startup when-newest
Script B: the flag was removed
Test Methodology 1:
sudo reboot when incremental snapshots are available and less than 1 minute old:
- Script A: 3 reboots, 3 successful recoveries, each in approx 13 mins
- Script B: 3 reboots, 3 successful recoveries, each in approx 15 mins
So far so good!
Test Methodology 2:
However, approximately 10 mins before the 3-hour full snapshot is due, the validator stops creating minute-by-minute incrementals and starts only creating the next full snapshot. This means the last incremental gets up to ~15 mins old. [At least, this is my interpretation of what it looks like it's doing!].
sudo reboot at various times with an old/aging incremental during full snapshot creation (for clarity, this is approx a 15 minute window every 3 hours):
Script A:
- Incremental 8 mins old: Failed
- Incremental 6 mins old: Failed
- Incremental 5 mins old: Failed
Script B:
- Incremental 7 mins old: Success, took 19 mins
- Incremental 9 mins old: Success, took 20 mins
- Incremental 13 mins old: Success, took 23 mins
For Test Methodology 2 using Script A, it failed each time with:
ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/solana-accounts/run/247471897.30603
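When comparing failures across reboots, it can help to pull the storage file out of the error line. A minimal sketch, assuming the file name at the end of the path has the form `<slot>.<id>` (this naming is an assumption here, and `parse_appendvec_error` is a hypothetical helper, not part of any Solana tooling):

```python
import re
from typing import Optional, Tuple

# Matches the tail of the error line. ASSUMPTION: the storage file name is
# "<slot>.<id>", e.g. 247471897.30603.
APPENDVEC_RE = re.compile(
    r"appendvec at path (?P<path>\S*/(?P<slot>\d+)\.(?P<id>\d+))"
)


def parse_appendvec_error(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (path, slot, id) from an AppendVecError log line, or None."""
    match = APPENDVEC_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("slot")), int(match.group("id"))


line = (
    "ERROR solana_ledger::bank_forks_utils] Failed to load bank: "
    "AccountsFile error: AppendVecError: incorrect layout/length/data in "
    "the appendvec at path /mnt/solana-accounts/run/247471897.30603"
)
print(parse_appendvec_error(line))
```

Lines where the path has been truncated simply return None, so they can be skipped rather than guessed at.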
Proposed Solution
On the advice of Brooks in the discord this issue is opened to address Script A - Test Methodology 2 - failing.
I'll take a look. Thanks for filing this issue.
Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:
Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...
Removing the arg fixed the issue.
@segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit?
So far I have been unable to reproduce a failure with fastboot. Here are the experiments I've performed so far. For all of them I have specified --use-snapshot-archives-at-startup when-newest on the cli.
Note that the terminology may not make sense, as this is copy-pasted from my own internal notes.
| # | initial version | restart version | restart method | result |
|---|---|---|---|---|
| 1 | v1.17.23 | v1.17.23 | ./restart with bank snapshot POST | OK |
| 2 | v1.17.23 | v1.17.23 | ./restart with bank snapshot PRE | OK, uses next POST correctly |
| 3 | v1.17.23 | v1.17.23 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 4 | v1.17.23 | v1.18.3 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 5 | v1.18.3 | v1.17.22 | ./restart with bank snapshot POST | OK |
| 6 | v1.17.22 | v1.17.22 | kill -9 then ./restart | OK |
| 7 | v1.17.22 | v1.18.2 | kill -9 then ./restart | OK |
- 1 through 5 are all graceful shutdowns, whereas 6 and 7 are not
- 1-3 and 6 are all the same minor version, whereas 4 & 7 are upgrades, and 5 is a downgrade
I'm not sure what else to try at the moment. Are there other permutations I've missed?
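One way to look for gaps is to enumerate the two axes from the bullets above (shutdown type and version change) and check which combinations the seven experiments cover. A quick sketch; the labels are just shorthand for the table rows, and the mapping follows the two bullets:

```python
from itertools import product

SHUTDOWNS = ("graceful", "ungraceful")
VERSION_CHANGES = ("same", "upgrade", "downgrade")

# Experiment number -> (shutdown type, version change), per the table and
# the two summary bullets.
EXPERIMENTS = {
    1: ("graceful", "same"),
    2: ("graceful", "same"),
    3: ("graceful", "same"),
    4: ("graceful", "upgrade"),
    5: ("graceful", "downgrade"),
    6: ("ungraceful", "same"),
    7: ("ungraceful", "upgrade"),
}

covered = set(EXPERIMENTS.values())
missing = [combo for combo in product(SHUTDOWNS, VERSION_CHANGES) if combo not in covered]

print("missing combinations:", missing)
```

This only covers those two axes. It says nothing about when the restart happens relative to full snapshot creation, which is the variable that differs in Test Methodology 2.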
Similar to author, restarted the systemd service. I'm starting to think maybe we're not exiting cleanly in jito-solana. Do you know if a panic in some thread on exit would cause this?
See attached screenshot.
This shows Script B, Test Methodology 2. ls -lh was done seconds before the sudo reboot, so it's an accurate depiction of the ledger directory as the reboot was executed. The full snapshot had got to 28G out of ~58G and the last incremental was 7 minutes old.
This Script B run recovered successfully, in approx 19 minutes.
However, when --use-snapshot-archives-at-startup when-newest was added (Script A) it would fail to recover.
> Do you know if a panic in some thread on exit would cause this?

This is what I was trying to reproduce by randomly killing the validator process in 6 & 7. It's also possible I just didn't hit the issue. Two runs is not a lot.
> See attached screenshot.

The --use-snapshot-archives-at-startup when-newest cli arg does not use the snapshot archives, so in theory this should not impact anything. I'll try it out though.
If you happen to have the contents of what's in your /mnt/solana-ledger/snapshot directory, that would be interesting. I would expect it to have a directory with a number higher than 24570877.
That was a screenshot I took at the time (12 Feb, 1.17.20). Afraid I no longer have the server.
After just hitting a space issue on my snapshot dir I crashed with this:
thread 'solSnapshotPkgr' panicked at core/src/snapshot_packager_service.rs:81:26:
failed to archive snapshot package: Io(Custom { kind: StorageFull, error: Error { kind: Write, source: Os { code: 28, kind: StorageFull, message: "No space left on device" }, path: "/mnt/snapshots/tmp-snapshot-archive-250193679.tar.zst" } })
Which led to a boot crash loop with this being the only error there:
[2024-02-24T17:02:06.317297546Z ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/validator/accounts/run/250194513.13821324
snapshot: /mnt/snapshots/snapshot/250196251/250196251
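Since this crash loop started with the snapshot packager running out of disk, a free-space check on the snapshot directory is one way to catch the condition before it turns into a panic. A minimal sketch using only Python's standard library; `has_room` is a hypothetical helper, and the 2 x 58 GiB threshold is a placeholder loosely based on the ~58G full snapshot mentioned earlier in this thread, not a recommendation:

```python
import os
import shutil

GIB = 1024 ** 3


def has_room(path: str, required_bytes: int) -> bool:
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Path taken from the panic message; adjust for your own setup.
snapshot_dir = "/mnt/snapshots"
# HYPOTHETICAL threshold: room for roughly two full snapshot archives.
required = 2 * 58 * GIB

if os.path.isdir(snapshot_dir):
    status = "ok" if has_room(snapshot_dir, required) else "LOW"
    print(f"{snapshot_dir}: free space {status}")
else:
    print(f"{snapshot_dir}: not found on this machine")
```

This does not address the AppendVecError itself; it only flags the low-disk condition that preceded it in this case.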
Further context on Discord: https://discord.com/channels/428295358100013066/689412830075551748/1210995187950686288
Ok, I've found the (an?) problem. The PR to fix it is here: https://github.com/solana-labs/solana/pull/35350
Found the other problem. Here's a GH Issue for it: https://github.com/solana-labs/solana/issues/35367.
That’s great news Brooks - thanks
https://github.com/solana-labs/solana/pull/35350 has been merged, so the recovery aspect of this failure is now fixed. Other PRs will fix the underlying issues.