
sudo reboot mb validator recovery fails

Open john-smith-solana opened this issue 4 months ago • 8 comments

Problem

1.17.20, 10-12 Feb 2024.

I was curious about an mb validator's recovery ability, so I used a spare non-voting mb validator to see whether it could recover from an abrupt sudo reboot.

I ran default validator settings, so incremental snapshots happen every minute and full snapshots every 3 hours.

I tried 2 different validator startup scripts:

Script A: included --use-snapshot-archives-at-startup when-newest
Script B: it was removed


Test Methodology 1:

sudo reboot when incremental snapshots are available and less than 1 minute old:

Script A: 3 reboots, 3 successful recoveries, each in approx 13 mins
Script B: 3 reboots, 3 successful recoveries, each in approx 15 mins

So far so good!


Test Methodology 2:

However, approximately 10 mins before the 3-hour full snapshot is due, the validator stops creating minute-by-minute incrementals and starts only creating the next full snapshot. This means the last incremental gets up to ~15 mins old. [At least, this is my interpretation of what it looks like it's doing!].

sudo reboot at various times with an old/aging incremental during full snapshot creation (for clarity this is approx a 15 minute window every 3 hours):

Script A:
  • Incremental 8 mins old: Failed
  • Incremental 6 mins old: Failed
  • Incremental 5 mins old: Failed

Script B:
  • Incremental 7 mins old: Success, took 19 mins
  • Incremental 9 mins old: Success, took 20 mins
  • Incremental 13 mins old: Success, took 23 mins

For Test Methodology 2 using Script A, each time it failed with:

ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/solana-accounts/run/247471897.30603
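For anyone trying to spot the same failure in their own logs, here is a minimal Python sketch. It assumes the error line has the format quoted above. The function names are made up for illustration, and reading the file name as `<slot>.<id>` is an assumption, not something confirmed in this thread.

```python
import re

# Pattern taken from the error line quoted in this issue; other validator
# versions may word it differently (assumption).
APPENDVEC_ERROR = re.compile(
    r"AppendVecError: incorrect layout/length/data in the appendvec at path (?P<path>\S+)"
)


def find_appendvec_failures(log_lines):
    """Return the account-file paths named in AppendVecError log lines."""
    paths = []
    for line in log_lines:
        match = APPENDVEC_ERROR.search(line)
        if match:
            paths.append(match.group("path"))
    return paths


def split_slot_and_id(path):
    """Split a trailing '<slot>.<id>' file name into two integers.

    Treating the two numbers as slot and file id is an assumption.
    """
    slot, file_id = path.rsplit("/", 1)[-1].split(".")
    return int(slot), int(file_id)
```

Feeding the validator log through `find_appendvec_failures` line by line shows which account files a failed startup complained about.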

Proposed Solution

On Brooks's advice in the Discord, I'm opening this issue to address the Script A failures in Test Methodology 2.

john-smith-solana avatar Feb 13 '24 23:02 john-smith-solana

I'll take a look. Thanks for filing this issue.

brooksprumo avatar Feb 14 '24 00:02 brooksprumo

Had a similar issue when upgrading from 1.17.20 to 1.17.22. When including --use-snapshot-archives-at-startup when-newest the node would keep crashing with:

Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path...

Removing the arg fixed the issue

segfaultdoc avatar Feb 17 '24 18:02 segfaultdoc

@segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit ?

steviez avatar Feb 17 '24 19:02 steviez

So far I have been unable to reproduce a failure with fastboot. Here are the experiments I've performed so far. For all of them I have specified --use-snapshot-archives-at-startup when-newest on the CLI.

Note that the terminology may not make sense, as this is copy-pasted from my own internal notes.

| # | initial version | restart version | restart method | result |
|---|---|---|---|---|
| 1 | v1.17.23 | v1.17.23 | ./restart with bank snapshot POST | OK |
| 2 | v1.17.23 | v1.17.23 | ./restart with bank snapshot PRE | OK, uses next POST correctly |
| 3 | v1.17.23 | v1.17.23 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 4 | v1.17.23 | v1.18.3 | ./stop then ./restart withOUT a bank snapshot, just account hard links | OK, uses next POST correctly |
| 5 | v1.18.3 | v1.17.22 | ./restart with bank snapshot POST | OK |
| 6 | v1.17.22 | v1.17.22 | kill -9 then ./restart | OK |
| 7 | v1.17.22 | v1.18.2 | kill -9 then ./restart | OK |

  • 1 through 5 are all graceful shutdowns, whereas 6 and 7 are not
  • 1-3 and 6 are all the same minor version, whereas 4 & 7 are upgrades, and 5 is a downgrade

I'm not sure what else to try at the moment. Are there other permutations I've missed?

brooksprumo avatar Feb 21 '24 15:02 brooksprumo

@segfaultdoc - Out of curiosity, how was your node stopped / restarted? Similar to what the author of this issue is doing (maybe systemctl stop sol), or are you using something like solana-validator exit ?

Similar to author, restarted the systemd service. I'm starting to think maybe we're not exiting cleanly in jito-solana. Do you know if a panic in some thread on exit would cause this?

segfaultdoc avatar Feb 21 '24 19:02 segfaultdoc

sudo reboot

See attached screenshot.

This shows Script B, Test Methodology 2. ls -lh was run seconds before the sudo reboot, so it's an accurate depiction of the ledger directory as the reboot was executed. The full snapshot had reached 28G out of ~58G and the last incremental was 7 minutes old.

This Script B recovered successfully, in approx 19 minutes.

However, when --use-snapshot-archives-at-startup when-newest was added (Script A) it would fail to recover.

john-smith-solana avatar Feb 21 '24 20:02 john-smith-solana

Do you know if a panic in some thread on exit would cause this?

This is what I was trying to reproduce by randomly killing the validator process in 6 & 7. It's also possible I just didn't hit the issue. Two runs is not a lot.


See attached screenshot.

The --use-snapshot-archives-at-startup when-newest cli arg does not use the snapshot archives, so in theory this should not impact anything. I'll try it out though.

If you happen to have the contents of what's in your /mnt/solana-ledger/snapshot directory, that would be interesting. I would expect it to have a directory with number higher than 24570877.
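A quick way to check that is sketched below in Python. It assumes the snapshot directory holds one subdirectory per slot, named by slot number, as the path /mnt/snapshots/snapshot/250196251/250196251 reported elsewhere in this thread suggests. The function name is made up for illustration.

```python
from pathlib import Path


def newest_bank_snapshot_slot(snapshot_dir):
    """Return the highest numerically named subdirectory, or None if there is none.

    Assumes each bank snapshot lives in a directory named after its slot.
    """
    slots = [
        int(entry.name)
        for entry in Path(snapshot_dir).iterdir()
        if entry.is_dir() and entry.name.isdigit()
    ]
    return max(slots, default=None)
```

For example, `newest_bank_snapshot_slot("/mnt/solana-ledger/snapshot")` would report the most recent slot directory found there, which can then be compared with the slot numbers in the snapshot archive file names.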

brooksprumo avatar Feb 21 '24 20:02 brooksprumo

That was a screenshot I took at the time (12 Feb, 1.17.20). Afraid I no longer have the server.

john-smith-solana avatar Feb 21 '24 21:02 john-smith-solana

After just hitting a space issue on my snapshot dir I crashed with this:

thread 'solSnapshotPkgr' panicked at core/src/snapshot_packager_service.rs:81:26:
failed to archive snapshot package: Io(Custom { kind: StorageFull, error: Error { kind: Write, source: Os { code: 28, kind: StorageFull, message: "No space left on device" }, path: "/mnt/snapshots/tmp-snapshot-archive-250193679.tar.zst" } })

Which led to a boot crash loop with this being the only error there:

[2024-02-24T17:02:06.317297546Z ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/validator/accounts/run/250194513.13821324
    snapshot: /mnt/snapshots/snapshot/250196251/250196251

Further context on Discord: https://discord.com/channels/428295358100013066/689412830075551748/1210995187950686288
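Since the "No space left on device" panic above is what started the crash loop, a pre-flight check on the snapshot volume may help operators catch it earlier. The Python sketch below is only an illustration: the function name and the 20% headroom are assumptions, and the ~58G archive size is the figure mentioned earlier in this thread, not a fixed value.

```python
import shutil


def has_room_for_archive(archive_dir, expected_bytes, headroom=1.2):
    """Return True if archive_dir has space for an archive of expected_bytes.

    headroom is a hypothetical safety factor; tune it for your setup.
    """
    free = shutil.disk_usage(archive_dir).free
    return free >= expected_bytes * headroom


if __name__ == "__main__":
    # ~58G full snapshot, as reported earlier in this thread.
    full_snapshot_bytes = 58 * 1024**3
    # "." is a placeholder; point this at your snapshot archive directory.
    if not has_room_for_archive(".", full_snapshot_bytes):
        print("warning: snapshot volume may fill up during the next full snapshot")
```

Running something like this from cron or a monitoring agent is one option; it does not address the underlying recovery bug, it only flags the low-space condition.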

michaelh-laine avatar Feb 24 '24 17:02 michaelh-laine

Ok, I've found the (an?) problem. The PR to fix it is here: https://github.com/solana-labs/solana/pull/35350

brooksprumo avatar Feb 28 '24 15:02 brooksprumo

Found the other problem. Here's a GH Issue for it: https://github.com/solana-labs/solana/issues/35367.

brooksprumo avatar Feb 29 '24 02:02 brooksprumo

That’s great news Brooks - thanks

john-smith-solana avatar Feb 29 '24 14:02 john-smith-solana

https://github.com/solana-labs/solana/pull/35350 has been merged, so the recovery aspect of this failure is now fixed. Other PRs will fix the underlying issues.

brooksprumo avatar Feb 29 '24 19:02 brooksprumo