
[BUG] - Invalid snapshot DiskSnapshot, Replaying ledger from genesis

Open gnomed opened this issue 1 year ago • 8 comments

Internal/External: External

Area: Other

Summary I am trying to test 9.0.0 on the preview network. A clean deployment of 9.0.0 over version 8.12.2 produces the errors below and causes the ledger to be replayed from genesis.

[76903c42:cardano.node.ChainDB:Error:5] [2024-07-16 03:32:58.89 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54433676, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 373017483 "Size mismatch when decoding Record RecD.\nExpected 4, but found 5."))
[76903c42:cardano.node.ChainDB:Error:5] [2024-07-16 03:33:28.45 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54433086, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 373017841 "Size mismatch when decoding Record RecD.\nExpected 4, but found 5."))
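The failure can be illustrated with a small sketch. This is hypothetical Python, not the node's actual Haskell decoder; the real snapshot is CBOR and the field names below are invented. The error text suggests a fixed-arity record decoder that rejects any record whose field count differs from what it expects:

```python
# Hypothetical sketch (not cardano-node code) of a fixed-arity record
# decoder: it checks the exact field count of a record and fails with a
# size mismatch if a snapshot was written with a different arity.

def decode_record(fields, expected_arity):
    """Mimic a fixed-arity record decoder: reject any arity mismatch."""
    if len(fields) != expected_arity:
        raise ValueError(
            "Size mismatch when decoding Record RecD.\n"
            f"Expected {expected_arity}, but found {len(fields)}."
        )
    return tuple(fields)

# A record with the expected 4 fields decodes fine; one with a 5th field
# (field names purely illustrative) reproduces the error text above:
decode_record(["slot", "utxo", "rewards", "pparams"], 4)
try:
    decode_record(["slot", "utxo", "rewards", "pparams", "gov-state"], 4)
except ValueError as err:
    print(err)
```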

Steps to reproduce Steps to reproduce the behavior:

  1. Run a relay with cardano-node version 8.12.2
  2. Allow it to sync and run stably for some time
  3. Gracefully shut down the 8.12.2 node
  4. Launch the 9.0.0 version of Cardano Node
  5. Read the logs and note whether ledger is replayed from genesis

I also tested this on preprod net with the same results. Before trying on the preprod net I verified 8.12.2 was able to restart without any issues at all. The issue seems specific to the upgrade.

Expected behavior I expect the node to detect the clean shutdown and resume from the existing ledger state, as it does when I restart the 8.12.2 version.

System info (please complete the following information):

  • OS Name: Ubuntu
  • OS Version: 22.04
  • Node version 9.0.0 from 8.12.2
  • Running the official cardano-node docker image

Additional context I did change a config value to disable PeerSharing, but I would not expect that to cause this. This is my first bug report, so apologies if I missed something.

gnomed avatar Jul 16 '24 04:07 gnomed

I suspect this issue is expected. Often, when upgrading the node, the on-disk format of snapshots changes, requiring a replay from genesis. I will let @IntersectMBO/cardano-ledger confirm whether this is expected, though.

jasagredo avatar Jul 23 '24 09:07 jasagredo

I got this too. If it is intended, it would be nice if it could be handled better, with a more suitable message so it's clearer whether this is a bug. For example, the node could store a version number and log that the version number has changed and the ledger must be replayed.

Relying on the read failing because of a size difference also feels odd: what if the format changes but the size does not? Would the node blindly read potentially incorrect data? 🤔
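To make the idea concrete, a version prefix could look something like this sketch. This is hypothetical Python, not the actual snapshot format; the magic bytes, header layout, and version numbers are all invented:

```python
# Hypothetical sketch of a versioned snapshot envelope: a magic marker
# plus an explicit format version, so a mismatch produces a clear
# "format changed, replay expected" error instead of an opaque
# deserialisation failure deep inside the payload.
import struct

MAGIC = b"SNAP"          # invented marker, not a real cardano-node value
CURRENT_VERSION = 2      # invented version number

class SnapshotVersionMismatch(Exception):
    pass

def write_snapshot(payload: bytes) -> bytes:
    """Prefix the serialised ledger state with magic + version."""
    return MAGIC + struct.pack(">I", CURRENT_VERSION) + payload

def read_snapshot(blob: bytes) -> bytes:
    """Check the header before touching the payload at all."""
    if blob[:4] != MAGIC:
        raise SnapshotVersionMismatch(
            "not a snapshot file (or written before versioning existed)")
    (version,) = struct.unpack(">I", blob[4:8])
    if version != CURRENT_VERSION:
        raise SnapshotVersionMismatch(
            f"snapshot written with format v{version}, this node reads "
            f"v{CURRENT_VERSION}; a replay from genesis is expected")
    return blob[8:]
```

The point is that the version check happens before any payload decoding, so even a format change that keeps record sizes identical would be detected.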

DanTup avatar Jul 26 '24 12:07 DanTup

In some sense the ship has already sailed, so perhaps the point is moot, but it would be nice to see this type of functionality extended in a more backward-compatible way.

For example, the node could fall back to the old format if it encounters an error during reading, but write the new format going forward. It would only need to keep the backward compatibility around for the next two minor versions.
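A sketch of what I mean, with purely illustrative names and JSON standing in for the real CBOR encoding (none of this is cardano-node API):

```python
# Hypothetical fallback strategy: try the current decoder first, fall
# back to the previous release's decoder on failure, and migrate the
# result in memory so the next snapshot written to disk is already in
# the new format.
import json

def decode_v2(raw: str) -> dict:
    snap = json.loads(raw)
    if snap.get("version") != 2:
        raise ValueError("not a v2 snapshot")
    return snap

def decode_v1(raw: str) -> dict:
    snap = json.loads(raw)
    if "version" in snap:
        raise ValueError("not a v1 snapshot")
    return snap

def migrate_v1_to_v2(snap: dict) -> dict:
    # Invented migration: v1 snapshots simply lacked the version field.
    return {"version": 2, **snap}

def load_snapshot(raw: str) -> dict:
    try:
        return decode_v2(raw)
    except ValueError:
        # Old-format snapshot: decode with last release's logic, then
        # upgrade it so the node operates on the new structures.
        return migrate_v1_to_v2(decode_v1(raw))
```

After the transition window, `decode_v1` and `migrate_v1_to_v2` could simply be deleted.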

I might be speaking nonsense since I have little knowledge of the underlying implementation logic, in which case I apologize.

gnomed avatar Jul 26 '24 15:07 gnomed

In some sense the ship has already sailed

It has for the current version, but surely the next time the data format changes, a version number could be added, and then subsequent versions could check it?

For example having the node fall back to the old format if it encounters an error during reading, but write the new format going forward.

I know nothing about how this works, but my guess is that reading the old format would mean keeping two sets of code/data structures around, which they currently don't do, and the old data would need to be converted into the new format at startup anyway (so the node can operate with the new structures). I also know nothing about this though 😄

I'm not against rebuilding the data now and then; it would just be nice if the logs made it clear whether this is expected or whether my store got corrupted (in which case, I might have a problem I want to investigate!).

DanTup avatar Jul 26 '24 15:07 DanTup

Two things:

  1. this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to the ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quickly enough.
  2. Yes, we've been talking about a version number for ledger state for a while now. It's in the backlog, but isn't quite as simple as just adding a version number. It requires some communication between ledger and consensus to negotiate this and feed it back to the user in a trace.

disassembler avatar Jul 27 '24 12:07 disassembler

For example having the node fall back to the old format if it encounters an error during reading, but write the new format going forward. Would only need to keep the backward compatibility around for the next 2 minor versions.

The new version of the node doesn't know the old format; that serialization lives in a previous version of the code. It can only tell that the serialization is broken, not why it's broken. I do agree that a better error message based on a version number would make this nicer, though.

disassembler avatar Jul 27 '24 12:07 disassembler

this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quick enough.

Thanks, I was worried that would be the answer. I'm running with microk8s and I shut the nodes down by scaling to 0 instances. My understanding was that this would gracefully shut down the container (it's using the official docker container).
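For anyone hitting the same thing: the knob to check here is the pod's termination grace period. Kubernetes sends SIGTERM on scale-down, but by default only waits 30 seconds before sending SIGKILL, which would look like an unclean shutdown to the node. A sketch of the relevant fragment (values and image tag are illustrative, not my actual config):

```yaml
# Illustrative pod spec fragment: give cardano-node enough time to
# flush its state on SIGTERM before the kubelet escalates to SIGKILL.
spec:
  terminationGracePeriodSeconds: 600   # default is 30, likely too short
  containers:
    - name: cardano-node
      image: ghcr.io/intersectmbo/cardano-node:9.0.0  # illustrative tag
      # SIGTERM must actually reach the node process: make sure it runs
      # as PID 1 in the container or that signals are propagated to it.
```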

What's odd, though, is that of all the times I recall seeing it do this, it's always been when I upgrade to a new version of the node; I don't recall ever seeing it when I do the same to make k8s config changes or reboot the host for OS updates 🤔

Is it possible the error could include more information (for example, which file it was reading and what data it read) to help understand what might be wrong when it occurs?

DanTup avatar Jul 27 '24 12:07 DanTup

this is not expected from 8.12.2 -> 9.0/9.1. There were no changes to ledger requiring a replay between these versions. I would look into your deployment logic and make sure you don't have something that terminates the node if it doesn't stop quick enough.

I strongly disagree with this suggestion and statement. I use the same rollout process for all node changes, and that process has worked correctly for other updates in the weeks prior to and since this bug report (the upgrade from 9.0 to 9.1 was seamless in preview and preprod). Also, the error message for a dirty shutdown is completely different from this snapshot error message. The only variable that reliably triggers this behaviour is going from 8.x to 9.x.

new version of the node doesn't know the old format

Yes, I was suggesting keeping both implementations in the same version for a couple of transitional releases as only one possible strategy (out of many) for backward compatibility, inspired slightly by the combinator approach cardano-node itself uses for its own protocol migrations.

gnomed avatar Jul 28 '24 18:07 gnomed

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Aug 28 '24 02:08 github-actions[bot]

Closing, as the point is more or less moot for this specific version change and of negligible value now.

I do hope that the next time the schema changes, someone makes it backward compatible for at least one version cycle.

gnomed avatar Sep 22 '24 22:09 gnomed