
Snapshot files should be sanitized before being selected by a CCF node at startup

Open andpiccione opened this issue 1 year ago • 4 comments

Describe the bug

We came across a scenario where a CCF node may select a "blank" snapshot at startup. Because that snapshot is unusable, the node falls back to starting from the ledger files, which increases startup time. As per this method, CCF chooses the latest snapshot file in the target snapshot directory based on the file name only (picking the one with the highest sequence number). It would be sensible to add a few sanity checks that verify a candidate snapshot file is actually valid, and otherwise ignore it and look for the newest valid snapshot in the directory.

To Reproduce

  1. Start a CCF network
  2. Configure a new node to join the main network, making it use a snapshot directory containing a 0-byte snapshot file with the highest sequence number. Upon startup, the CCF node loads the 0-byte snapshot, but it may eventually end up replaying transactions from the ledger files because the snapshot is empty / not usable.
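The 0-byte file from step 2 can be planted with a small script. This is a minimal sketch, not part of CCF: it assumes committed snapshots are named `snapshot_<seqno>_<evidence_seqno>.committed`, and the helper names `highest_seqno` and `plant_empty_snapshot` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: committed snapshot files follow this naming scheme.
SNAPSHOT_NAME = re.compile(r"^snapshot_(\d+)_(\d+)\.committed$")


def highest_seqno(snapshot_dir: Path) -> int:
    """Return the highest snapshot seqno found in the directory, or 0 if none."""
    highest = 0
    for entry in snapshot_dir.iterdir():
        match = SNAPSHOT_NAME.match(entry.name)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def plant_empty_snapshot(snapshot_dir: Path, gap: int = 100) -> Path:
    """Create a 0-byte snapshot file whose seqno is higher than any existing one."""
    seqno = highest_seqno(snapshot_dir) + gap
    path = snapshot_dir / f"snapshot_{seqno}_{seqno + 1}.committed"
    path.touch()  # creates an empty file
    return path
```

Running `plant_empty_snapshot(Path("/path/to/snapshots"))` against the joining node's snapshot directory before startup should make the empty file the one picked by name-based selection.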

Expected behavior

Before selecting a snapshot for a join/recovery node, CCF should check that the file is valid (e.g., either by running a few simple sanity checks or by directly loading and parsing the file). If the file is deemed not valid, the file should be skipped and the validation process should be repeated for the next candidate snapshot in the directory until a suitable one is found.
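The selection loop described above could look roughly like the following. This is an illustrative sketch in Python, not CCF's actual implementation: the file naming pattern is an assumption, the function names are hypothetical, and the default check only rejects missing or empty files. A stricter validator (for example, one that parses the snapshot) can be passed in via `is_valid`.

```python
import re
from pathlib import Path
from typing import Callable, Optional

# Assumption: committed snapshot files follow this naming scheme.
SNAPSHOT_NAME = re.compile(r"^snapshot_(\d+)_(\d+)\.committed$")


def is_plausible_snapshot(path: Path) -> bool:
    """Cheap sanity check: a regular file with at least one byte."""
    try:
        return path.is_file() and path.stat().st_size > 0
    except OSError:
        return False


def find_latest_valid_snapshot(
    snapshot_dir: Path,
    is_valid: Callable[[Path], bool] = is_plausible_snapshot,
) -> Optional[Path]:
    """Return the valid snapshot with the highest seqno, or None if there is none."""
    candidates = []
    for entry in snapshot_dir.iterdir():
        match = SNAPSHOT_NAME.match(entry.name)
        if match:
            candidates.append((int(match.group(1)), entry))
    # Try candidates from newest to oldest, skipping any that fail validation.
    for _, path in sorted(candidates, key=lambda c: c[0], reverse=True):
        if is_valid(path):
            return path
    return None
```

Returning `None` leaves the caller free to fall back to the ledger files, which matches the behaviour observed today when the chosen snapshot is unusable.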

Environment information

Tested on an Azure Managed CCF instance running on the SGX platform, using a CCF 5.0.0-dev15 build from the mcr.microsoft.com:ccf/app/run-js:5.0.0-dev15-sgx runtime image.

Additional context

We have seen a few occurrences of a 0-byte snapshot file in the shared ledger directory. These presumably originate from a copy of the file from the ledger volume to the shared volume (using azcopy) running while the node is still writing it. This is a rare situation, which could explain why we haven't seen many occurrences so far. We are fixing the error on our side by overwriting the file in the shared volume during the copy if the source file is newer, but it could also make sense for CCF to run a quick "sanity check" on a candidate snapshot before selecting it, to ensure it is valid.
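For illustration, the copy-side workaround can be sketched as follows. This does not use azcopy and is not our actual tooling: `sync_snapshot` is a hypothetical helper that overwrites the destination when it differs in size from the source or is older than it. Copying via a temporary name and renaming is an extra precaution added in this sketch, so that a reader of the shared volume never sees a partially written file under the final name.

```python
import os
import shutil
from pathlib import Path


def sync_snapshot(src: Path, dst_dir: Path) -> bool:
    """Copy src into dst_dir unless an up-to-date copy already exists.

    Returns True if a copy was made, False if the destination was left alone.
    """
    dst = dst_dir / src.name
    src_stat = src.stat()
    if dst.exists():
        dst_stat = dst.stat()
        same_size = dst_stat.st_size == src_stat.st_size
        not_older = dst_stat.st_mtime >= src_stat.st_mtime
        if same_size and not_older:
            return False
    # Copy under a temporary name first, then rename into place in one step.
    tmp = dst_dir / (src.name + ".tmp")
    shutil.copy2(src, tmp)
    os.replace(tmp, dst)
    return True
```

A 0-byte destination left behind by an earlier interrupted copy is replaced on the next run, because its size no longer matches the source.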

andpiccione • May 21 '24 11:05