CCF
Investigate recovery behaviour when old snapshot is used
The CCF primary generates a new snapshot at the end of the private recovery procedure. However, that snapshot may not be generated (see #1858) or may not be made available to a subsequently recovered service. It is then possible that an old snapshot generated for service N is used to start a new recovered service N+2 (i.e., no snapshot for service N+1 is available). When that happens, the N+2 node throws a "Previous service identity does not endorse the node identity that signed the snapshot" error on startup, because the specified previous_service_identity_file is for N+1 but the startup snapshot is for N.
The following non-exhaustive list of options is being considered to prevent this scenario, since it currently requires manual operator intervention to complete the DR process:
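To make the failure mode concrete, here is a minimal sketch of the check that rejects the old snapshot. It is a toy model, not CCF code: the `Snapshot` class, its `signed_under_service` field and the `check_snapshot_endorsement` function are hypothetical, and plain string equality stands in for real certificate endorsement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    seqno: int
    # Hypothetical field: the service whose node signed this snapshot.
    signed_under_service: str


def check_snapshot_endorsement(snapshot: Snapshot, previous_service_identity: str) -> None:
    """Raise if the supplied previous service identity does not match the snapshot.

    String equality is an assumption standing in for X.509 endorsement.
    """
    if snapshot.signed_under_service != previous_service_identity:
        raise ValueError(
            "Previous service identity does not endorse the node identity "
            "that signed the snapshot"
        )


old_snapshot = Snapshot(seqno=100, signed_under_service="service-N")

# Recovering N+1 from a snapshot of N, with the identity of N: accepted.
check_snapshot_endorsement(old_snapshot, "service-N")

# Recovering N+2 from the same snapshot, with the identity of N+1: rejected.
try:
    check_snapshot_endorsement(old_snapshot, "service-N+1")
    rejected = False
except ValueError:
    rejected = True
```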
1. Have operators keep track of which service certificate corresponds to which snapshot. This is likely to be fiddly for operators and would prevent a node from starting a new N+2 service from a snapshot altogether if a snapshot isn't available for N+1.
2. Have the CCF node verify that the snapshot it's started from is eventually (i.e. by the end of public recovery) endorsed by the specified previous_service_identity_file, i.e. the node recovers from snapshot N with previous_service_identity_file = N+1, applies that snapshot, deserialises the public ledger suffix and picks up the N certificate from the ledger. This means that the CCF node would apply the snapshot and deserialise the public ledger before it has verified the snapshot endorsement, which may have security implications.
3. Similar to 2., but move this process up-front to a Python script so that we get an early error on node startup rather than having to wait until the end of the public recovery. Operators would have to run that script to pick up the chain of service certificates between N and N+1 (possibly many!) and then pass it to the node on startup in a new previous_service_identity_files (plural) option. This could be slow (as the ledger suffix would have to be deserialised twice) but maintains the early error behaviour on the CCF node.
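As a rough illustration of what the script in option 3 might do, the sketch below walks a ledger suffix and collects every service identity recorded after the snapshot. The ledger is modelled as an in-memory list rather than read through the real CCF ledger tooling, and `LedgerEntry`, `new_service_identity` and `service_identity_chain` are hypothetical names introduced only for this example.

```python
from typing import List, NamedTuple, Optional


class LedgerEntry(NamedTuple):
    seqno: int
    # Hypothetical: set only on transactions that record a new service identity.
    new_service_identity: Optional[str]


def service_identity_chain(
    entries: List[LedgerEntry],
    snapshot_seqno: int,
    snapshot_identity: str,
    previous_identity: str,
) -> List[str]:
    """Collect the identities from the snapshot's service up to the previous service."""
    chain = [snapshot_identity]
    for entry in entries:
        # Only the suffix after the snapshot is relevant.
        if entry.seqno <= snapshot_seqno or entry.new_service_identity is None:
            continue
        if entry.new_service_identity != chain[-1]:
            chain.append(entry.new_service_identity)
    # Early error: the suffix must lead to the identity the operator supplied.
    if chain[-1] != previous_identity:
        raise ValueError("ledger suffix does not lead to the previous service identity")
    return chain


entries = [
    LedgerEntry(90, "service-N"),
    LedgerEntry(100, None),
    LedgerEntry(150, "service-N+1"),
    LedgerEntry(160, None),
]
chain = service_identity_chain(entries, 100, "service-N", "service-N+1")
```

In this model `chain` is the list an operator would pass to the node on startup. A real script would pay the cost noted above of deserialising the ledger suffix a second time.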
Thanks so much for listing some issues and mitigation options.
I think the best way to prevent this issue from occurring is to understand why the snapshot is not being generated or made available properly, so that we never fall into this state (e.g. #1858). In the case where we do fall into this state, I think option 3 is a good approach: it is decoupled thanks to the separate Python script, which can be run early on to see whether the ledger is in this state or not. I believe ACL has an endpoint that can be used to easily update the previous_service_identity_file. I see the steps being:
1. Before running a recovery, execute the Python script to check whether we are in this state, i.e. we do not have the latest recovered snapshot.
   1a. If we are not in this state, do not update previous_service_identity_file.
   1b. If we are in this state, update previous_service_identity_file with the identity returned from the chain of service certs between the two services.
2. Run the regular recovery process.
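The decision in steps 1a and 1b could be sketched as follows. This is only an illustration: `plan_identity_update` is a hypothetical helper, identities are plain strings rather than certificates, and the chain is assumed to have been produced by the script beforehand.

```python
from typing import List, Optional


def plan_identity_update(
    snapshot_identity: str,
    previous_identity: str,
    chain: List[str],
) -> Optional[List[str]]:
    """Return None for step 1a (no update), or the identities to supply for step 1b."""
    if snapshot_identity == previous_identity:
        # 1a: the latest recovered snapshot is present, so leave the file alone.
        return None
    # 1b: the snapshot is older, so the chain must connect it to the previous service.
    if not chain or chain[0] != snapshot_identity or chain[-1] != previous_identity:
        raise ValueError("chain does not connect the snapshot to the previous service")
    return chain


up_to_date = plan_identity_update("service-N+1", "service-N+1", [])
stale = plan_identity_update(
    "service-N", "service-N+1", ["service-N", "service-N+1"]
)
```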
Open question:
Just to be clear: would we have to update the previous_service_identity_file multiple times if the snapshot is behind by several service identities? If so, how would this process of updating multiple service identities work?
@PallabPaul Thank you for your response.
The reasons I see for recent snapshots not being available to a new recovered service are:
- Recent snapshots were too large to be generated as they include historical deleted keys. This will be solved in 3.0.0 with #4145
- Recent snapshots were too large to be generated as application code allows for many keys to be recorded. This is up to the application code (e.g. ACL) and operators monitor this closely and alert the users if necessary.
- The operator failed to copy the recent snapshots to the new recovery node on time.
In the case you describe, the operator has to run the Python script before each recovery, which may lengthen recovery time slightly as the ledger will need to be parsed. With regard to your question, I think the CCF node would allow the operator to pass a chain of service certificates rather than the single certificate it accepts currently.
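A minimal sketch of how accepting a chain would differ from the current single-certificate behaviour, assuming the node accepts the snapshot if the identity it was signed under appears anywhere in the supplied chain. `snapshot_accepted` is a hypothetical function, and string membership stands in for real certificate verification.

```python
from typing import Sequence


def snapshot_accepted(snapshot_identity: str, previous_identities: Sequence[str]) -> bool:
    """Accept the snapshot if any supplied identity matches the one it was signed under."""
    return snapshot_identity in previous_identities


# Single identity (N+1) supplied: a snapshot from N is rejected.
single = snapshot_accepted("service-N", ["service-N+1"])

# Chain supplied in one go: the same snapshot is accepted.
chained = snapshot_accepted("service-N", ["service-N", "service-N+1"])

# A snapshot several services behind needs a longer chain, not repeated updates.
far_behind = snapshot_accepted(
    "service-N", ["service-N", "service-N+1", "service-N+2"]
)
```

Under this assumption the operator passes the whole chain once, however many service identities the snapshot is behind.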
@lynshi FYI, this covers part of what we discussed yesterday (but not the joining of new nodes from snapshots pre-dating the recovery).