
Investigate recovery behaviour when old snapshot is used

jumaffre opened this issue 3 years ago

The CCF primary generates a new snapshot at the end of the private recovery procedure. However, that snapshot may not be generated (see #1858) or may not be made available to a subsequent recovered service. In that case, it is possible that an old snapshot generated for service N is used to start a new recovered service N+2 (i.e. no snapshot for service N+1 is available). The N+2 node will then throw a "Previous service identity does not endorse the node identity that signed the snapshot" error on startup, because the specified previous_service_identity_file is for N+1 while the startup snapshot is for N.
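
The failing startup check can be modelled as follows. This is a simplified stand-in, not CCF's actual implementation: the `Snapshot` class and the string comparison replace the real certificate endorsement verification.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Identity of the service whose primary node signed this snapshot
    signing_service: str


def check_snapshot_endorsement(snapshot: Snapshot, previous_service_identity: str) -> None:
    """Simplified model of the startup check: the configured previous service
    identity must match the service that endorsed the snapshot's signer."""
    if snapshot.signing_service != previous_service_identity:
        raise ValueError(
            "Previous service identity does not endorse the node identity "
            "that signed the snapshot"
        )


# Snapshot generated during service N, but the operator supplies the N+1 identity:
old_snapshot = Snapshot(signing_service="service_N")
try:
    check_snapshot_endorsement(old_snapshot, "service_N+1")
except ValueError as e:
    print(f"startup failed: {e}")
```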

The following non-exhaustive list of options is being considered to prevent such a scenario, as the current behaviour requires manual operator intervention to complete the DR process:

  1. Have operators keep track of which service certificate corresponds to which snapshot. This is likely to be fiddly for operators and would prevent a node from starting a new N+2 service from a snapshot altogether if no snapshot is available for N+1.
  2. Have the CCF node verify that the snapshot it starts from is eventually (i.e. by the end of public recovery) endorsed by the specified previous_service_identity_file: the node recovers from snapshot N with previous_service_identity_file = N+1, applies the snapshot, deserialises the public ledger suffix, and picks up the N certificate from the ledger. This means that the CCF node would apply the snapshot and deserialise the public ledger before it has verified the snapshot endorsement, which may have security implications.
  3. Similar to 2., but move this process up front into a Python script so that we get an early error on node startup rather than having to wait until the end of public recovery. Operators would have to run that script to pick up the chain of service certificates between N and N+1 (possibly many!) and then pass it to the node on startup via a new, plural previous_service_identity_files option. This could be slow (the ledger suffix would have to be deserialised twice) but maintains the early-error behaviour on the CCF node.
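
The script in option 3 could look roughly like the sketch below. The ledger representation (a list of dicts with a `service_cert` key) and the function name are hypothetical illustrations of the certificate-chain extraction, not CCF's real ledger format or tooling.

```python
def collect_service_cert_chain(public_ledger_suffix, snapshot_service_cert, target_service_cert):
    """Walk the public ledger suffix in order and collect every service
    certificate between the one that endorsed the snapshot and the one the
    operator intends to pass as the previous service identity."""
    chain = [snapshot_service_cert]
    for tx in public_ledger_suffix:
        # Hypothetical transaction shape: service-info updates carry a new cert
        cert = tx.get("service_cert")
        if cert and cert != chain[-1]:
            chain.append(cert)
        if chain[-1] == target_service_cert:
            return chain
    raise ValueError("Target service certificate not found in ledger suffix")


ledger = [
    {"service_cert": "cert_N"},
    {"other": "unrelated transaction"},
    {"service_cert": "cert_N+1"},
]
print(collect_service_cert_chain(ledger, "cert_N", "cert_N+1"))
# -> ['cert_N', 'cert_N+1']
```

The returned chain is what would be passed to the node at startup in place of a single previous service identity.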

jumaffre avatar Nov 11 '22 11:11 jumaffre

Thanks so much for listing some issues and mitigation options.

I think the best way to prevent this issue from occurring is to understand the reasons why the snapshot is not being generated or made available properly (such as #1858), so that we never fall into this state. In the case where we do fall into this state, I think option 3 seems like a good approach: it is decoupled, as a separate Python script, and can be run early on to see whether the ledger is in this state or not. I believe ACL has an endpoint that can be used to easily update the previous_service_identity_file. I see the steps being:

  1. Before running a recovery, execute the Python script to check whether we are in the state where we do not have the latest recovered snapshot.
     1a. If we are not in this state, do not update previous_service_identity_file.
     1b. If we are in this state, update the previous_service_identity_file with the identity returned by the chain of service certificates between the two services.
  2. Run the regular recovery process.
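
The two-branch decision in step 1 could be sketched as below; the function name is hypothetical, and the chain argument stands in for the output of the option-3 script.

```python
def resolve_previous_identity(configured_identity, snapshot_service_cert, cert_chain):
    """Step 1 of the proposed workflow: if the snapshot was already endorsed by
    the configured previous service identity (1a), keep the configuration as-is;
    otherwise (1b), use the chain of service certificates recovered by the script."""
    if snapshot_service_cert == configured_identity:
        return [configured_identity]  # 1a: nothing to update
    return cert_chain                 # 1b: pass the full chain instead


# Snapshot is one service behind the configured identity: supply the recovered chain
print(resolve_previous_identity("cert_N+1", "cert_N", ["cert_N", "cert_N+1"]))
# -> ['cert_N', 'cert_N+1']
```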

Open question: just to be clear, would we have to update the previous_service_identity_file multiple times if the chain of services is behind by multiple service identities? If so, how would this process of updating multiple service identities work?

PallabPaul avatar Nov 14 '22 17:11 PallabPaul

@PallabPaul Thank you for your response.

The reasons I see for recent snapshots not being available to a new recovered service are:

  • Recent snapshots were too large to be generated because they include historical deleted keys. This will be solved in 3.0.0 with #4145.
  • Recent snapshots were too large to be generated because the application code allows many keys to be recorded. This is up to the application code (e.g. ACL); operators should monitor this closely and alert users if necessary.
  • The operator failed to copy the recent snapshots to the new recovery node in time.

In the case you describe, the operator has to run the Python script before each recovery, which may lengthen recovery time slightly as the ledger will need to be parsed. With regard to your question, I think the CCF node would allow the operator to pass a chain of service certificates rather than the single one it accepts currently.
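
Accepting a chain rather than a single certificate would mean the node verifies each link in turn. A minimal model follows, with endorsement reduced to a lookup table; in CCF this would be an actual signature verification of each successor certificate, not a set membership test.

```python
def verify_identity_chain(chain, endorsements):
    """Check that each service identity in the chain endorses the next one.
    `endorsements` maps an identity to the set of identities it has endorsed."""
    for issuer, successor in zip(chain, chain[1:]):
        if successor not in endorsements.get(issuer, set()):
            return False
    return True


endorsements = {
    "cert_N": {"cert_N+1"},
    "cert_N+1": {"cert_N+2"},
}
print(verify_identity_chain(["cert_N", "cert_N+1", "cert_N+2"], endorsements))  # True
print(verify_identity_chain(["cert_N", "cert_N+2"], endorsements))              # False
```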

jumaffre avatar Nov 16 '22 15:11 jumaffre

@lynshi FYI, this covers part of what we discussed yesterday (but not the joining of new nodes from snapshots pre-dating the recovery).

achamayou avatar Nov 29 '22 09:11 achamayou