mpc-recovery
Add Key Migration Logic
In case we need to replace one of the validating nodes, we will need this functionality. The same goes for when we move to threshold signing.
Potential Implementation of Key Rotation
To keep running the MPC service without any downtime, the proposed pipeline roughly looks like this:
- Rotate key is called on a sign node.
- The sign node generates a new secret key and cipher key for itself.
- The new key is made known to the leader node.
- The leader node purges the old pk-set.
  - NOTE: with threshold signing we wouldn't necessarily need to purge the pk-set, since the non-compromised nodes can generate new secret shares on top of the master secret.
- Export the datastore as a backup in case the migration fails.
- Migration happens in the background, where records are re-encrypted with the new cipher key.
  - The sign node keeps the old cipher key, so records that haven't been updated yet can still be decrypted while the migration is ongoing.
- The new cipher key is used for new account-creation requests.
- Delete the old cipher key and the old datastore backup once the migration succeeds.
Even if we switch to threshold keys later, this part of key rotation should still be roughly the same.
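The dual-key window in the pipeline above can be sketched as follows. This is a toy illustration, not the real cipher: a SHA-256 XOR keystream stands in for the actual encryption, and record fields like `key_version` are made-up names.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Deterministic toy keystream derived from the key (illustrative only)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(plaintext, keystream(key, len(plaintext))))

decrypt = encrypt  # XOR keystream is symmetric

def migrate_records(records, old_key, new_key):
    """Background migration: re-encrypt every record under the new cipher key."""
    for rec in records:
        plaintext = decrypt(old_key, rec["blob"])
        rec["blob"] = encrypt(new_key, plaintext)
        rec["key_version"] = 2  # hypothetical "already migrated" marker

def read_record(rec, old_key, new_key):
    """While migration runs, fall back to the old key for unmigrated records."""
    key = new_key if rec.get("key_version") == 2 else old_key
    return decrypt(key, rec["blob"])
```

The point is that reads keep working throughout: a record is decryptable with exactly one of the two keys, and the marker tells us which.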
Open Questions
Who triggers the key rotation?
- Leader node calls the sign node to rotate keys.
  - The leader node would need to somehow know that a sign node was compromised.
- Sign node rotates its keys and tells the leader node about it.
  - We can have a heartbeat endpoint where the leader node can check the current state of a sign node. Probably worth having anyway, since a sign node can go down and the heartbeat would alert us of this.
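A minimal sketch of what such a heartbeat endpoint could look like, assuming the sign node serves plain HTTP; the status payload and field names like `cipher_key_version` are made up for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical status payload a sign node might report.
NODE_STATE = {"node_id": 1, "cipher_key_version": 2, "healthy": True}

class HeartbeatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/heartbeat":
            body = json.dumps(NODE_STATE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *_):  # keep request logging quiet
        pass

def start_heartbeat_server():
    """Start the toy sign-node heartbeat server on an ephemeral port."""
    server = HTTPServer(("127.0.0.1", 0), HeartbeatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The leader node would poll this periodically and alert (or trigger rotation) when a node reports unhealthy or stops responding.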
Great summary @ChaoticTempest! This ticket seems like it will be a bigger effort than expected, so let's create some subtasks here. I think we can start working roughly from the end of the flow and do something like this:
- [ ] We should be able to provide a sign node with two secret/cipher pairs; it would use the first one to register new accounts and decrypt them, falling back to the second pair if decryption fails. This can be tested reasonably independently, and we can create a temporary CLI parameter to provide both pairs for now.
- [ ] Extend above with migration capabilities as per your description. Again, we can test this functionality independently - e.g. give sign node key A, create a user U1, restart sign node with keys B+A, create a user U2, sleep for some time, restart sign node with key B, check that both U1 and U2 are still recoverable (i.e. U1 was migrated from A to B).
- [ ] Extend above with backup. Testing is easy: just let the migration flow run and test that restarting with backup still works.
- [ ] Add a trigger mechanism that completes the flow.
I would also add another open question: how should the rotation be triggered? We want to ensure that we don't trigger it accidentally and also that it is only called by an authorized entity. Maybe it should be an endpoint that depends on a cold key stored outside of the system?
This could also just be a fastauth account that sort of has admin privileges if we want key recovery mechanisms later on, but for now a cold key should do the trick
Just had an offline discussion with @volovyks. So far, what I've written before only tackles the node-key rotation part, not the per-user keys stored on GCP, which would be required if we added a new node/partner. Beyond that, we came to the conclusion that this work might be a one-off: we'll probably only use it once, to migrate to threshold signing or to add another node/partner. Meanwhile, it adds an extra endpoint that we would have to maintain long term, plus a new process for rotating the per-user keys. That process would be pretty complicated as well, requiring us to swap out the actual recovery key on chain (since we don't have threshold yet).
Then we have these new questions up for discussion:
- Is node-key rotation required at all?
  - After looking at threshold signing again, it's not required for key-resharing, since new keys can just be generated on top of the old ones. But we might still need it for clearing out the shares of nodes that have left the network or been compromised somehow.
- How do we deal with per-user key rotations in the current system?
  - @volovyks and I thought we shouldn't, because that's a pretty complicated process: it requires orchestrating the removal of the recovery key on chain and replacing it with a new one (which isn't ideal, since there's a limit on how many keys we can add?). All this work would be moot once we move to threshold anyway. However, it would be required if we were to add a new partner in the next couple of months. So I'd argue we try to get to threshold ASAP just to avoid all this hassle.
@DavidM-D @itegulov WDYT?
@volovyks anything you want to add in with this?
So I largely agree, this is more akin to a database migration than it is to something we should have an endpoint for. This can be a relatively manual process, but we will inevitably need to do this at some point and I'd like us to do the ground work sooner rather than later. The reason is that if an MPC node provider says I don't want to be a node provider any more, we need to be ready to migrate off pretty quickly.
I'm happy if we just have a process in place and all the code we'll need to run that process written down.
I'll just get something scripted up and have a CLI call path for now then. That should be the simplest approach, without having to maintain an endpoint that an attacker could potentially abuse.
Onto the design for moving over to migrating user recovery keys on a new node joining the network or on a node leaving.
As a reminder, currently, a user recovery key is generated by going through the following process:
- Leader node gets called with `/new_account`.
- This triggers a call into `/public_key` on each sign node, which returns a public key share to be combined on the leader node into the recovery key.
- The recovery key is added to a delegate action and sent to the relayer to be committed to the network.
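The combination step above can be sketched with a toy additive model, assuming the shares combine additively (as elliptic-curve public keys do under point addition). Integers modulo a large prime stand in for curve points, and all share values are made up.

```python
# Toy model of recovery-key assembly: each sign node contributes a public
# key share, and the leader combines them additively. Integers modulo a
# large prime stand in for curve points; the values are illustrative only.
P = 2**255 - 19

def combine_shares(shares):
    """Combine per-node public key shares into the user's recovery key."""
    return sum(shares) % P

node_shares = [1111, 2222, 3333]  # one /public_key response per sign node
recovery_key = combine_shares(node_shares)
```

The real system would combine actual curve points, but the algebra that matters for the migration discussion (order-independent, one term per node) is the same.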
Then I propose we go through this process flow for replacing the recovery key when a new node joins and/or an old node leaves:
- A new leader node CLI endpoint is added for calling into `/user_credentials` to get the current recovery key shares for an account-id we'll call `deadbeef`.
  - NOTE: this eventually requires a list of all users we have in the sign node, and might need a new sign node endpoint to retrieve the list of users each sign node has.
- The leader node then calls into the new node's `/public_key` or `/user_credentials` for the account-id `deadbeef` to generate its respective public key share and get it back on the leader node side.
At this stage, we should have the individual pieces of the recovery key of all the nodes in our old and new system.
- We can then displace the share of the node we want to replace with the new node's share, creating a new recovery key for `deadbeef` without the current nodes needing to change anything in the datastore (since the recovery key was originally generated from a series of public keys, we can do the same here by having a new public key enter the mix as the old one is removed). We'll then call into the `/sign` flow with the old set of nodes to push the transaction through, replacing the old recovery key with the new one.
  - TODO: still need to see how viable it is to replace a key, given the limitations on the number of keys per account.
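The displacement step can be shown in the same toy additive model (integers stand in for curve points; all values are made up): subtract the leaving node's share, add the new node's, and the result matches recombining from scratch with the new node set.

```python
# Toy additive model of swapping one node's share out of the combined
# recovery key. Integers stand in for curve points; values are made up.
P = 2**255 - 19

def combine_shares(shares):
    return sum(shares) % P

old_shares = [1111, 2222, 3333]          # shares from nodes A, B, C
old_recovery_key = combine_shares(old_shares)

# Node C leaves; node Z joins with a fresh share for this account.
c_share, z_share = 3333, 4444
new_recovery_key = (old_recovery_key - c_share + z_share) % P
```

This is why, under the additive assumption, the existing nodes don't need to touch their datastores: only the on-chain key changes.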
Once the new recovery key is committed to the chain, there's really no going back unless we do another transaction to go back to the old one... So if a failure were to happen along the way and the migration needs to be rolled back, we should store a mapping of the migrated account ids to their respective old recovery shares. We need this mapping anyways for zero downtime, as we are required to point the migrated users over to the new node system.
@DavidM-D @itegulov @volovyks lmk what you guys think
@ChaoticTempest this process sounds reasonable to me. I will try to summarise; tell me if I got it right. I will also point out some issues I have found.
- Leader node should have a list of users to perform the rotation. From what I understood, this problem is unsolved. Basically, it's a list of internal user ids, but I'm not sure if that is enough.
- Call the new node N times so it creates a key share for each user.
- Call the new node N times to get each user's PK share. (This will not work, because a node will give out the PK only to the holder of the ID token.) (Yeah, it's a new security "feature".)
- Leader node will use the old set of nodes to add the new combined user key, and maybe delete the old one.
Discussed this with @volovyks offline. The approach mentioned before for rotating user recovery keys on chain no longer really works without introducing hacks like privileged endpoints that only the leader node could access (but that introduces potential attacks too). This is due to front running protection now requiring oidc_token and (frp_sig, frp_pk) for getting the recovery key.
To simplify this down, would it be possible to do a firebase db transfer from one account over to another, @itegulov? If the above holds true, we were thinking the following flow for migrating off node A with node Z joining:
- Initiate transfer of firebase db from node A to node Z.
- node A shares the cipher that encrypts all user creds with node Z.
- node Z calls into the earlier implementation of node-key rotation to decrypt and re-encrypt with their new key share and corresponding cipher.
This way, the recovery key on chain is no longer needed. But I fear we will run into this issue again when we go from our current system to threshold signing. So, would threshold signing require us to change the recovery keys on chain too?
@DavidM-D @itegulov wdyt of all this? Also, how important would it be to include this migration feature in the release?
> To simplify this down, would it be possible to do a firebase db transfer from one account over to another?
Yes, this is actually what I did during that manual migration I did a couple of months ago. Here is the python script I wrote and used, might be useful as a reference (assumes some downtime as you need to stop writing to the source DB):
```python
from google.oauth2 import service_account
from google.cloud import datastore

credentials_dev = service_account.Credentials.from_service_account_file(
    '../pagoda-discovery-platform-dev-92b300563d36.json')
client_dev = datastore.Client(project="pagoda-discovery-platform-dev",
                              credentials=credentials_dev)

credentials_prod = service_account.Credentials.from_service_account_file(
    '../pagoda-discovery-platform-prod-1f69134d6c22.json')
client_prod = datastore.Client(project="pagoda-discovery-platform-prod",
                               credentials=credentials_prod)

# Wipe the destination kind before copying.
print('Cleaning prod entities...')
query = client_prod.query(kind="EncryptedUserCredentials-mainnet")
keys = [entity.key for entity in query.fetch()]
client_prod.delete_multi(keys)

# Fetch source entities and re-key them for the destination project/kind.
print('Fetching dev entities')
query = client_dev.query(kind="EncryptedUserCredentials-prod")
entities = []
for entity in query.fetch():
    entity.key = client_prod.key(
        'EncryptedUserCredentials-mainnet').completed_key(entity.key.id_or_name)
    print(entity.key)
    print(entity)
    entities.append(entity)

print("Uploading a total of " + str(len(entities)) + " entities")
client_prod.put_multi(entities)
```
> So, would threshold signing require us to change the recovery keys on chain too?
So the recovery keys are going to remain constant as long as everything is going well. The only reason we might want to rotate them is if we somehow screwed up and the combined private key got exposed, so it's still a very manual, last-resort process.
From my perspective this is something we can keep out of the upcoming release, but would be nice to have sooner rather than later just in case. Would late September be more of a reasonable goal?
My opinion is that we need to prepare for what is important right now. DB transfer + node key rotation should be enough to be prepared for one of our partners leaving under the current design. Let's solve the transfer to threshold signing when it's required.
OK I don't have as much context as you guys so take what I say with a pinch of salt.
We certainly don't need this in the next release.
I'm fine with us having a leader-only endpoint that gives us access to users' public keys. The leader node already observes these on every sign. Obviously, only do this if it makes our life easier.
It feels like Daniyar's migration script acts as a good-enough stopgap for the case of someone abruptly not wanting to be a node provider. Obviously we'd want to rotate the keys in short order, but it would keep things running in the interim.
Since this might never happen (before we migrate to Threshold) is this a good enough solution?
Exposed keys are a problem, but if somebody somehow exposes the private keys by the time we work it out it'll probably be too late. Do we simplify the engineering work by setting all the FAKs to the same "emergency" key in the event of a breach rather than doing a full key rotation?
@ChaoticTempest do you think it's too early to work on the rotation over to Threshold? If so we can just put this on the backburner.
We should remove features of this wherever possible but test whatever we do make well.
> Would late September be more of a reasonable goal?
@itegulov I think that would be best especially with the offsite happening soon.
> @ChaoticTempest do you think it's too early to work on the rotation over to Threshold? If so we can just put this on the backburner.
@DavidM-D yeah, I fear threshold signing might look a lot different from what I'm imagining, so I'd rather push the user-key rotation until we have a threshold design.
> Exposed keys are a problem, but if somebody somehow exposes the private keys by the time we work it out it'll probably be too late. Do we simplify the engineering work by setting all the FAKs to the same "emergency" key in the event of a breach rather than doing a full key rotation?
That's one way, but I'd like to avoid it, as we'd be centralized on that one key, right? It would still require coordinating with all the signer nodes to rotate their respective key shares, so I'd prefer we just implement it correctly with new keys. Unless you're talking about some other "emergency" key that we should always have in each account, such that if a breach happens, we simply delete the MPC-created key.
> It feels like Daniyar's migration script acts as a just good enough stopgap for the case of someone abruptly not wanting to be a node provider. Obviously we'd want to rotate the keys in short order, but it would keep things running in the interim.
> Since this might never happen (before we migrate to Threshold) is this a good enough solution?
I'll write a test for it, just so we can have more confidence in that script.