Polykey Updating remote node information after root keypair and `NodeID` has changed.

Specification

The root certificate of nodes and by extension their root key pair and NodeID is allowed to change. There are 3 methods in the KeyManager that does this;

renewRootKeyPair, Generates a new root keyPair, adds it to the cert chain and is signed by the previous certificate. This creates a new NodeId ensures that the old NodeId is still valid.
resetRootKeyPair, Throws out the old cert chain and generates a new on from scratch. The new cert is self signed. This is a new NodeId and the old one is now Invalid.
resetRootCert. This keeps the old root keypair but generates the cert chain from scratch. This means the NodeId remains the same but the cert chain is a new one from scratch.

What we want to handle is if a nodes starts using a new NodeId but it's old NodeId remains valid within it's cert chain. When dealing with these nodes we need to treat it as the valid old NodeId while updating any information we have about it to use it's new NodeId.

This means we will have to check the full cert chain when connecting to a node. Determine if the NodeId we're expecting is an older node in the cert chain. If it is older we need to obtain the new NodeId of the node and start using that.

The following domains need to be updated to handle the NodeId of a remote node changing.

sigchain This stores any claims between nodes. I think a new claim would need to be generated and reference the old one. but this wlll need to be looked into.
GestaltGraph This keeps track of nodes within their gestalts. If a node's ID changes then it needs to be updated here.
ACL stores permissions allowed for each node. If a node's ID changes then it needs to be updated here.
NodeGraph, I think that the new node will be added to the NodeGraph automatically through the usual operation. We may need to remove the old NodeId from the graph. Since the node can be contacted via all it's NodeIds in the cert chain it's possible for a node to exist in the graph as multiple NodeIds. So we only want to keep the newest NodeId for the node.

It's also possible that when a node's ID changes it's old NodeId and information will slowly exit the network as it is replaced with the new NodeId. As a result the node will become un-discoverable with the old ID though it may be unlikely. We may need a way to discover a node ID via it's old nodeId by having nodes keep information about the ID change when we update our information on it. That way we can inform other nodes about the change when they come looking for it.

Additional context

Relates to #317
#403 - auto configuration of the seed cluster also has to deal with NodeId changes
#347 - never figured out the core problem for test failures during key renewal
#444 - Main big issue about considering observable and push-based dataflow, however that's about inter-domain reactivity, whereas this is about node to node reactivity

Tasks

Add a way to detect when a nodes root certificate, key and ID has changed when we connect to it.
Update any information about the node when we detect the change. These changes will need to be applied to the following domains.
1. SigChain
2. GestaltGraph
3. ACL
4. NodeGraph
Add ability to make note of the change to inform other nodes of the change.

Jun 21 '22 04:06 tegefaulkes

I actually still don't like the fact that we get our NodeId from this.keyManager.getNodeId().

Obviously the KeyManager is the SoT for the root key, but it has a code smell.

A Node is an abstraction around the KeyManager, therefore the KeyManager should be supplying its functionality to a Node class or something representing the current node.

That class can then provide some of the common functionality required...

But NodeId is a derived property, and is likely to change over time. We would need things to instead be "observing" the NodeId value.

I think some research into observables might be relevant.

You can get the current Node Id, or you can register to observe the NodeId. I imagine some changes to EventBus may be relevant too.

This then allows our domain models to be reactive to change. Right now they are only reactive to function calls.

Consider the Push/Pull concept:

                      Data Flow
        ┌─────────────────────────────────────┐
        │                                     │
┌───────┴───────┐                    ┌────────▼───────┐
│               │                    │                │
│ Desired State ├────────Push────────► Realised State │
│               │                    │                │
└───────┬───────┘                    └────────▲───────┘
        │                                     │
        └─────────────────────────────────────┘
                    Control Flow



                      Data Flow
        ┌─────────────────────────────────────┐
        │                                     │
┌───────┴───────┐                    ┌────────▼───────┐
│               │                    │                │
│ Desired State ├────────Pull────────► Realised State │
│               │                    │                │
└───────▲───────┘                    └────────┬───────┘
        │                                     │
        └─────────────────────────────────────┘
                    Control Flow

We could setup a push based reactivity where NodeId changes pushes out to the other domains. And this may apply to several areas in the code.

Aug 01 '22 13:08 CMCDragonkai

In #466, I've renamed these methods to (in CertManager):

renewCertWithNewKeyPair

resetCertWithNewKeyPair

resetCertWithExistingKeyPair

These names are more descriptive as to the intention of what these are meant to do.

During operational usage, only renewCertWithNewKeyPair is used. The other 2 are more used for diagnostics and debugging.

The user may decide however to "break" the chain... and use the the other 2 functions.

If they do, we have to plan for how this impacts the sigchain. It may result in a breaking the sigchain since all previous claims are no longer valid or can be verified.

Specifically resetCertWithNewKeyPair basically means there's no cryptographic link from the previous key node identity to the new identity. It's almost like starting fresh in the identity space of Polykey.

However resetCertWithExistingKeyPair is curious. Because although you break the link with previous identities. You do you have still retain the current identity, which means, any claims signed by the current identity is still valid.

We may need to be able to automatically garbage collect invalid claims on the sigchain... or we still preserve this data, but make the consumers robust against this sort of situation by ignoring claims are that are no longer cryptographically valid.

Oct 10 '22 06:10 CMCDragonkai

See: https://github.com/MatrixAI/Polykey/pull/446#issuecomment-1275712892 for discussion regarding what happens during certificate renewal.

Oct 12 '22 07:10 CMCDragonkai

There are many domains that currently don't work when the NodeId changes. For example the sigchain creates a claim ID generator, the generator must be updated with the new NodeId when it has been changed. To ensure this occurs, we basically have to subscribe to these changes. However lacking rxjs integration right now, we have to do this with an explicit update function on these objects, and then have this be set up in the PolykeyAgent on the cert change data.

      this.events.on(
        PolykeyAgent.eventSymbols.CertManager,
        async (data: CertManagerChangeData) => {
          this.logger.info(`${KeyRing.name} change propagating`);
          await this.status.updateStatusLive({
            nodeId: data.nodeId,
          });
          await this.nodeManager.resetBuckets();
          // Update the sigchain
          await this.sigchain.onKeyRingChange();
          const tlsConfig: TLSConfig = {
            keyPrivatePem: keysUtils.privateKeyToPEM(
              data.keyPair.privateKey,
            ),
            certChainPem: await this.certManager.getCertPEMsChainPEM(),
          };
          this.grpcServerClient.setTLSConfig(tlsConfig);
          this.proxy.setTLSConfig(tlsConfig);
          this.logger.info(`${KeyRing.name} change propagated`);
        },
      );

For now, this is a short term fix until we properly integrate rxjs observables into our class objects.

It makes more sense for these objects to subscribe to the specific data. And for the ability to "emit" change events to their properties in the realm of the domain objects themselves.

So in the future I expect Sigchain to subscribe to a property like this.keyRing.nodeId$.

Oct 15 '22 12:10 CMCDragonkai

About the $ suffix naming convention: https://stackoverflow.com/questions/70053021/should-i-use-sign-suffix-for-subjects-in-the-rxjs-and-angular

Oct 15 '22 12:10 CMCDragonkai

@tegefaulkes can you confirm if #552 will end up fixing this issue? If so, change #552 to fix this issue instead of related.

Sep 12 '23 08:09 CMCDragonkai

I think even after this is technically done, extensive testing would need to be done to ensure that this change propagation is in fact robust and also works across the network. But we are likely get there only at testnet 7 rather than testnet 6 #551

Sep 12 '23 08:09 CMCDragonkai

Relevant?

Oct 18 '23 06:10 CMCDragonkai

I think it's still relevant. No direct work has been done to address this. We can handle a node when it's ID has changed but currently we don't really update anything about the ID change.

Oct 18 '23 23:10 tegefaulkes

This has changed from using the EventBus to using js-events where all the domains are evented. However this behaviour has not been verified yet. So PK nodes atm cannot handle nodes that it has previously connected to, but the NodeId has changed?

Nov 05 '23 23:11 CMCDragonkai

I was refactoring NodeGraph in https://github.com/MatrixAI/Polykey/pull/618, and I noticed the usage of this.keyRing.getNodeId() alot. While this allows NodeGraph to always get the latest NodeId, of the node, it also means it's pull oriented and also requires a separate system to remember to call NodeGraph.resetBuckets when the node's own NodeId changes.

It's also a bit awkward to require a huge dependency on the KeyRing only to get the NodeId. This seems like a code smell. At the very least we should be able to decouple the node's own NodeId as a sort of reactive property that is driven by changes in the system. So it's something to think about.

Nov 10 '23 18:11 CMCDragonkai

Polykey Polykey copied to clipboard

Updating remote node information after root keypair and `NodeID` has changed.

Specification

Additional context

Tasks

Polykey
Polykey copied to clipboard