Polykey
Polykey copied to clipboard
Updating remote node information after root keypair and `NodeID` has changed.
Specification
The root certificate of nodes and by extension their root key pair and NodeID
is allowed to change. There are 3 methods in the KeyManager
that does this;
-
renewRootKeyPair
, Generates a new root keyPair, adds it to the cert chain and is signed by the previous certificate. This creates a newNodeId
ensures that the oldNodeId
is still valid. -
resetRootKeyPair
, Throws out the old cert chain and generates a new on from scratch. The new cert is self signed. This is a newNodeId
and the old one is now Invalid. -
resetRootCert
. This keeps the old root keypair but generates the cert chain from scratch. This means theNodeId
remains the same but the cert chain is a new one from scratch.
What we want to handle is if a nodes starts using a new NodeId
but it's old NodeId
remains valid within it's cert chain. When dealing with these nodes we need to treat it as the valid old NodeId
while updating any information we have about it to use it's new NodeId
.
This means we will have to check the full cert chain when connecting to a node. Determine if the NodeId
we're expecting is an older node in the cert chain. If it is older we need to obtain the new NodeId
of the node and start using that.
The following domains need to be updated to handle the NodeId
of a remote node changing.
-
sigchain
This stores any claims between nodes. I think a new claim would need to be generated and reference the old one. but this wlll need to be looked into. -
GestaltGraph
This keeps track of nodes within their gestalts. If a node's ID changes then it needs to be updated here. -
ACL
stores permissions allowed for each node. If a node's ID changes then it needs to be updated here. -
NodeGraph
, I think that the new node will be added to the NodeGraph automatically through the usual operation. We may need to remove the old NodeId from the graph. Since the node can be contacted via all it's NodeIds in the cert chain it's possible for a node to exist in the graph as multipleNodeId
s. So we only want to keep the newestNodeId
for the node.
It's also possible that when a node's ID changes it's old NodeId
and information will slowly exit the network as it is replaced with the new NodeId
. As a result the node will become un-discoverable with the old ID though it may be unlikely. We may need a way to discover a node ID via it's old nodeId by having nodes keep information about the ID change when we update our information on it. That way we can inform other nodes about the change when they come looking for it.
Additional context
- Relates to #317
- #403 - auto configuration of the seed cluster also has to deal with
NodeId
changes - #347 - never figured out the core problem for test failures during key renewal
- #444 - Main big issue about considering observable and push-based dataflow, however that's about inter-domain reactivity, whereas this is about node to node reactivity
Tasks
- Add a way to detect when a nodes root certificate, key and ID has changed when we connect to it.
- Update any information about the node when we detect the change. These changes will need to be applied to the following domains.
-
SigChain
-
GestaltGraph
-
ACL
-
NodeGraph
-
- Add ability to make note of the change to inform other nodes of the change.
I actually still don't like the fact that we get our NodeId
from this.keyManager.getNodeId()
.
Obviously the KeyManager
is the SoT for the root key, but it has a code smell.
A Node
is an abstraction around the KeyManager
, therefore the KeyManager
should be supplying its functionality to a Node
class or something representing the current node.
That class can then provide some of the common functionality required...
But NodeId is a derived property, and is likely to change over time. We would need things to instead be "observing" the NodeId value.
I think some research into observables might be relevant.
You can get the current Node Id, or you can register to observe the NodeId. I imagine some changes to EventBus
may be relevant too.
This then allows our domain models to be reactive to change. Right now they are only reactive to function calls.
Consider the Push/Pull concept:
Data Flow
┌─────────────────────────────────────┐
│ │
┌───────┴───────┐ ┌────────▼───────┐
│ │ │ │
│ Desired State ├────────Push────────► Realised State │
│ │ │ │
└───────┬───────┘ └────────▲───────┘
│ │
└─────────────────────────────────────┘
Control Flow
Data Flow
┌─────────────────────────────────────┐
│ │
┌───────┴───────┐ ┌────────▼───────┐
│ │ │ │
│ Desired State ├────────Pull────────► Realised State │
│ │ │ │
└───────▲───────┘ └────────┬───────┘
│ │
└─────────────────────────────────────┘
Control Flow
We could setup a push based reactivity where NodeId changes pushes out to the other domains. And this may apply to several areas in the code.
In #466, I've renamed these methods to (in CertManager
):
renewCertWithNewKeyPair
resetCertWithNewKeyPair
resetCertWithExistingKeyPair
These names are more descriptive as to the intention of what these are meant to do.
During operational usage, only renewCertWithNewKeyPair
is used. The other 2 are more used for diagnostics and debugging.
The user may decide however to "break" the chain... and use the the other 2 functions.
If they do, we have to plan for how this impacts the sigchain. It may result in a breaking the sigchain since all previous claims are no longer valid or can be verified.
Specifically resetCertWithNewKeyPair
basically means there's no cryptographic link from the previous key node identity to the new identity. It's almost like starting fresh in the identity space of Polykey.
However resetCertWithExistingKeyPair
is curious. Because although you break the link with previous identities. You do you have still retain the current identity, which means, any claims signed by the current identity is still valid.
We may need to be able to automatically garbage collect invalid claims on the sigchain... or we still preserve this data, but make the consumers robust against this sort of situation by ignoring claims are that are no longer cryptographically valid.
See: https://github.com/MatrixAI/Polykey/pull/446#issuecomment-1275712892 for discussion regarding what happens during certificate renewal.
There are many domains that currently don't work when the NodeId
changes. For example the sigchain creates a claim ID generator, the generator must be updated with the new NodeId
when it has been changed. To ensure this occurs, we basically have to subscribe to these changes. However lacking rxjs integration right now, we have to do this with an explicit update function on these objects, and then have this be set up in the PolykeyAgent
on the cert change data.
this.events.on(
PolykeyAgent.eventSymbols.CertManager,
async (data: CertManagerChangeData) => {
this.logger.info(`${KeyRing.name} change propagating`);
await this.status.updateStatusLive({
nodeId: data.nodeId,
});
await this.nodeManager.resetBuckets();
// Update the sigchain
await this.sigchain.onKeyRingChange();
const tlsConfig: TLSConfig = {
keyPrivatePem: keysUtils.privateKeyToPEM(
data.keyPair.privateKey,
),
certChainPem: await this.certManager.getCertPEMsChainPEM(),
};
this.grpcServerClient.setTLSConfig(tlsConfig);
this.proxy.setTLSConfig(tlsConfig);
this.logger.info(`${KeyRing.name} change propagated`);
},
);
For now, this is a short term fix until we properly integrate rxjs observables into our class objects.
It makes more sense for these objects to subscribe to the specific data. And for the ability to "emit" change events to their properties in the realm of the domain objects themselves.
So in the future I expect Sigchain
to subscribe to a property like this.keyRing.nodeId$
.
About the $
suffix naming convention: https://stackoverflow.com/questions/70053021/should-i-use-sign-suffix-for-subjects-in-the-rxjs-and-angular
@tegefaulkes can you confirm if #552 will end up fixing this issue? If so, change #552 to fix this issue instead of related.
I think even after this is technically done, extensive testing would need to be done to ensure that this change propagation is in fact robust and also works across the network. But we are likely get there only at testnet 7 rather than testnet 6 #551
Relevant?
I think it's still relevant. No direct work has been done to address this. We can handle a node when it's ID has changed but currently we don't really update anything about the ID change.
This has changed from using the EventBus
to using js-events
where all the domains are evented. However this behaviour has not been verified yet. So PK nodes atm cannot handle nodes that it has previously connected to, but the NodeId
has changed?
I was refactoring NodeGraph
in https://github.com/MatrixAI/Polykey/pull/618, and I noticed the usage of this.keyRing.getNodeId()
alot. While this allows NodeGraph
to always get the latest NodeId
, of the node, it also means it's pull oriented and also requires a separate system to remember to call NodeGraph.resetBuckets
when the node's own NodeId
changes.
It's also a bit awkward to require a huge dependency on the KeyRing
only to get the NodeId
. This seems like a code smell. At the very least we should be able to decouple the node's own NodeId
as a sort of reactive property that is driven by changes in the system. So it's something to think about.