Polykey Prevent RPC calls for Unauthenticated `NodeConnection`s (Segregated Network Connections)

Specification

In order to segregate nodes of different networks using the ClaimNetworkAccess tokens defined in https://github.com/MatrixAI/Polykey/issues/779, there needs to be some logic for nodes to prevent accepting RPC calls from nodes that are out-of-network. There will need to be some RPC calls that are whitelisted to enable some level of out of network negotiation such as authenticating into the network or requesting access.

Currently a NodeConnection has two stages, the creating stage when you are awaiting the createNodeConnection factory. And the connected stage when the NodeConnection fully connects and the creation resolve. The problem with this is we need to be able to authenticate without allowing most RPC calls and nominal traffic.

To this end we need a connected but unauthenticated state where the NodeConnection is fully operational but not entertaining RPC calls by rejecting them. The NodeConnection.createNodeConnection should negotiate it's access to the network before resolving with the completed connection. This means the connecting state and authenticating state is hidden inside the creation of the NodeConnection. This will mean minimal changes to how the NodeConnectionManager handles these connections. However, the NodeConnection will need separate connected, authenticated and created events to reflect these stages.

Connection event should be recorded by the audit domain as a connection
Authenticated event should be recorded by the audit domain as well. Probably a new authenticated event. There will be a distinct forward and reverse authentication event.
Not actually sure about creation, fills the same role as authenticated so subject to discussion.

While in the authenticating state, all non whitelisted RPC calls need to be rejected outright. That said, the NodeConnection shouldn't be available to make these calls at this stage but we still need to be secure about it. To avoid having to add logic to all of the RPC handlers we should apply this connection rejection logic to the middleware. The middleware will need to refer back to the NodeConnection somehow to assess the authenticated state.

To authenticate the NodeConnection needs to make an authentication RPC call and provide a valid network token. It's up to the handler of this to decide to reject the authentication with an error and kill the NodeConnection if it fails. Note that this needs to be symmetric in the forward and reverse direction. BOTH sides of the connection need to fully authenticate. This opens us to annoying race conditions so we need to be extra careful here. The handshake has been described within #779.

It is extremely important that the following conditions are met.

A NodeConnection is only fully created if it fully authenticates.
The NodeConnection is only added to the NodeConnectionManager's connection map if it fully authenticates.
The connection details are only added to the NodeGraph if it fully authenticates.
No non whitelisted RPC calls are allowed or made during the Authenticating state.

Additional context

Related: #779 - Defines the network access tokens and how they are verified. Related: #770 - Parent issue

Tasks

Add an authentication state to the NodeConnection.
Add middleware logic to RPC to only allow non whitelisted RPC calls to be handled if we are in the authenticated state. Otherwise kill the stream involved.
Add an RPC handler authentication the connections access to the network. This will resolve with no message, or throw an error for why it failed to authenticate. This should be whitelisted.
Review and expand on the creation events to include separate connection, authentication and creation events.
NodeConnection must switch to the authenticated state and fully create only after both the authentication handler and call succeeds and resolves.
We need to make sure timeouts work here. There are two levels of timeouts. It's up to the handler side to kill the connection if it fails, but the calling side can timeout and kill the connection as a fail-safe. We also need an overall timeout for the whole process so if it fails to authenticate fully before the timeout then we just give up and kill the connection.

Jul 29 '24 06:07 amydevs

ENG-373 Network Awareness of `NodeManager`

Jul 29 '24 06:07 linear[bot]

@amydevs This will be easier to start with if the token logic is bending your mind. The only point of contact with the token logic here is the authentication logic utility function and the token payload. You should stub both of these out in testing.

Jul 30 '24 04:07 tegefaulkes

What's the status of this? @amydevs @tegefaulkes

Sep 03 '24 02:09 CMCDragonkai

This still needs to be worked on. I'll be taking it over while @amydevs starts on the PKE work with the DB domain.

Sep 03 '24 02:09 tegefaulkes

With #775 being merged the underlying data structure for separating networks has been implemented. This issue will focus on Managing the connections in a way to keep the networks separate. It will also only allow certain RPC calls during the connection's unauthenticated state.

When this issue is completed then we should have everything implemented to separate public networks. Further expansion to the authentication token logic will need to be done to allow for private networks later down the line.

Sep 03 '24 02:09 tegefaulkes

After handing this of to Brian, we discussed over several options to implement this.

The most intuitive way would be to make sure that the static createNodeConnection function awaits for the authentication process to finish before returning with a NodeConnection.

However, this presents problems:

As the NodeConnectionManager own the RPCServer, as well as makes calls to createNodeConnection. The resolution of createNodeConnection promise depends on a handled call by the RPCServer. However, at that point, the NodeConnection has not been created yet, so there is no conceivable way to notify the createNodeConnection method call that authentication has finished.
In order to keep this data model, we would need to have the NodeConnection be available during the handling of the RPC call that authenticates the peer. This is not possible as the RPCServer instance is established per NodeConnectionManager rather than per NodeConnection.
The only other conceivable way to fix this is to pass an async callback/promise to the createNodeConnection that will resolve once the connection has been authenticated.

Sep 03 '24 03:09 amydevs

I was thinking that node connections is a lower level, and just do your gated calls at the RPC layer. Remember it's an application layer concern that they aren't on the same network. You have to check their sigchain claim. I don't think node connections being a lower level concern should even be aware of this problem. Factor out the abstraction to solve this.

Sep 03 '24 03:09 CMCDragonkai

I was thinking that node connections is a lower level, and just do your gated calls at the RPC layer. Remember it's an application layer concern that they aren't on the same network. You have to check their sigchain claim. I don't think node connections being a lower level concern should even be aware of this problem. Factor out the abstraction to solve this.

Importantly don't mix up abstraction layers. Otherwise the entanglement will cause modularity problems in the future.

Sep 03 '24 03:09 CMCDragonkai

I was hoping at the time to isolate the changes to the NodeConnection so we wouldn't need to modify the NodeConnectionManager at all. After discussing this with @amydevs and mulling over it. It seems that the best place to manage this logic will be in the NodeConnectionManager itself.

This means that the NCM will coordinate the NC authentication process and only make the connection available to be used after that has been completed. We should be able to get away with not allowing most RPC calls since we can prevent access to the NC in question until it has been verified to be part of the network.

Sep 03 '24 07:09 tegefaulkes

Can you disallow ALL RPC calls? Usually RPC authentication would be an RPC middleware. But if you put it into NCM, then you'd need to call the RPC in the NCM. That would also mean NCM ends up having knowledge about sigchain claims. I feel like that's too much knowledge built into NCM. Seems like an NM sort of thing to know. Remember NCM is just NCs, but NM can do higher level semantic operations like knowing about the sigchain and making interpretations on it. NM seems like a better place for all this logic.

Sep 03 '24 07:09 CMCDragonkai

We can, but there are two aspects to it.

If we can't access the node connection from the NCM then we can't make the call in the first place. That said we shouldn't entertain receiving them either.
We need to all for some RPC calls to allow negotiating the authenticating stage or requesting access to the network.

Sep 03 '24 08:09 tegefaulkes

I don't know what you mean by 1. But I thought we can make node connections established simply because of quic. Then sigchain has to be checked at a higher abstraction layer?

Sep 03 '24 08:09 CMCDragonkai

That's what we're doing, yes. There are two levels of authenticating the connection. The normal connection level done by QUIC which will be unchanged. And the higher level where we negotiate authentication for the network. 1. Is only saying that we can maybe get away without disabling the RPC calls since we can't make them without access to the NodeConnection and we can't get the NodeConnection from the NCM without it being authenticated.

But since we can't control when the reverse RPC requests can be made then we'd probably have to enforce it for the handlers via the reverse middleware anyway. May has well handle it for both directions at that point.

Sep 04 '24 01:09 tegefaulkes

When this is merged - that means testnet and mainnet is separated.

Sep 09 '24 00:09 CMCDragonkai

Status of this?

Sep 29 '24 18:09 CMCDragonkai

I've made some progress but it's put on hold until I get the first milestone of PKE beta done.

Sep 30 '24 03:09 tegefaulkes

Was anything changed in the spec with respect to implementation in #804?

Feb 06 '25 15:02 CMCDragonkai

Only three deviations from the spec.

I haven't added specific forward or and reverse authenticated events and authentication isn't being added to the audit domain.
The connections exist within the connection map weather it's been authenticated or not. But can only be used outside of the NodeConnectionManager if it's been authenticated.
The private network authentication logic hasn't been implemented yet. We have a simple mechanism that checks the network name. This doesn't use any cryptography but is sufficient for public networks.

Feb 06 '25 23:02 tegefaulkes

Is 1. and 2. a problem?

Feb 06 '25 23:02 CMCDragonkai

having authenticated audit events isn't needed right now. It will be useful later down the line for PKE.
Not a problem, it simplified the logic. NCM will need to access connections to make calls like authenticating the connection. Later down the line other whitelisted RPC calls will be allowed.

Feb 06 '25 23:02 tegefaulkes