swarm/: Support generic connection management
Description
Today a user can:
- Set a limit for incoming/outgoing/pending connections, global and per peer.
- Close a connection via `NetworkBehaviour`.
- Keep a connection alive via `ConnectionHandler::keep_alive`.

A user is restricted in (1) as they can not do anything beyond setting upper bound limits. E.g. they can not decide whether to accept or deny an incoming connection based on the IP, the `PeerId` or on their current CPU or memory utilization.
(Taken from https://github.com/libp2p/rust-libp2p/discussions/2118#discussioncomment-3134752.)
I see 3 ways we can enable users to do advanced connection management:
1. Make the `pool::ConnectionCounter` generic and allow users to provide their own.
   - Yet another moving piece.
   - When not `Box`ed, requires an additional generic parameter.
   - Can only do connection management. Not as powerful as the `NetworkBehaviour` trait and thus limited in its decision making.
2. Extend `NetworkBehaviour` with methods called to review a pending or established connection.
   - Yet another set of methods on `NetworkBehaviour`.
   - Has knowledge of the full state, thus allows powerful decision making.
   - Users are already familiar with it.
3. Emit a `SwarmEvent::ReviewConnection` requiring the user to manually accept each pending/established connection via e.g. `Swarm::accept()`.
   - Complicates the getting-started experience. User has to explicitly call a method to make progress. Not intuitive.
Motivation
Allows downstream users to do advanced connection management and can simplify existing implementations working around this today.
- https://github.com/sigp/lighthouse/tree/stable/beacon_node/lighthouse_network/src/peer_manager
- https://github.com/paritytech/substrate/blob/master/client/peerset/src/lib.rs
- https://github.com/n0-computer/iroh/tree/main/iroh-p2p
Are you planning to do it yourself in a pull request?
Yes
I am currently working on proposal (2), i.e. extending NetworkBehaviour. Hope to publish a first proof-of-concept this week.
How is (2) going to work in terms of composing behaviours? One NetworkBehaviour declining a connection means it will be denied? i.e. Need full agreement across all NetworkBehaviours?
How is (2) going to work in terms of composing behaviours? One `NetworkBehaviour` declining a connection means it will be denied? i.e. Need full agreement across all `NetworkBehaviour`s?
Correct.
Draft implementation of (2) is happening on https://github.com/libp2p/rust-libp2p/pull/2828. Still work in progress, but might help understand the idea behind (2). Happy to expand on any of this in written form. No need to dig through the code in case you don't want to.
Going to start on this once https://github.com/libp2p/rust-libp2p/pull/3011 is merged to avoid the churn.
I think we should design this with a concrete use-case in mind. @divagant-martian would you volunteer to oversee this from the lighthouse user perspective?
@thomaseizinger can you keep @divagant-martian in the loop?
@mxinden for sure! I'm also relatively familiar with substrate's peer management and can probably get in touch with the iroh guys to check what needs we have in common and where they differ.
NetworkBehaviour already has quite a few callbacks, although we are getting rid of some of them in #3011. Thus, I am a bit hesitant to continue in the direction of #2828.
So here is an alternative idea:
Make IntoConnectionHandler fallible
NetworkBehaviour::new_handler is already invoked for every incoming connection. If a NetworkBehaviour would like to impose a policy on which connections should be established, it can pass the required data to a prototype implementation of IntoConnectionHandler and make the decision as soon as into_handler is called with the PeerId etc.
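For illustration, here is a minimal sketch of that prototype pattern. The fallible `into_handler` is hypothetical (the actual `IntoConnectionHandler::into_handler` in rust-libp2p is infallible), and `Handler` stands in for a real `ConnectionHandler`:

```rust
use libp2p::PeerId;
use std::collections::HashSet;

struct Handler; // stand-in for an actual `ConnectionHandler`

/// Prototype created by `new_handler`, carrying whatever data the
/// behaviour needs to vet the connection once the peer is known.
struct Prototype {
    banned: HashSet<PeerId>,
}

impl Prototype {
    /// Hypothetical fallible conversion: returning an error here
    /// would deny the connection.
    fn into_handler(self, remote: &PeerId) -> Result<Handler, String> {
        if self.banned.contains(remote) {
            return Err(format!("{remote} is banned"));
        }
        Ok(Handler)
    }
}
```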
There are a few drawbacks with the current design but I think those are all fixable:
- If somehow possible, I'd like to get rid of `IntoConnectionHandler` and delay calling `new_handler` until we have established the connection. This would allow us to directly make `new_handler` fallible, which would make this feature much easier to use. Additionally, it means we won't run into race conditions with simple limit connection managers as `new_handler` is called with `&mut self` and thus will be called sequentially for each connection.
- `NetworkBehaviourAction::Dial` currently also emits an entire handler. To make sure connection limits are enforced in a single place, I'd replace this mechanism with an "OpenInfo" style where `Dial` can carry some "OpenInfo" struct that is passed back in `new_handler` (see the sketch after this list).
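To make the second point concrete, here is a rough sketch of the "OpenInfo" idea. All names here (`Action`, `Behaviour`, the shape of `Dial`) are hypothetical and do not match the actual rust-libp2p API:

```rust
use libp2p::{Multiaddr, PeerId};

struct Handler; // stand-in for an actual `ConnectionHandler`

/// Instead of emitting a pre-built handler, `Dial` carries opaque
/// user data that is handed back once the connection is established.
enum Action<OpenInfo> {
    Dial { address: Multiaddr, info: OpenInfo },
}

trait Behaviour {
    type OpenInfo;

    /// The single place where handlers are created, so connection
    /// limits can be enforced here and nowhere else.
    fn new_handler(&mut self, peer: PeerId, info: Self::OpenInfo) -> Handler;
}
```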
I see the following benefits of this design:
- Good use of the type system: We need a `ConnectionHandler` for each connection. If we fail to construct one, we can never actually establish the connection.
- Synchronous decision making: Async decision making would require some kind of timeout, so we would potentially hold every connection for some amount of time before we can consider it to be established.
- Small API surface: Very little change to the existing API of `NetworkBehaviour`.
cc @divagant-martian @dignifiedquire @rkuhn @mxinden
The proposal would work very well for us and save a lot of time spent on disconnecting peers we were not interested in having in the first place :+1:
- If somehow possible, I'd like to get rid of `IntoConnectionHandler` and delay calling `new_handler` until we have established the connection. This would allow us to directly make `new_handler` fallible, which would make this feature much easier to use. Additionally, it means we won't run into race conditions with simple limit connection managers as `new_handler` is called with `&mut self` and thus will be called sequentially for each connection.
I am in favor of getting rid of IntoConnectionHandler as in my eyes it adds a lot of complexity.
`NetworkBehaviourAction::Dial` currently also emits an entire handler. To make sure connection limits are enforced in a single place, I'd replace this mechanism with an "OpenInfo" style where `Dial` can carry some "OpenInfo" struct that is passed back in `new_handler`.
It would be wonderful to have only a single mechanism for creating a new handler.
Synchronous decision making: Async decision making would require some kind of timeout so we would potentially hold every connection for some amount of time before we can consider it to be established.
In my eyes this is the way to go. We also achieved consensus on this in https://github.com/libp2p/rust-libp2p/discussions/2118#discussioncomment-3134752.
I've put up a draft PR here that aims to build the foundation for the above idea.
The proposal would work very well for us and save a lot of time spent on disconnecting peers we were not interested in having in the first place +1
A note on this: From the other party's perspective, it is still a regular disconnect, it just happens automatically within the Swarm.
We do still need a mechanism for a NetworkBehaviour to allow or deny pending inbound or outbound connections.
(In https://github.com/libp2p/rust-libp2p/pull/2828 that happens via new methods on NetworkBehaviour.)
@mxinden wouldn't it be enough to make the new handler function fallible? As I understand it, this is part of the proposal. Am I missing something?
Unless I am missing something, the only thing not possible with my proposal is blocking a new incoming connection before the upgrade process is finished.
For any policy that purely operates on the number of connections, these upgrades would be wasted (plus they consume resources).
We could retain a prototype-based system where creating a handler is a two-step process for inbound connections:
- Replace `new_inbound_handler` with `on_new_pending_connection` that returns a `Result`.
- The `Ok` of said result is a handler prototype.
- The handler prototype has a fallible conversion into the actual handler.
The first failure point can be used for pending inbound connections before the upgrade. The second failure point can be used for established connections and policies like banning by peer ID.
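A sketch of what the two failure points could look like. The names (`on_new_pending_connection`, `Denied`, `Prototype`) are hypothetical; none of them exist in rust-libp2p as-is:

```rust
use libp2p::{Multiaddr, PeerId};

struct Denied;    // stand-in for a "connection denied" error
struct Handler;   // stand-in for an actual `ConnectionHandler`
struct Prototype; // handler prototype returned on success

struct Behaviour {
    connections: usize,
    limit: usize,
}

impl Behaviour {
    /// First failure point: before the upgrade, only the remote address
    /// is known. Counting/resource policies go here, avoiding a wasted
    /// upgrade for connections that would be dropped anyway.
    fn on_new_pending_connection(&mut self, _remote: &Multiaddr) -> Result<Prototype, Denied> {
        if self.connections >= self.limit {
            return Err(Denied);
        }
        Ok(Prototype)
    }
}

impl Prototype {
    /// Second failure point: after the upgrade, the `PeerId` is known.
    /// Data-driven policies such as ban lists go here.
    fn into_handler(self, _peer: &PeerId) -> Result<Handler, Denied> {
        Ok(Handler)
    }
}
```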
Right, that makes sense and would tackle both cases. Would that be reasonable to add to your current work @thomaseizinger or do you think it should be an effort for a second iteration?
The second failure point can be used for established connections and policies like banning by peer ID.
But that second failure point needs some synchronization mechanism back to a single point of truth, likely living in the `NetworkBehaviour`, i.e. some central place that e.g. counts the number of inbound connections.
Removing the IntoConnectionHandler indirection, adding a NetworkBehaviour::on_pending_connection and returning a ConnectionHandler via NetworkBehaviour::new_handler would resolve the need for a synchronization mechanism.
The second failure point can be used for established connections and policies like banning by peer ID.
But that second failure point needs some synchronization mechanism back to a single point of truth, likely living in the `NetworkBehaviour`, i.e. some central place that e.g. counts the number of inbound connections.
Counting policies would use the first failure point, data driven policies like allow/ban lists would use the latter one.
List-based policies don't need synchronisation.
Removing the `IntoConnectionHandler` indirection, adding a `NetworkBehaviour::on_pending_connection` and returning a `ConnectionHandler` via `NetworkBehaviour::new_handler` would resolve the need for a synchronization mechanism.
The argument for synchronisation with NB stems from it having more knowledge, right? But a pending connection doesn't provide any information apart from the incoming multiaddress. This makes me think that policies that need to take pending connections into account are likely going to be resource-based, i.e. RAM usage, number of connections, etc.
Pushing these policies into NB doesn't feel right.
- NBs are composed but resource-based policies are likely global.
- The owner of the `Swarm` likely knows better how constrained the resources are.
This makes me think that policing pending connections should either happen in the Swarm or some other, non-composed module that a Swarm can be configured with. Perhaps it should even happen on a Transport level as a Transport wrapper?
I've put up a draft PR here that aims to build the foundation for the above idea.
Removing IntoConnectionHandler is proving difficult (see PR comments).
I am tempted to build a version of this that doesn't remove IntoConnectionHandler but makes the conversion fallible instead.
@divagant-martian Would that be sufficient for your usecase?
A connection management behaviour would have to create ConnectionHandler prototypes that have the necessary data embedded to make a decision about the incoming connection.
For example, if you have a list of banned peers, you'd copy this list into the prototype upon constructing it in new_handler.
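In code, this could look roughly as follows, continuing the hypothetical names from the earlier sketches:

```rust
use libp2p::PeerId;
use std::collections::HashSet;

struct ConnectionManager {
    banned: HashSet<PeerId>,
}

struct Prototype {
    banned: HashSet<PeerId>,
}

impl ConnectionManager {
    /// `new_handler` does not know the remote `PeerId` yet, so it embeds
    /// a snapshot of the ban list; the prototype's fallible conversion
    /// (see the earlier sketches) makes the actual decision.
    fn new_handler(&mut self) -> Prototype {
        Prototype { banned: self.banned.clone() }
    }
}
```

Note that the prototype only sees the ban list as of handler creation; updates made while the connection is still upgrading are not reflected.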
I put up a draft PR here for what this could look like: https://github.com/libp2p/rust-libp2p/pull/3118
I now have a fully working version of the current codebase without the IntoConnectionHandler abstraction: https://github.com/libp2p/rust-libp2p/pull/3099
I think this can be merged as an independent improvement. We can then later decide how we want to manage pending connections.
We had multiple requests for rust-libp2p to be able to prevent dialing any private IP addresses. Thus far we have pointed them to the following Transport wrapper:
https://github.com/mxinden/kademlia-exporter/blob/master/src/exporter/client/global_only.rs
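The core of such a wrapper is an address check before dialing. A minimal sketch of that check (the `is_global` helper is hypothetical, and the real wrapper also has to implement the remaining `Transport` methods and delegate to the inner transport):

```rust
use libp2p::multiaddr::{Multiaddr, Protocol};

/// Returns `false` for addresses we consider non-global. This is a
/// simplified heuristic, not an exhaustive list of special-purpose ranges.
fn is_global(addr: &Multiaddr) -> bool {
    match addr.iter().next() {
        Some(Protocol::Ip4(ip)) => {
            !(ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified())
        }
        Some(Protocol::Ip6(ip)) => {
            let seg = ip.segments();
            let unique_local = (seg[0] & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (seg[0] & 0xffc0) == 0xfe80; // fe80::/10
            !(ip.is_loopback() || ip.is_unspecified() || unique_local || link_local)
        }
        _ => true, // e.g. /dns addresses resolve later; pass them through here
    }
}
```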
Somehow I was operating with the assumption that we can now use the generic NetworkBehaviour connection management system to implement the above. More specifically one would return an Error from NetworkBehaviour::handle_pending_outbound_connection.
Unfortunately this is not quite possible. Say that a NetworkBehaviour emits a dial request with one private and one global IP address. The connection-management NetworkBehaviour can either deny all or allow all, but not deny the private and all the global via NetworkBehaviour::handle_pending_outbound_connection.
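Illustrated with the signature discussed here (additional addresses on success, a denial error otherwise), reusing the `is_global` helper sketched above; `Denied` stands in for the real error type:

```rust
use libp2p::Multiaddr;

struct Denied; // stand-in for the swarm's "connection denied" error

/// All-or-nothing: we can deny the whole dial attempt or let every
/// address through (optionally appending more), but we cannot filter
/// out just the private addresses.
fn handle_pending_outbound_connection(
    addresses: &[Multiaddr],
) -> Result<Vec<Multiaddr>, Denied> {
    if addresses.iter().any(|a| !is_global(a)) {
        return Err(Denied); // also kills the dial to the global address
    }
    Ok(Vec::new()) // no additional addresses to add
}
```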
Should we support the above use-case, e.g. by returning both a set of additional addresses and a set of blocked addresses from NetworkBehaviour::handle_pending_outbound_connection? Or should we just continue recommending the above Transport wrapper, thus blocking private IP addresses at the Transport layer instead?
I am leaning towards the latter, i.e. not extending NetworkBehaviour::handle_pending_outbound_connection but instead providing a GlobalIpOnly Transport implementation in the rust-libp2p monorepo.
@thomaseizinger any thoughts on this?
That is something that would be nice to support. It is a bit tricky though.
Even if we pass a mutable reference of the existing addresses into the behaviour, you can never guarantee that another behaviour doesn't add more non-global addresses.
How effective such global-IP address management is depends on how you compose your behaviours, which is quite the foot-gun.
In this case, refusing the dial on the Transport layer is much more effective.
Maybe we've mixed too many concerns here by merging addresses_of_peer into handle_pending_outbound_connection. If we were to undo that, we could gather all addresses beforehand and then pass a custom collection type to handle_pending_outbound_connection that only allows removal, not addition, of addresses (a `&mut [Multiaddr]` would be too permissive).
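A minimal sketch of such a removal-only collection; `DialAddresses` is a hypothetical name, not an existing rust-libp2p type:

```rust
use libp2p::Multiaddr;

pub struct DialAddresses(Vec<Multiaddr>);

impl DialAddresses {
    pub fn new(addresses: Vec<Multiaddr>) -> Self {
        Self(addresses)
    }

    /// Inspect the candidate addresses.
    pub fn iter(&self) -> impl Iterator<Item = &Multiaddr> {
        self.0.iter()
    }

    /// Keep only the addresses matching the predicate. There is
    /// deliberately no method for adding new addresses.
    pub fn retain(&mut self, f: impl FnMut(&Multiaddr) -> bool) {
        self.0.retain(f);
    }
}
```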
Agreed with the concerns above. I suggest continuing with the custom Transport implementation for now.