swarm/: Support generic connection management
Description
Today a user can:
- Set a limit for incoming/outgoing/pending connections, global and per peer.
- Close a connection via `NetworkBehaviour`.
- Keep a connection alive via `ConnectionHandler::keep_alive`.

A user is restricted in (1) as they can not do anything beyond setting upper bound limits. E.g. they can not decide whether to accept or deny an incoming connection based on the IP, the `PeerId` or on their current CPU or memory utilization.
(Taken from https://github.com/libp2p/rust-libp2p/discussions/2118#discussioncomment-3134752.)
I see 3 ways we can enable users to do advanced connection management:
1. Make the `pool::ConnectionCounter` generic and allow users to provide their own.
   - Yet another moving piece.
   - When not `Box`ed, requires an additional generic parameter.
   - Can only do connection management. Not as powerful as the `NetworkBehaviour` trait and thus limited in its decision making.
2. Extend `NetworkBehaviour` with methods called to review a pending or established connection.
   - Yet another set of methods on `NetworkBehaviour`.
   - Has knowledge of the full state, thus allows powerful decision making.
   - Users are already familiar with it.
3. Emit a `SwarmEvent::ReviewConnection` requiring the user to manually accept each pending/established connection via e.g. `Swarm::accept()`.
   - Complicates the getting-started experience. User has to explicitly call a method to make progress. Not intuitive.
Motivation
Allows downstream users to do advanced connection management and can simplify existing implementations working around this today.
- https://github.com/sigp/lighthouse/tree/stable/beacon_node/lighthouse_network/src/peer_manager
- https://github.com/paritytech/substrate/blob/master/client/peerset/src/lib.rs
- https://github.com/n0-computer/iroh/tree/main/iroh-p2p
Are you planning to do it yourself in a pull request?
Yes
I am currently working on proposal (2), i.e. extending NetworkBehaviour. Hope to publish a first proof-of-concept this week.
How is (2) going to work in terms of composing behaviours? One NetworkBehaviour declining a connection means it will be denied? i.e. Need full agreement across all NetworkBehaviours?
How is (2) going to work in terms of composing behaviours? One `NetworkBehaviour` declining a connection means it will be denied? i.e. Need full agreement across all `NetworkBehaviour`s?
Correct.
Draft implementation of (2) is happening on https://github.com/libp2p/rust-libp2p/pull/2828. Still work in progress, but might help understand the idea behind (2). Happy to expand on any of this in written form. No need to dig through the code in case you don't want to.
Going to start on this once https://github.com/libp2p/rust-libp2p/pull/3011 is merged to avoid the churn.
I think we should design this with a concrete use-case in mind. @divagant-martian would you volunteer to oversee this from the lighthouse user perspective?
@thomaseizinger can you keep @divagant-martian in the loop?
@mxinden for sure! I'm also relatively familiar with substrate's peer management and can probably get in touch with the iroh guys to check what needs we have in common and where they differ.
NetworkBehaviour already has quite a few callbacks, although we are getting rid of some of them in #3011. Thus, I am a bit hesitant to continue in the direction of #2828.
So here is an alternative idea:
Make IntoConnectionHandler fallible
NetworkBehaviour::new_handler is already invoked for every incoming connection. If a NetworkBehaviour would like to impose a policy on which connections should be established, it can pass the required data to a prototype implementation of IntoConnectionHandler and make the decision as soon as into_handler is called with the PeerId etc.
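For illustration, here is a minimal sketch of that prototype pattern. The fallible `into_handler` is hypothetical (the actual `IntoConnectionHandler::into_handler` in rust-libp2p is infallible), and `Handler` stands in for a real `ConnectionHandler`:

```rust
use libp2p::PeerId;
use std::collections::HashSet;

struct Handler; // stand-in for an actual `ConnectionHandler`

/// Prototype created by `new_handler`, carrying whatever data the
/// behaviour needs to vet the connection once the peer is known.
struct Prototype {
    banned: HashSet<PeerId>,
}

impl Prototype {
    /// Hypothetical fallible conversion: returning an error here
    /// would deny the connection.
    fn into_handler(self, remote: &PeerId) -> Result<Handler, String> {
        if self.banned.contains(remote) {
            return Err(format!("{remote} is banned"));
        }
        Ok(Handler)
    }
}
```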
There are a few drawbacks with the current design but I think those are all fixable:
- If somehow possible, I'd like to get rid of `IntoConnectionHandler` and delay calling `new_handler` until we have established the connection. This would allow us to directly make `new_handler` fallible, which would make this feature much easier to use. Additionally, it means we won't run into race conditions with simple limit connection managers as `new_handler` is called with `&mut self` and thus will be called sequentially for each connection.
- `NetworkBehaviourAction::Dial` currently also emits an entire handler. To make sure connection limits are enforced in a single place, I'd replace this mechanism with an "OpenInfo" style where `Dial` can carry some "OpenInfo" struct that is passed back in `new_handler` (see the sketch after this list).
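To make the second point concrete, here is a rough sketch of the "OpenInfo" idea. All names here (`Action`, `Behaviour`, the shape of `Dial`) are hypothetical and do not match the actual rust-libp2p API:

```rust
use libp2p::{Multiaddr, PeerId};

struct Handler; // stand-in for an actual `ConnectionHandler`

/// Instead of emitting a pre-built handler, `Dial` carries opaque
/// user data that is handed back once the connection is established.
enum Action<OpenInfo> {
    Dial { address: Multiaddr, info: OpenInfo },
}

trait Behaviour {
    type OpenInfo;

    /// The single place where handlers are created, so connection
    /// limits can be enforced here and nowhere else.
    fn new_handler(&mut self, peer: PeerId, info: Self::OpenInfo) -> Handler;
}
```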
I see the following benefits of this design:
- Good use of the type system: We need a `ConnectionHandler` for each connection. If we fail to construct one, we can never actually establish the connection.
- Synchronous decision making: Async decision making would require some kind of timeout, so we would potentially hold every connection for some amount of time before we can consider it to be established.
- Small API surface: Very little change to the existing API of `NetworkBehaviour`.
cc @divagant-martian @dignifiedquire @rkuhn @mxinden
The proposal would work very well for us and save a lot of time spent on disconnecting peers we were not interested in having in the first place :+1:
- If somehow possible, I'd like to get rid of `IntoConnectionHandler` and delay calling `new_handler` until we have established the connection. This would allow us to directly make `new_handler` fallible, which would make this feature much easier to use. Additionally, it means we won't run into race conditions with simple limit connection managers as `new_handler` is called with `&mut self` and thus will be called sequentially for each connection.
I am in favor of getting rid of IntoConnectionHandler as in my eyes it adds a lot of complexity.
`NetworkBehaviourAction::Dial` currently also emits an entire handler. To make sure connection limits are enforced in a single place, I'd replace this mechanism with an "OpenInfo" style where `Dial` can carry some "OpenInfo" struct that is passed back in `new_handler`.
It would be wonderful to have only a single mechanism for creating a new handler.
Synchronous decision making: Async decision making would require some kind of timeout so we would potentially hold every connection for some amount of time before we can consider it to be established.
In my eyes this is the way to go. We also achieved consensus on this in https://github.com/libp2p/rust-libp2p/discussions/2118#discussioncomment-3134752.
I've put up a draft PR here that aims to build the foundation for the above idea.
The proposal would work very well for us and save a lot of time spent on disconnecting peers we were not interested in having in the first place +1
A note on this: From the other party's perspective, it is still a regular disconnect, it just happens automatically within the Swarm.
We do still need a mechanism for a NetworkBehaviour to allow or deny pending inbound or outbound connections.
(In https://github.com/libp2p/rust-libp2p/pull/2828 that happens via new methods on NetworkBehaviour.)
@mxinden wouldn't it be enough to make the new handler function fallible? As I understand it, this is part of the proposal. Am I missing something?
Unless I am missing something, the only thing not possible with my proposal is blocking a new incoming connection before the upgrade process is finished.
For any policy that purely operates on the number of connections, these upgrades would be wasted (plus they consume resources).
We could retain a prototype-based system where creating a handler is a two-step process for inbound connections:
- Replace `new_inbound_handler` with `on_new_pending_connection` that returns a `Result`.
- The `Ok` of said result is a handler prototype.
- The handler prototype has a fallible conversion into the actual handler.
The first failure point can be used for pending inbound connections before the upgrade. The second failure point can be used for established connections and policies like banning by peer ID.
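A sketch of what the two failure points could look like. The names (`on_new_pending_connection`, `Denied`, `Prototype`) are hypothetical; none of them exist in rust-libp2p as-is:

```rust
use libp2p::{Multiaddr, PeerId};

struct Denied;    // stand-in for a "connection denied" error
struct Handler;   // stand-in for an actual `ConnectionHandler`
struct Prototype; // handler prototype returned on success

struct Behaviour {
    connections: usize,
    limit: usize,
}

impl Behaviour {
    /// First failure point: before the upgrade, only the remote address
    /// is known. Counting/resource policies go here, avoiding a wasted
    /// upgrade for connections that would be dropped anyway.
    fn on_new_pending_connection(&mut self, _remote: &Multiaddr) -> Result<Prototype, Denied> {
        if self.connections >= self.limit {
            return Err(Denied);
        }
        Ok(Prototype)
    }
}

impl Prototype {
    /// Second failure point: after the upgrade, the `PeerId` is known.
    /// Data-driven policies such as ban lists go here.
    fn into_handler(self, _peer: &PeerId) -> Result<Handler, Denied> {
        Ok(Handler)
    }
}
```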
Right, that makes sense and would tackle both cases. Would that be reasonable to add to your current work @thomaseizinger or do you think it should be an effort for a second iteration?
The second failure point can be used for established connections and policies like banning by peer ID.
But that second failure point needs some synchronization mechanism back to a single point of truth, likely living in the `NetworkBehaviour`, i.e. some central place that e.g. counts the number of inbound connections.
Removing the IntoConnectionHandler indirection, adding a NetworkBehaviour::on_pending_connection and returning a ConnectionHandler via NetworkBehaviour::new_handler would resolve the need for a synchronization mechanism.
The second failure point can be used for established connections and policies like banning by peer ID.
But that second failure point needs some synchronization mechanism back to a single point of truth, likely living in the `NetworkBehaviour`, i.e. some central place that e.g. counts the number of inbound connections.
Counting policies would use the first failure point, data driven policies like allow/ban lists would use the latter one.
List-based policies don't need synchronisation.
Removing the `IntoConnectionHandler` indirection, adding a `NetworkBehaviour::on_pending_connection` and returning a `ConnectionHandler` via `NetworkBehaviour::new_handler` would resolve the need for a synchronization mechanism.
The argument for synchronisation with NB stems from it having more knowledge, right? But a pending connection doesn't provide any information apart from the incoming multiaddress. This makes me think that policies that need to take pending connections into account are likely going to be resource-based, i.e. RAM usage, number of connections, etc.
Pushing these policies into NB doesn't feel right.
- NBs are composed but resource-based policies are likely global.
- The owner of the `Swarm` likely knows better how constrained the resources are.
This makes me think that policing pending connections should either happen in the Swarm or some other, non-composed module that a Swarm can be configured with. Perhaps it should even happen on a Transport level as a Transport wrapper?
I've put up a draft PR here that aims to build the foundation for the above idea.
Removing IntoConnectionHandler is proving difficult (see PR comments).
I am tempted to build a version of this that doesn't remove IntoConnectionHandler but makes the conversion fallible instead.
@divagant-martian Would that be sufficient for your usecase?
A connection management behaviour would have to create ConnectionHandler prototypes that have the necessary data embedded to make a decision about the incoming connection.
For example, if you have a list of banned peers, you'd copy this list into the prototype upon constructing it in new_handler.
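In code, this could look roughly as follows, continuing the hypothetical names from the earlier sketches:

```rust
use libp2p::PeerId;
use std::collections::HashSet;

struct ConnectionManager {
    banned: HashSet<PeerId>,
}

struct Prototype {
    banned: HashSet<PeerId>,
}

impl ConnectionManager {
    /// `new_handler` does not know the remote `PeerId` yet, so it embeds
    /// a snapshot of the ban list; the prototype's fallible conversion
    /// (see the earlier sketches) makes the actual decision.
    fn new_handler(&mut self) -> Prototype {
        Prototype { banned: self.banned.clone() }
    }
}
```

Note that the prototype only sees the ban list as of handler creation; updates made while the connection is still upgrading are not reflected.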
I put up a draft PR here for what this could look like: https://github.com/libp2p/rust-libp2p/pull/3118
I now have a fully working version of the current codebase without the IntoConnectionHandler abstraction: https://github.com/libp2p/rust-libp2p/pull/3099
I think this can be merged as an independent improvement. We can then later decide how we want to manage pending connections.
We had multiple requests for rust-libp2p to be able to prevent dialing any private IP addresses. Thus far we have pointed them to the following Transport wrapper:
https://github.com/mxinden/kademlia-exporter/blob/master/src/exporter/client/global_only.rs
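The core of such a wrapper is an address check before dialing. A minimal sketch of that check (the `is_global` helper is hypothetical, and the real wrapper also has to implement the remaining `Transport` methods and delegate to the inner transport):

```rust
use libp2p::multiaddr::{Multiaddr, Protocol};

/// Returns `false` for addresses we consider non-global. This is a
/// simplified heuristic, not an exhaustive list of special-purpose ranges.
fn is_global(addr: &Multiaddr) -> bool {
    match addr.iter().next() {
        Some(Protocol::Ip4(ip)) => {
            !(ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified())
        }
        Some(Protocol::Ip6(ip)) => {
            let seg = ip.segments();
            let unique_local = (seg[0] & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (seg[0] & 0xffc0) == 0xfe80; // fe80::/10
            !(ip.is_loopback() || ip.is_unspecified() || unique_local || link_local)
        }
        _ => true, // e.g. /dns addresses resolve later; pass them through here
    }
}
```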
Somehow I was operating with the assumption that we can now use the generic NetworkBehaviour connection management system to implement the above. More specifically one would return an Error from NetworkBehaviour::handle_pending_outbound_connection.
Unfortunately this is not quite possible. Say that a NetworkBehaviour emits a dial request with one private and one global IP address. The connection-management NetworkBehaviour can either deny all or allow all, but not deny the private and all the global via NetworkBehaviour::handle_pending_outbound_connection.
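Illustrated with the signature discussed here (additional addresses on success, a denial error otherwise), reusing the `is_global` helper sketched above; `Denied` stands in for the real error type:

```rust
use libp2p::Multiaddr;

struct Denied; // stand-in for the swarm's "connection denied" error

/// All-or-nothing: we can deny the whole dial attempt or let every
/// address through (optionally appending more), but we cannot filter
/// out just the private addresses.
fn handle_pending_outbound_connection(
    addresses: &[Multiaddr],
) -> Result<Vec<Multiaddr>, Denied> {
    if addresses.iter().any(|a| !is_global(a)) {
        return Err(Denied); // also kills the dial to the global address
    }
    Ok(Vec::new()) // no additional addresses to add
}
```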
Should we support the above use-case, e.g. by returning both a set of additional addresses and a set of blocked addresses from NetworkBehaviour::handle_pending_outbound_connection? Or should we just continue recommending the above Transport wrapper, thus blocking private IP addresses at the Transport layer instead?
I am leaning towards the latter, i.e. not extending NetworkBehaviour::handle_pending_outbound_connection but instead providing a GlobalIpOnly Transport implementation in the rust-libp2p monorepo.
@thomaseizinger any thoughts on this?
That is something that would be nice to support. It is a bit tricky though.
Even if we pass a mutable reference of the existing addresses into the behaviour, you can never guarantee that another behaviour doesn't add more non-global addresses.
How effective such global-IP address management is depends on how you compose your behaviours, which is quite the foot-gun.
In this case, refusing the dial on the Transport layer is much more effective.
Maybe we've mixed too many concerns here by merging addresses_of_peer into handle_pending_outbound_connection. If we were to undo that, we could gather all addresses beforehand and then pass a custom collection type to handle_pending_outbound_connection that only allows removal, not addition, of addresses (a `&mut [Multiaddr]` would be too permissive).
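A minimal sketch of such a removal-only collection; `DialAddresses` is a hypothetical name, not an existing rust-libp2p type:

```rust
use libp2p::Multiaddr;

pub struct DialAddresses(Vec<Multiaddr>);

impl DialAddresses {
    pub fn new(addresses: Vec<Multiaddr>) -> Self {
        Self(addresses)
    }

    /// Inspect the candidate addresses.
    pub fn iter(&self) -> impl Iterator<Item = &Multiaddr> {
        self.0.iter()
    }

    /// Keep only the addresses matching the predicate. There is
    /// deliberately no method for adding new addresses.
    pub fn retain(&mut self, f: impl FnMut(&Multiaddr) -> bool) {
        self.0.retain(f);
    }
}
```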
Agreed with the concerns above. I suggest continuing with the custom Transport implementation for now.