rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

AutoNAT: multiple opened connection

Open hamamo opened this issue 3 years ago • 8 comments

It seems that AutoNAT opens peer connections as part of its probing loop but does not close them again, so after some time connected peers have dozens of open connections between them. I only have a very rudimentary understanding of the implications but it doesn't look quite right :-)

hamamo avatar Feb 14 '22 11:02 hamamo

From what I understand, connections don't need to be explicitly closed. Instead, a connection is generally only kept alive while the handler of at least one network behaviour reports KeepAlive::True (ref). The AutoNAT protocol internally wraps the request-response protocol: a probe is an outbound request, and the dial-response from the server is the response. After the server sent the response, the handler in the request-response protocol should not be keeping the connection alive anymore (only for 10s, as this is its default connection timeout), hence it should close. But in case that you are using any other network behaviour protocols in combination with AutoNAT, it could be that they keep the connection alive? @hamamo Could you share a bit more information about the behaviours that you are using? Or maybe I am missing something here, so please correct if that is the case.

elenaf9 avatar Feb 14 '22 12:02 elenaf9

@elenaf9 I'm using ping to keep connections alive, that may be the cause of the observed behavior. Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time. This may be something I change when I have a better sync mechanism for historical data, and nodes can catch up when they reconnect, but I'd rather have autonat and ping behave well together :-)

hamamo avatar Feb 18 '22 08:02 hamamo

same here

rkuhn avatar Feb 18 '22 08:02 rkuhn

Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.

I am not too familiar with the details of gossipsub, but doesn't the gossipsub protocol itself already takes care that you stay connected with the peer in a mesh? What's the benefit of also using the ping for keep-alive?

https://github.com/libp2p/rust-libp2p/blob/eeb3504d5f5ce783b7c7333afe3814aed2255746/protocols/gossipsub/src/handler.rs#L271-L275

But independent of that, I still agree that autonat and ping should behave well together. What I am not so sure about is whether this is the responsibility of ping, or of autonat.

The whole logic around keep-alive follows more the idea of "every behaviour protocol decides for themself if they need the substream or not". Afaik none of the protocols explicitly closes a connection? Instead, even after the protocol that initiated the connection doesn't need that connection anymore, it is kept open "in case that any other protocol needs it". And imho network behaviours should not rely on other logic to explicitly close connections again. It could also be the case that the user frequently calls Swarm::dial for whatever reason, in which case we would run into the same problem.

I think ping is really the only protocol that just blindly keeps every substream alive. And the description for its keep-alive actually states that it keeps only the connection alive, not every substream. So what I would propose instead is to change Ping so that it only keeps at least one connection to a peer alive, but not every single one.

Overall, I am myself a bit undecided on this. If you still think that AutoNAT should explicitly clean up the connection that are opened in a dial-attempt I am fine with that as well, as I definitely see the reason for it. But my above point still stands. What do you think?

elenaf9 avatar Feb 18 '22 10:02 elenaf9

I'm using ping to keep connections alive, that may be the cause of the observed behavior.

Do I understand correctly that you are setting PingConfig::with_keep_alive(true) @hamamo?

https://github.com/libp2p/rust-libp2p/blob/2ff9acee227bc8787fb41d657de5f174250e2588/protocols/ping/src/handler.rs#L52-L54

Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.

Note that connections are kept alive, even without PingConfig::keep_alive(true), as long as they are being used. E.g., as @elenaf9 mentioned above, in the case of gossipsub that means that a connection is kept alive as long as the peer is part of the corresponding mesh.

@hamamo if you can provide a small reproducer, or point us to your code itself, that would be great.

I know we are not doing a great job documenting the connection lifecycle mechanism in libp2p. Let me know, I can extend the documentation by a paragraph or two in the next couple of days.

mxinden avatar Feb 18 '22 18:02 mxinden

Today I tried setting keep_alive in PingConfig to false, but that causes the connection to be closed, and GossipSub events to be missed. So if a mesh is established but not permanently used it does not seem to keep connections alive, and GossipSub also doesn't attempt to re-dial peers known to be part of the mesh.

@mxinden: The code is at https://github.com/hamamo/reputation-net/tree/tokio (the master branch where I used async_std isn't updated yet). I know this is a big ball of spaghetti, didn't have the time yet to come up with a clean small example showing the issue.

hamamo avatar Feb 26 '22 10:02 hamamo

So if a mesh is established but not permanently used it does not seem to keep connections alive

That is surprising to me. Once a peer joins a mesh, the in_mesh flag on the corresponding ConnectionHandler should be set to true and thus the connection should be kept alive.

https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/handler.rs#L132-L134

Could you check whether a peer successfully joins a mesh but is still disconnected? Also could you post the logs (RUST_LOG=debug) of a small reputation-net reproducing the issue @hamamo?

mxinden avatar Mar 01 '22 10:03 mxinden

Just a wild guess, seeing logic in reputation_net/mod.rs related to awaiting storage operations within the custom NetworkBehaviour. Do these operations block the main libp2p task for longer period of time? If so, this might cause connections to time out. Let's see what the logs say.

https://github.com/hamamo/reputation-net/blob/tokio/src/reputation_net/mod.rs

mxinden avatar Mar 01 '22 10:03 mxinden

We are reworking idle connection management, see https://github.com/libp2p/rust-libp2p/issues/4306.

I am closing this as stale.

thomaseizinger avatar Sep 20 '23 02:09 thomaseizinger