rust-libp2p
rust-libp2p copied to clipboard
AutoNAT: multiple opened connection
It seems that AutoNAT opens peer connections as part of its probing loop but does not close them again, so after some time connected peers have dozens of open connections between them. I only have a very rudimentary understanding of the implications but it doesn't look quite right :-)
From what I understand, connections don't need to be explicitly closed.
Instead, a connection is generally only kept alive while the handler of at least one network behaviour reports KeepAlive::True
(ref). The AutoNAT protocol internally wraps the request-response
protocol: a probe is an outbound request, and the dial-response from the server is the response. After the server sent the response, the handler in the request-response
protocol should not be keeping the connection alive anymore (only for 10s, as this is its default connection timeout), hence it should close. But in case that you are using any other network behaviour protocols in combination with AutoNAT, it could be that they keep the connection alive?
@hamamo Could you share a bit more information about the behaviours that you are using?
Or maybe I am missing something here, so please correct if that is the case.
@elenaf9 I'm using ping
to keep connections alive, that may be the cause of the observed behavior. Since my application relies on timely distribution of information via gossipsub
, I'm trying to keep the nodes connected most of the time.
This may be something I change when I have a better sync mechanism for historical data, and nodes can catch up when they reconnect, but I'd rather have autonat
and ping
behave well together :-)
same here
Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.
I am not too familiar with the details of gossipsub, but doesn't the gossipsub protocol itself already takes care that you stay connected with the peer in a mesh? What's the benefit of also using the ping for keep-alive?
https://github.com/libp2p/rust-libp2p/blob/eeb3504d5f5ce783b7c7333afe3814aed2255746/protocols/gossipsub/src/handler.rs#L271-L275
But independent of that, I still agree that autonat
and ping
should behave well together. What I am not so sure about is whether this is the responsibility of ping
, or of autonat
.
The whole logic around keep-alive follows more the idea of "every behaviour protocol decides for themself if they need the substream or not". Afaik none of the protocols explicitly closes a connection? Instead, even after the protocol that initiated the connection doesn't need that connection anymore, it is kept open "in case that any other protocol needs it". And imho network behaviours should not rely on other logic to explicitly close connections again. It could also be the case that the user frequently calls Swarm::dial
for whatever reason, in which case we would run into the same problem.
I think ping is really the only protocol that just blindly keeps every substream alive. And the description for its keep-alive actually states that it keeps only the connection alive, not every substream. So what I would propose instead is to change Ping so that it only keeps at least one connection to a peer alive, but not every single one.
Overall, I am myself a bit undecided on this. If you still think that AutoNAT should explicitly clean up the connection that are opened in a dial-attempt I am fine with that as well, as I definitely see the reason for it. But my above point still stands. What do you think?
I'm using ping to keep connections alive, that may be the cause of the observed behavior.
Do I understand correctly that you are setting PingConfig::with_keep_alive(true)
@hamamo?
https://github.com/libp2p/rust-libp2p/blob/2ff9acee227bc8787fb41d657de5f174250e2588/protocols/ping/src/handler.rs#L52-L54
Since my application relies on timely distribution of information via gossipsub, I'm trying to keep the nodes connected most of the time.
Note that connections are kept alive, even without PingConfig::keep_alive(true)
, as long as they are being used. E.g., as @elenaf9 mentioned above, in the case of gossipsub that means that a connection is kept alive as long as the peer is part of the corresponding mesh.
@hamamo if you can provide a small reproducer, or point us to your code itself, that would be great.
I know we are not doing a great job documenting the connection lifecycle mechanism in libp2p. Let me know, I can extend the documentation by a paragraph or two in the next couple of days.
Today I tried setting keep_alive
in PingConfig
to false, but that causes the connection to be closed, and GossipSub events to be missed. So if a mesh is established but not permanently used it does not seem to keep connections alive, and GossipSub also doesn't attempt to re-dial peers known to be part of the mesh.
@mxinden: The code is at https://github.com/hamamo/reputation-net/tree/tokio (the master branch where I used async_std isn't updated yet). I know this is a big ball of spaghetti, didn't have the time yet to come up with a clean small example showing the issue.
So if a mesh is established but not permanently used it does not seem to keep connections alive
That is surprising to me. Once a peer joins a mesh, the in_mesh
flag on the corresponding ConnectionHandler
should be set to true
and thus the connection should be kept alive.
https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/handler.rs#L132-L134
Could you check whether a peer successfully joins a mesh but is still disconnected? Also could you post the logs (RUST_LOG=debug
) of a small reputation-net reproducing the issue @hamamo?
Just a wild guess, seeing logic in reputation_net/mod.rs
related to awaiting storage operations within the custom NetworkBehaviour
. Do these operations block the main libp2p task for longer period of time? If so, this might cause connections to time out. Let's see what the logs say.
https://github.com/hamamo/reputation-net/blob/tokio/src/reputation_net/mod.rs
We are reworking idle connection management, see https://github.com/libp2p/rust-libp2p/issues/4306.
I am closing this as stale.