
Set a better default for `idle_connection_timeout`

Open thomaseizinger opened this issue 2 years ago • 12 comments

As discussed yesterday in https://github.com/libp2p/rust-libp2p/discussions/4905.

Needs:

  • Decide on a better default (!= 0 seconds)
  • Update of all examples
  • Some docs on the method explaining what it does and why
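For context, a sketch of where this setting lives (not compiled against a specific rust-libp2p version; transport and behaviour setup are elided and hypothetical): the timeout is configured on the swarm config, and the default under discussion is what applies when this call is omitted.

```rust
use std::time::Duration;

// Hedged sketch: `MyBehaviour` and the transport arguments are placeholders.
let swarm = libp2p::SwarmBuilder::with_new_identity()
    .with_tokio()
    .with_tcp(/* tcp config, security, muxer elided */)?
    .with_behaviour(|_| MyBehaviour::default())?
    // Connections with no active streams are closed after this duration.
    .with_swarm_config(|cfg| cfg.with_idle_connection_timeout(Duration::from_secs(10)))
    .build();
```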

thomaseizinger avatar Nov 22 '23 03:11 thomaseizinger

A default of 10 seconds may be enough to accommodate the majority of protocols, but setting it to 30 seconds may be the best balance in my opinion.

dariusc93 avatar Nov 22 '23 05:11 dariusc93

30 seconds seems like a lot to me. The main use case that we talked about yesterday was being able to initiate protocols with peers from outside the Swarm based on events like Kademlia queries. Those would usually happen very fast, and a default timeout of 5 seconds would easily accommodate that.

Anything longer feels too application-specific but I don't have any data to back that up.

thomaseizinger avatar Nov 22 '23 06:11 thomaseizinger

An alternative would be to entirely remove auto-closing and instead emit an event notifying the user that the connection is idle. Would that perhaps be useful?

thomaseizinger avatar Dec 01 '23 01:12 thomaseizinger

An alternative would be to entirely remove auto-closing and instead emit an event notifying the user that the connection is idle. Would that perhaps be useful?

That actually doesn't sound like a bad idea. Kind of curious to see how that would play out long term.

dariusc93 avatar Dec 01 '23 02:12 dariusc93

An alternative would be to entirely remove auto-closing and instead emit an event notifying the user that the connection is idle. Would that perhaps be useful?

That actually doesn't sound like a bad idea. Kind of curious to see how that would play out long term.

Typically, the problem with these kinds of approaches is that, by default, users will only handle the events they care about and _ => {} match the rest. So an event like ConnectionIsIdle will likely go unnoticed for many of them.

In this particular case though, that doesn't actually do any harm (until you run into connection limits, for example)! At that point, however, you are probably familiar enough with rust-libp2p to learn that this event exists, and you can take whatever action you prefer.

To make the simple case of a timeout easier, we could re-emit this event every X seconds and include how long the connection has already been idle. Then users wouldn't need to keep any state outside of the Swarm if they, e.g., wanted to close connections after they have been idle for, say, 10 seconds.
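The periodic re-emission described above could be sketched as follows. This is purely hypothetical: neither `IdleTracker` nor a `ConnectionIsIdle` event exists in rust-libp2p today; the sketch only shows how the Swarm could keep the state so the user doesn't have to.

```rust
use std::time::Duration;

/// Hypothetical sketch: track how long a connection has been idle and
/// decide when to surface a periodic "connection is idle" notification.
struct IdleTracker {
    idle_for: Duration,
    notify_every: Duration,
    next_notify: Duration,
}

impl IdleTracker {
    fn new(notify_every: Duration) -> Self {
        Self {
            idle_for: Duration::ZERO,
            notify_every,
            next_notify: notify_every,
        }
    }

    /// Called whenever a stream opens or data flows: the connection is busy again.
    fn mark_active(&mut self) {
        self.idle_for = Duration::ZERO;
        self.next_notify = self.notify_every;
    }

    /// Advance the clock. Returns `Some(total_idle)` each time a periodic
    /// notification is due, so a user wanting "close after 10 s idle" only
    /// has to compare the reported duration against their threshold.
    fn tick(&mut self, elapsed: Duration) -> Option<Duration> {
        self.idle_for += elapsed;
        if self.idle_for >= self.next_notify {
            self.next_notify += self.notify_every;
            Some(self.idle_for)
        } else {
            None
        }
    }
}
```

With `notify_every = 5s`, a connection idle for 6 seconds would have produced one notification reporting 6 seconds of idleness; any activity resets the counter.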

Curious to hear what @mxinden thinks.

thomaseizinger avatar Dec 01 '23 04:12 thomaseizinger

I find it quite difficult to gauge what would be considered a bigger annoyance / "rust-libp2p"-ism:

  • Auto-closing connections when they are idle (regardless of how long the timeout is)
  • Having to close idle connections yourself

From a library PoV, the latter is "better" because it gives the user more power over the policy on when to close connections. At the same time, not closing them at all might be a weird default?

We could add ConnectionEvent::StateChange { idle: bool }, which would allow us to add a module that can auto-close connections once they have been idle for X duration.
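A minimal sketch of what such a module could look like, assuming the StateChange event shape proposed above (none of these types exist in rust-libp2p; connection ids are simplified to integers):

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical event shape from the proposal above.
enum ConnectionEvent {
    StateChange { id: u64, idle: bool },
}

/// Sketch of a module that auto-closes connections once they have been
/// idle for a given duration: it consumes state-change events, and on
/// each tick returns the connection ids whose idle time exceeded the
/// threshold.
struct AutoCloser {
    threshold: Duration,
    idle_since: HashMap<u64, Duration>, // connection id -> accumulated idle time
}

impl AutoCloser {
    fn new(threshold: Duration) -> Self {
        Self {
            threshold,
            idle_since: HashMap::new(),
        }
    }

    fn on_event(&mut self, event: ConnectionEvent) {
        match event {
            // Connection went idle: start (or keep) its idle timer.
            ConnectionEvent::StateChange { id, idle: true } => {
                self.idle_since.entry(id).or_insert(Duration::ZERO);
            }
            // Connection became busy again: forget its timer.
            ConnectionEvent::StateChange { id, idle: false } => {
                self.idle_since.remove(&id);
            }
        }
    }

    /// Advance all idle timers; connections past the threshold are
    /// returned for closing and no longer tracked.
    fn tick(&mut self, elapsed: Duration) -> Vec<u64> {
        for idle in self.idle_since.values_mut() {
            *idle += elapsed;
        }
        let expired: Vec<u64> = self
            .idle_since
            .iter()
            .filter(|(_, idle)| **idle >= self.threshold)
            .map(|(id, _)| *id)
            .collect();
        for id in &expired {
            self.idle_since.remove(id);
        }
        expired
    }
}
```

The design point is that the policy (threshold, or no closing at all) lives entirely outside the Swarm, which only has to report state changes.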

thomaseizinger avatar Dec 06 '23 22:12 thomaseizinger

I'm curious what amount of resources open connections use. Are they cheap or expensive? I would think that would influence the decision a bit?

DougAnderson444 avatar Dec 06 '23 22:12 DougAnderson444

On @mxinden's kademlia-exporter node, a single connection uses about 300 KB. I'd expect an idle connection, even with more protocols enabled, not to consume much more than that, because the presence of buffered data would mean the connection is not idle. Hence, all buffers should be empty on an idle connection, unless we have a memory leak somewhere.

In addition to that, an open connection consumes a file descriptor (or equivalent depending on OS).

thomaseizinger avatar Dec 06 '23 23:12 thomaseizinger

A couple of things we observed on our large-scale network (over 4k nodes) using kad + gossipsub might relate to this idle-connection handling:

1. Connections that are opened (via get_closest_peer, put/get record operations, or by gossipsub) remain alive and do not get auto-closed, even with idle_connection_timeout set up via with_idle_connection_timeout. My guess: the connections were opened for a network query like get_closest, but because gossipsub periodically contacts connected peers, those opened connections remain non-idle even after the original network query completes.

2. Keeping a connection alive does not cost much memory (around 50-150 KB per connection). But combined with point 1, on a large-scale testnet we observed nodes holding over 3k connections and consuming 400 MB.

3. We had to deploy a mechanism to manually close outdated connections (ones that have been established for a while) to peers not in the k-buckets. This reduces memory usage a lot, as the number of live connections is then capped by the number of peers in the routing table.

4. Gossipsub turns out to rely on live connections (instead of local peer knowledge plus establishing a connection on request) to ensure a message reaches all subscribers. We noticed no loss of topic messages when nodes keep all connections alive, but we lose some topic messages when closing some non-routing-table connections.
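The manual pruning from point 3 could be sketched roughly as below. All names here are hypothetical (this is not rust-libp2p API); peer and connection ids are simplified, and the actual closing would go through the Swarm.

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of the pruning policy: select connections that are older than
/// `max_age_secs` and whose peer is not in the Kademlia routing table,
/// so the number of long-lived connections is capped by the number of
/// peers in the k-buckets.
fn connections_to_prune(
    established_at: &HashMap<u64, u64>, // connection id -> unix seconds when opened
    peer_of: &HashMap<u64, &str>,       // connection id -> peer id
    routing_table: &HashSet<&str>,      // peers currently in the k-buckets
    now: u64,
    max_age_secs: u64,
) -> Vec<u64> {
    let mut prune: Vec<u64> = established_at
        .iter()
        .filter(|(id, opened)| {
            now.saturating_sub(**opened) > max_age_secs
                && !routing_table.contains(peer_of[*id])
        })
        .map(|(id, _)| *id)
        .collect();
    prune.sort(); // deterministic order for the caller
    prune
}
```

Note point 4 above: closing such connections can still cost gossipsub message delivery, so the age threshold is a trade-off, not a free win.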

maqi avatar Dec 12 '23 10:12 maqi

1. Connections that are opened (via get_closest_peer, put/get record operations, or by gossipsub) remain alive and do not get auto-closed, even with idle_connection_timeout set up via with_idle_connection_timeout. My guess: the connections were opened for a network query like get_closest, but because gossipsub periodically contacts connected peers, those opened connections remain non-idle even after the original network query completes.

To my knowledge, if the connected peers are subscribed to the same gossipsub topic on the node, their connection will be kept alive until they are no longer a part of the mesh (e.g., unsubscribed from the topic). Not sure if this is applicable to your use case though.

2. Keeping a connection alive does not cost much memory (around 50-150 KB per connection). But combined with point 1, on a large-scale testnet we observed nodes holding over 3k connections and consuming 400 MB.

Just being curious, is this with a TCP-only or a TCP + QUIC setup?

dariusc93 avatar Dec 12 '23 11:12 dariusc93

if the connected peers are subscribed to the same gossipsub topic on the node

No, most of the nodes are not subscribed to any topic. Only 2% of the nodes are subscribed to the same topic, i.e. with 4k nodes, only 80 nodes are within gossipsub. However, as I understand it, the gossipsub protocol does mention that it will broadcast to non-subscribed connected peers.

is this with just TCP or TCP + Quic setup?

just TCP

maqi avatar Dec 12 '23 12:12 maqi

An active gossipsub stream will keep the connection alive even if the nodes aren't subscribed to the same topics. There isn't really much we can do about that. The best solution is to (as you are already doing), close those connections manually.

thomaseizinger avatar Jan 15 '24 01:01 thomaseizinger