rust-libp2p
rust-libp2p copied to clipboard
protocols/gossipsub: Custom interval grows and is not performant
@mxinden - Your input here might also be helpful.
After a bit of digging, we've found that the gossipsub heartbeat can be set to the order of 1s but grow to the order of many minutes. This heartbeat is crucial for gossipsub to maintain its peers and meshes as well as bound its memory cache and gossip messages on time.
A many-minute heartbeat can cause terrible conditions on the network and significantly grow memory use of gossipsub users.
We've tracked the issue down to the custom implementation of Interval
in https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs.
As an example, we have three nodes running, in one (green one) I've replaced the custom Interval
with a tokio::time::Interval
. The heartbeat is supposed to be 700ms.
Halfway through these graphs I've switched the green node to tokio::time::Interval
. The heartbeat interval (time between heartbeats) then stays steady on 700ms whereas the other nodes have grown and continue to do so.
It's also useful to note that the heartbeat duration (time taken to do a heartbeat) itself is about 3-4x faster with tokio::time::Interval
than with the custom implementation we are currently using.
After a very quick skim, I think the issue of the growing interval lies in this calcualtion: https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs#L86
Where if we miss a interval the fires_at
and delay
can potentially become out of sync.
It is useful to note that the heartbeat interval grows in multiples of the desired interval.
It would be nice to get the current implementation to work correctly and be as performant as the tokio version. Failing this, potentially we add a feature flag or something so that tokio users can benefit from the performance gains.
cc @divagant-martian @pawanjay176
Checked our metrics again now that some hours have passed and gains are consistent
The step increases are what @AgeManning mentions as increases by multiples of the heartbeat duration
Also to note, the interval code refers to wasm-timer
as the original source. I wrote a small code that uses an Interval and adds a disruption, calculates de moving avg of the last 5 ticks. wasm-timer
maintains the expected rolling average, the current implementation grows. Is there a reason to avoid using wasm-timer
directly? (If any interest in the code arises I can upload it somewhere)
Oh, surprising. Thanks for the debugging work and the detailed report @AgeManning and @divagant-martian!
It would be nice to get the current implementation to work correctly and be as performant as the tokio version.
Agreed.
Is there a reason to avoid using wasm-timer directly?
If I am not mistaken we moved to support running as WASM in NodeJS. See https://github.com/libp2p/rust-libp2p/issues/2143 for details.
I am surprised that this causing issues now, given that https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs is a port of wasm-timer
interval.
I wonder whether we should port the recent version of tokio's interval implementation to rust-libp2p or even better contribute an Interval
implementation to futures-timer
upstream (see https://github.com/async-rs/futures-timer/issues/60).
@wngr @PhilippGackstatter @thomaseizinger any thoughts on this?
just for completeness, the experiment code https://github.com/divagant-martian/interval-example
port the recent version of tokio's interval implementation [...] or even better contribute an
Interval
implementation tofutures-timer
upstream
Can't comment much on the issue itself, but out of those two, I think the latter is preferable from a compatibility point of view. The futures-timer
implementation could hopefully use gloo_timers::future::IntervalStream which has good Wasm compatibility (including node.js).
I agree in contributing if possible to this in the appropriate repository. Having a custom Interval
implementation to give support to one target inside gossipsub puts the burden of maintenance on libp2p
and particularly in libp2p-gossipsub
, when it really is out of scope
The current implementation as it stands has some serious problems (even though it's taken us a while to track it down).
@divagant-martian has indicated that the wasm-timer
implementation doesn't grow the heartbeat interval and is therefore useable, however is not as performant as the tokio implementation.
For Lighthouse, we're probably going to a run a fork for the time being with the tokio implementation, and I suggest we use the wasm-timer
implementation for gossipsub until a solution is found that can also support WASM in NodeJS.
It's probably worth looking for any other places where this implementation of Interval
is used modifying it to wasm-timer
(if the custom implementation was made for other parts of the codebase).
https://github.com/libp2p/rust-libp2p/pull/2506 is merged. I will leave this issue open to track support for wasm-unknown-unknown
for libp2p-gossipsub
.
Next step
I think it is worth exploring contributing an Interval
implementation to futures-timer
(https://github.com/async-rs/futures-timer/issues/60) using gloo_timers
on WASM as suggested by @PhilippGackstatter above.
I don't have the capacity / priority to own this any time soon. In case someone wants to run libp2p-gossipsub
on wasm-unknown-unknown
and wants to pick this up, I am happy to help.
Backport
For Lighthouse, we're probably going to a run a fork for the time being with the tokio implementation, and I suggest we use the wasm-timer implementation for gossipsub until a solution is found that can also support WASM in NodeJS.
I am assuming, given that you already released a patch version of lighthouse (https://github.com/sigp/lighthouse/releases/tag/v2.1.3) that you don't need a backport of this patch? If you do need a backport, I am happy to release libp2p
v0.42.3
.
Yeah I dont think we need a backport at the moment. There's a few more small changes (metrics) to add to gossipsub, so happy to wait for next release cycle. :)
Should this issue be closed?
Thanks @divagant-martian.