rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

protocols/gossipsub: Custom interval grows and is not performant

Open AgeManning opened this issue 3 years ago • 9 comments

@mxinden - Your input here might also be helpful.

After a bit of digging, we've found that the gossipsub heartbeat can be set to the order of 1s but grow to the order of many minutes. This heartbeat is crucial for gossipsub to maintain its peers and meshes as well as bound its memory cache and gossip messages on time.

A many-minute heartbeat can cause terrible conditions on the network and significantly grow memory use of gossipsub users.

We've tracked the issue down to the custom implementation of Interval in https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs.

As an example, we have three nodes running, in one (green one) I've replaced the custom Interval with a tokio::time::Interval. The heartbeat is supposed to be 700ms.

allg

Halfway through these graphs I've switched the green node to tokio::time::Interval. The heartbeat interval (time between heartbeats) then stays steady on 700ms whereas the other nodes have grown and continue to do so. It's also useful to note that the heartbeat duration (time taken to do a heartbeat) itself is about 3-4x faster with tokio::time::Interval than with the custom implementation we are currently using.

After a very quick skim, I think the issue of the growing interval lies in this calcualtion: https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs#L86

Where if we miss a interval the fires_at and delay can potentially become out of sync.

It is useful to note that the heartbeat interval grows in multiples of the desired interval.

It would be nice to get the current implementation to work correctly and be as performant as the tokio version. Failing this, potentially we add a feature flag or something so that tokio users can benefit from the performance gains.

cc @divagant-martian @pawanjay176

AgeManning avatar Feb 09 '22 06:02 AgeManning

Checked our metrics again now that some hours have passed and gains are consistent image The step increases are what @AgeManning mentions as increases by multiples of the heartbeat duration

divagant-martian avatar Feb 09 '22 11:02 divagant-martian

Also to note, the interval code refers to wasm-timer as the original source. I wrote a small code that uses an Interval and adds a disruption, calculates de moving avg of the last 5 ticks. wasm-timer maintains the expected rolling average, the current implementation grows. Is there a reason to avoid using wasm-timer directly? (If any interest in the code arises I can upload it somewhere)

divagant-martian avatar Feb 09 '22 12:02 divagant-martian

Oh, surprising. Thanks for the debugging work and the detailed report @AgeManning and @divagant-martian!

It would be nice to get the current implementation to work correctly and be as performant as the tokio version.

Agreed.

Is there a reason to avoid using wasm-timer directly?

If I am not mistaken we moved to support running as WASM in NodeJS. See https://github.com/libp2p/rust-libp2p/issues/2143 for details.

I am surprised that this causing issues now, given that https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs is a port of wasm-timer interval.

I wonder whether we should port the recent version of tokio's interval implementation to rust-libp2p or even better contribute an Interval implementation to futures-timer upstream (see https://github.com/async-rs/futures-timer/issues/60).

@wngr @PhilippGackstatter @thomaseizinger any thoughts on this?

mxinden avatar Feb 09 '22 16:02 mxinden

just for completeness, the experiment code https://github.com/divagant-martian/interval-example

divagant-martian avatar Feb 09 '22 17:02 divagant-martian

port the recent version of tokio's interval implementation [...] or even better contribute an Interval implementation to futures-timer upstream

Can't comment much on the issue itself, but out of those two, I think the latter is preferable from a compatibility point of view. The futures-timer implementation could hopefully use gloo_timers::future::IntervalStream which has good Wasm compatibility (including node.js).

PhilippGackstatter avatar Feb 09 '22 17:02 PhilippGackstatter

I agree in contributing if possible to this in the appropriate repository. Having a custom Interval implementation to give support to one target inside gossipsub puts the burden of maintenance on libp2p and particularly in libp2p-gossipsub, when it really is out of scope

divagant-martian avatar Feb 09 '22 18:02 divagant-martian

The current implementation as it stands has some serious problems (even though it's taken us a while to track it down).

@divagant-martian has indicated that the wasm-timer implementation doesn't grow the heartbeat interval and is therefore useable, however is not as performant as the tokio implementation.

For Lighthouse, we're probably going to a run a fork for the time being with the tokio implementation, and I suggest we use the wasm-timer implementation for gossipsub until a solution is found that can also support WASM in NodeJS.

It's probably worth looking for any other places where this implementation of Interval is used modifying it to wasm-timer (if the custom implementation was made for other parts of the codebase).

AgeManning avatar Feb 09 '22 22:02 AgeManning

https://github.com/libp2p/rust-libp2p/pull/2506 is merged. I will leave this issue open to track support for wasm-unknown-unknown for libp2p-gossipsub.

Next step

I think it is worth exploring contributing an Interval implementation to futures-timer (https://github.com/async-rs/futures-timer/issues/60) using gloo_timers on WASM as suggested by @PhilippGackstatter above.

I don't have the capacity / priority to own this any time soon. In case someone wants to run libp2p-gossipsub on wasm-unknown-unknown and wants to pick this up, I am happy to help.

Backport

For Lighthouse, we're probably going to a run a fork for the time being with the tokio implementation, and I suggest we use the wasm-timer implementation for gossipsub until a solution is found that can also support WASM in NodeJS.

I am assuming, given that you already released a patch version of lighthouse (https://github.com/sigp/lighthouse/releases/tag/v2.1.3) that you don't need a backport of this patch? If you do need a backport, I am happy to release libp2p v0.42.3.

mxinden avatar Feb 14 '22 10:02 mxinden

Yeah I dont think we need a backport at the moment. There's a few more small changes (metrics) to add to gossipsub, so happy to wait for next release cycle. :)

AgeManning avatar Feb 14 '22 22:02 AgeManning

Should this issue be closed?

divagant-martian avatar May 26 '23 21:05 divagant-martian

Thanks @divagant-martian.

mxinden avatar May 28 '23 06:05 mxinden