Seeming lack of backpressure in gossipsub.publish
I built a gossipsub-based chat application and added some benchmarking to it: https://github.com/Frederik-Baetens/libp2p-tokiochat/tree/perf_test
Run with:

```
cargo run --release
```
Then, in a second terminal, run the command suggested by the first instance, but add `--release` and append `bench`, like so:

```
cargo run --release -- /ip4/172.20.10.3/tcp/40759 12D3KooWLmWsMNuHsNUqCgAbbv6d3g2QYZVev8KqLfjiqTtDeZtk bench
```
This shows an enormous number of messages being "stored" internally by libp2p. By the time the sender prints that it has sent its last message, only a small fraction of the messages have actually arrived. The lack of backpressure is on the sender's side: when I kill the sender after it reports publishing the last message, messages also stop arriving on the receiver's end.
Alternatively, are there any workarounds to add backpressure myself?
After having a brief look at the implementation, it seems that `publish` ends up calling this code:
```rust
/// Send a GossipsubRpc message to a peer. This will wrap the message in an arc if it
/// is not already an arc.
fn send_message(
    &mut self,
    peer_id: PeerId,
    message: rpc_proto::Rpc,
) -> Result<(), PublishError> {
    // If the message is oversized, try and fragment it. If it cannot be fragmented, log an
    // error and drop the message (all individual messages should be small enough to fit in the
    // max_transmit_size)
    let messages = self.fragment_message(message)?;

    for message in messages {
        self.events
            .push_back(NetworkBehaviourAction::NotifyHandler {
                peer_id,
                event: Arc::new(GossipsubHandlerIn::Message(message)),
                handler: NotifyHandler::Any,
            })
    }
    Ok(())
}
```
I think adding backpressure would either require publish to become an async method, or to become a blocking method. This would obviously be a breaking change, which may not be desirable. However, I do think that one of those two choices would be the cleanest way to handle this.
Alternatively, users could opt into backpressure themselves if libp2p provided a method to retrieve the length of the internal `events: VecDeque<GossipsubNetworkBehaviourAction>` queue.
Another alternative is to introduce an additional publish_async method, which provides backpressure.
Both of these alternatives feel like ugly solutions to me, though.
Perhaps the best solution is to start blocking only once the `events` `VecDeque` grows beyond a certain size? This would preserve the non-blocking, non-async characteristics that `publish` currently has, while offering backpressure when overloaded.
Thanks for raising this @Frederik-Baetens.
I think this is a general issue across the many NetworkBehaviour implementations. I would rephrase it as: "Synchronous methods on `NetworkBehaviour` implementations, used through `swarm.behaviour_mut()`, do not offer backpressure today."
I have been thinking a bunch about this lately, especially since it becomes relevant if we ever port a resource-hungry protocol like bitswap to rust-libp2p.
> adding backpressure would either require publish to become an async method,
In my eyes this would be the way to go. Though likely using poll-style methods instead of async methods, as one would otherwise lose access to the Swarm across await points.
A simple primitive that should be helpful here is a local future-aware `VecDeque`. Similar to an `mpsc::channel`, it would have a limit and take a `Context` when accepting new items. In the concrete case of libp2p-gossipsub we could use this custom `VecDeque` for `self.events`. `Gossipsub::publish` would take a `cx`, check whether there is enough space in `self.events`, and if not, register the `Waker` from the `cx` and return `Poll::Pending`. Once `Gossipsub::poll` takes an element out of `self.events`, the `Waker` is woken and the user is thus notified that it should try calling `Gossipsub::publish` again.
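For concreteness, here is a minimal sketch of such a future-aware bounded queue. All names (`BoundedQueue`, `try_push`) are made up for illustration and are not part of rust-libp2p:

```rust
use std::collections::VecDeque;
use std::task::{Context, Waker};

/// A future-aware bounded queue: `try_push` registers the caller's `Waker`
/// when the queue is full, and `pop` wakes that task once space frees up.
struct BoundedQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
    /// Waker of the task that last failed to push into a full queue.
    pending_push: Option<Waker>,
}

impl<T> BoundedQueue<T> {
    fn new(capacity: usize) -> Self {
        Self { items: VecDeque::new(), capacity, pending_push: None }
    }

    /// Called from e.g. a poll-style `publish(cx, ...)`. On a full queue the
    /// item is handed back to the caller, which should return `Poll::Pending`
    /// and retry once the registered `Waker` is woken.
    fn try_push(&mut self, cx: &mut Context<'_>, item: T) -> Result<(), T> {
        if self.items.len() < self.capacity {
            self.items.push_back(item);
            Ok(())
        } else {
            self.pending_push = Some(cx.waker().clone());
            Err(item)
        }
    }

    /// Called when draining events in `poll`; wakes a parked publisher now
    /// that there is room again.
    fn pop(&mut self) -> Option<T> {
        let item = self.items.pop_front();
        if item.is_some() {
            if let Some(waker) = self.pending_push.take() {
                waker.wake();
            }
        }
        item
    }
}
```

A poll-style `publish(cx, ...)` built on top of this would return `Poll::Pending` whenever `try_push` hands the item back, mirroring how an `mpsc::channel` applies backpressure to senders.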
@Frederik-Baetens let me know in case you would be interested to design and/or implement this for one of the many NetworkBehaviours. Happy to continue here or on a call.
As always, input by others is most certainly welcome as well.
I think the bounded channel in https://github.com/smol-rs/async-channel does exactly what you describe, with minimal dependencies. Maybe not fully optimized for this use case, but likely an easy way to start this work.
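For illustration, a minimal example of that behaviour with an arbitrary capacity of 2: `send` suspends while the channel is full, which is exactly the backpressure discussed above.

```rust
fn main() {
    futures::executor::block_on(async {
        // Capacity 2: the channel accepts two items without a consumer.
        let (tx, rx) = async_channel::bounded::<u32>(2);
        tx.send(1).await.unwrap();
        tx.send(2).await.unwrap();
        // A third `send` would now suspend; `recv` frees a slot first.
        assert_eq!(rx.recv().await.unwrap(), 1);
        tx.send(3).await.unwrap(); // completes immediately again
    });
}
```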
@dignifiedquire Using such a channel seems like a good idea. In my own project, I've already worked around libp2p's single-swarm-owner design by using `tokio::sync::mpsc` to send events into an object that has ownership of the swarm, like so:
```rust
use std::net::SocketAddr;
use libp2p::gossipsub::IdentTopic;
use tokio::sync::mpsc::Sender;

#[derive(Clone)]
pub struct SwarmHandle {
    sender: Sender<(MyMessageType, IdentTopic)>,
    topic: IdentTopic,
    ...
}

impl SwarmHandle {
    pub async fn publish(&self, prefix: String, address: SocketAddr, version: u32) {
        self.sender.send(...).await.unwrap();
    }
}
```
It kind of makes me wonder why using channels isn't the default way of interacting with the swarm. (I realize the above design isn't ideal, since some methods need to report errors and return other results to the user.)
In general, I think there's a lot of UX weirdness caused by this single-owner swarm/`NetworkBehaviour` design. I think #2474 is also harder to address because of this one-owner-at-a-time design. Is there a design doc somewhere explaining why this design was chosen? I understand that dealing with p2p networking is a bit more complicated than classic networking, but interacting with libp2p through this manual event-loop style still seems a bit foreign to me. I also think that encouraging the use of such a manual async event loop might put rust-libp2p users at risk of difficult-to-diagnose bugs.
@mxinden I'd love to help design this. I'm not ready to commit to implementing it, but I'll consider doing so once there's a rough design laid out. I'd love to hop on a call; I've been meaning to come to the community calls, but I always end up missing them. I've added the next one to my calendar!
> Is there a design doc somewhere explaining why this design was chosen?
Not aware of a design doc. The major benefit I see is flexibility. That is, you can embed it in any kind of application architecture. Both small ones with a single event loop and large ones using channels to coordinate.
> I've added the next one to my calendar!
:rocket:
I wonder if libp2p could provide a standard adapter that gives channel-based access, so that folks can get started more easily and only replace it with a manual loop when they actually need one.
We could promote something along the lines of the file sharing example to a proper wrapper around a Swarm, generic over the NetworkBehaviour in use.
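A rough sketch of what such a wrapper could look like; the closure-based `Command`, the `Handle`/`wrap` names, and the channel capacity are all assumptions for illustration rather than a settled design:

```rust
use futures::{FutureExt, StreamExt};
use libp2p::swarm::{NetworkBehaviour, Swarm};

/// Commands the handle can send into the event loop.
pub enum Command<B: NetworkBehaviour> {
    /// Run a closure against the behaviour from inside the event loop.
    WithBehaviour(Box<dyn FnOnce(&mut B) + Send>),
}

/// Cloneable, channel-backed front end; the bounded channel supplies the
/// backpressure that direct `behaviour_mut()` access lacks.
pub struct Handle<B: NetworkBehaviour> {
    commands: async_channel::Sender<Command<B>>,
}

impl<B: NetworkBehaviour> Clone for Handle<B> {
    fn clone(&self) -> Self {
        Handle { commands: self.commands.clone() }
    }
}

impl<B: NetworkBehaviour> Handle<B> {
    /// Suspends while the command channel is full.
    pub async fn with_behaviour(&self, f: impl FnOnce(&mut B) + Send + 'static) {
        let _ = self.commands.send(Command::WithBehaviour(Box::new(f))).await;
    }
}

/// Wraps the `Swarm` in a single event loop and returns the handle plus the
/// loop's future, letting the caller pick the executor to spawn it on.
pub fn wrap<B: NetworkBehaviour>(
    mut swarm: Swarm<B>,
) -> (Handle<B>, impl std::future::Future<Output = ()>) {
    let (tx, rx) = async_channel::bounded::<Command<B>>(16);
    let driver = async move {
        loop {
            futures::select! {
                cmd = rx.recv().fuse() => match cmd {
                    Ok(Command::WithBehaviour(f)) => f(swarm.behaviour_mut()),
                    Err(_) => break, // every Handle was dropped; shut down
                },
                event = swarm.select_next_some() => {
                    // Application-specific SwarmEvent handling goes here.
                    drop(event);
                }
            }
        }
    };
    (Handle { commands: tx }, driver)
}
```

The caller would spawn `driver` on an executor of its choice (e.g. `tokio::spawn`, provided the concrete behaviour is `Send`) and hand out clones of the `Handle`; since the command channel is bounded, `with_behaviour` suspends whenever the event loop falls behind.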
Closing in favor of https://github.com/libp2p/rust-libp2p/issues/3078.