Gossipsub backpressure seems to not work for forwarded messages

Open sirandreww-starkware opened this issue 4 months ago • 6 comments

Summary

When there is a lot of traffic in a gossipsub network, message prioritization favors published messages over forwarded messages, so published messages tend to succeed without issue. As a result, the producers of these messages get no indication of back-pressure, even though the entire network is essentially foregoing all gossip and all forwarding because those are not prioritized.

Expected behavior

When the network is flooded with messages, publishing a message should start to fail as soon as forwarding or gossip messages are being dropped.

Actual behavior

There is no indication of this on the publishing side.

Relevant log output

When 10 nodes flood the network with as many messages as backpressure allows, these are the combined logs I start receiving:

2025-08-05T15:52:33.265365Z  WARN libp2p_gossipsub::behaviour: Send Queue full. Could not send Forward { message: RawMessage { source: Some(PeerId("12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo")), data: [0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 7, 31, 0, 0, 0, 0, 0, 0, 0, 0, 24, 88, 233, 212, 41, 209, 248, 37, 0, 0, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sequence_number: Some(1754409146581427900), topic: TopicHash { hash: "9pwO9iZEpm2K+wAEFbckQaShVNAVcmbmTr+sZZ/iW0s=" }, signature: Some([175, 42, 27, 53, 9, 247, 193, 143, 23, 169, 98, 238, 26, 253, 156, 144, 52, 11, 81, 95, 232, 205, 205, 121, 118, 121, 26, 10, 38, 253, 33, 106, 10, 17, 84, 204, 227, 51, 160, 74, 118, 13, 201, 172, 204, 38, 105, 219, 112, 150, 177, 98, 119, 16, 154, 92, 9, 69, 10, 111, 43, 168, 97, 9]), key: None, validated: true }, timeout: Delay }. peer=12D3KooWDMCQbZZvLgHiHntG1KwcHoqHPAxL37KvhgibWqFtpqUY
2025-08-05T15:52:33.265379Z  WARN libp2p_gossipsub::behaviour: Send Queue full. Could not send Forward { message: RawMessage { source: Some(PeerId("12D3KooWH3uVF6wv47WnArKHk5p6cvgCJEb74UTmxztmQDc298L3")), data: [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 5, 243, 0, 0, 0, 0, 0, 0, 0, 0, 24, 88, 233, 211, 227, 232, 160, 112, 0, 0, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sequence_number: Some(1754409146581785812), topic: TopicHash { hash: "9pwO9iZEpm2K+wAEFbckQaShVNAVcmbmTr+sZZ/iW0s=" }, signature: Some([31, 75, 20, 186, 227, 209, 209, 161, 216, 121, 30, 105, 21, 224, 113, 212, 93, 199, 75, 206, 25, 155, 110, 146, 48, 214, 83, 177, 226, 167, 223, 169, 204, 69, 175, 94, 98, 208, 142, 17, 97, 105, 80, 111, 218, 52, 105, 115, 229, 239, 224, 218, 212, 184, 186, 126, 179, 81, 27, 108, 130, 37, 181, 10]), key: None, validated: true }, timeout: Delay }. peer=12D3KooWLJtG8fd2hkQzTn96MrLvThmnNQjTUFZwGEsLRz5EmSzc
2025-08-05T15:52:33.265413Z  WARN libp2p_gossipsub::behaviour: Send Queue full. Could not send Forward { message: RawMessage { source: Some(PeerId("12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo")), data: [0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 7, 157, 0, 0, 0, 0, 0, 0, 0, 0, 24, 88, 233, 212, 62, 155, 77, 73, 0, 0, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sequence_number: Some(1754409146581428026), topic: TopicHash { hash: "9pwO9iZEpm2K+wAEFbckQaShVNAVcmbmTr+sZZ/iW0s=" }, signature: Some([175, 142, 40, 145, 159, 138, 97, 56, 109, 240, 226, 35, 145, 37, 35, 157, 113, 171, 174, 229, 243, 199, 118, 220, 155, 181, 230, 8, 97, 145, 217, 227, 115, 240, 18, 208, 5, 182, 81, 71, 217, 38, 150, 78, 151, 235, 88, 88, 110, 36, 45, 166, 249, 226, 150, 240, 113, 198, 27, 98, 39, 255, 241, 9]), key: None, validated: true }, timeout: Delay }. peer=12D3KooWLJtG8fd2hkQzTn96MrLvThmnNQjTUFZwGEsLRz5EmSzc
2025-08-05T15:52:33.265487Z  WARN libp2p_gossipsub::behaviour: Send Queue full. Could not send Forward { message: RawMessage { source: Some(PeerId("12D3KooWLnZUpcaBwbz9uD1XsyyHnbXUrJRmxnsMiRnuCmvPix67")), data: [0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 9, 107, 0, 0, 0, 0, 0, 0, 0, 0, 24, 88, 233, 212, 69, 110, 159, 45, 0, 0, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sequence_number: Some(1754409146592128455), topic: TopicHash { hash: "9pwO9iZEpm2K+wAEFbckQaShVNAVcmbmTr+sZZ/iW0s=" }, signature: Some([167, 24, 77, 215, 227, 232, 194, 209, 180, 116, 225, 134, 97, 199, 48, 51, 104, 54, 10, 86, 236, 213, 183, 253, 51, 12, 125, 242, 217, 37, 115, 1, 210, 153, 189, 118, 167, 213, 114, 148, 76, 252, 232, 144, 49, 45, 182, 249, 255, 36, 70, 95, 38, 63, 43, 215, 173, 192, 16, 116, 168, 202, 49, 0]), key: None, validated: true }, timeout: Delay }. peer=12D3KooWLJtG8fd2hkQzTn96MrLvThmnNQjTUFZwGEsLRz5EmSzc
2025-08-05T15:52:33.265522Z  WARN libp2p_gossipsub::behaviour: Send Queue full. Could not send Forward { message: RawMessage { source: Some(PeerId("12D3KooWQYhTNQdmr3ArTeUHRYzFg94BKyTkoWBDWez9kSCVe2Xo")), data: [0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 7, 162, 0, 0, 0, 0, 0, 0, 0, 0, 24, 88, 233, 212, 64, 251, 67, 40, 0, 0, 0, 0, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], sequence_number: Some(1754409146581428031), topic: TopicHash { hash: "9pwO9iZEpm2K+wAEFbckQaShVNAVcmbmTr+sZZ/iW0s=" }, signature: Some([164, 204, 153, 224, 252, 7, 107, 162, 52, 99, 31, 142, 251, 50, 143, 54, 144, 93, 96, 67, 99, 56, 134, 103, 196, 151, 245, 203, 125, 124, 102, 224, 138, 164, 122, 235, 101, 77, 3, 25, 236, 229, 101, 180, 230, 19, 50, 85, 10, 57, 19, 93, 131, 223, 14, 122, 70, 221, 201, 75, 154, 153, 108, 2]), key: None, validated: true }, timeout: Delay }. peer=12D3KooWLJtG8fd2hkQzTn96MrLvThmnNQjTUFZwGEsLRz5EmSzc

Possible Solution

    #[allow(clippy::result_large_err)]
    pub(crate) fn send_message(&self, rpc: RpcOut) -> Result<(), RpcOut> {
        if let RpcOut::Publish { .. } = rpc {
            // Update number of publish message in queue.
            let len = self.len.load(Ordering::Relaxed);
            if len >= self.priority_cap {
                return Err(rpc);
            }
            self.len.store(len + 1, Ordering::Relaxed);
        }
        let sender = match rpc {
            RpcOut::Publish { .. }
            | RpcOut::Graft(_)
            | RpcOut::Prune(_)
            | RpcOut::Subscribe(_)
            | RpcOut::Unsubscribe(_) => &self.priority_sender,
            RpcOut::Forward { .. } | RpcOut::IHave(_) | RpcOut::IWant(_) | RpcOut::IDontWant(_) => {
                &self.non_priority_sender
            }
        };
        sender.try_send(rpc).map_err(|err| err.into_inner())
    }

Perhaps publish messages could be routed through the non_priority queue as well.
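
For illustration, the change could look roughly like this, mirroring the send_message above; only the match arms change, and the priority_cap bookkeeping for publishes would have to move or be dropped. This is a sketch, not a vetted patch:

    // Sketch of the suggested alternative routing. Assumes the same Sender
    // struct and RpcOut enum as the snippet above; not a vetted patch.
    pub(crate) fn send_message(&self, rpc: RpcOut) -> Result<(), RpcOut> {
        let sender = match rpc {
            // Control messages keep the priority lane.
            RpcOut::Graft(_)
            | RpcOut::Prune(_)
            | RpcOut::Subscribe(_)
            | RpcOut::Unsubscribe(_) => &self.priority_sender,
            // Publish now shares the bounded non-priority queue with Forward
            // and gossip, so a saturated queue also fails publishes.
            RpcOut::Publish { .. }
            | RpcOut::Forward { .. }
            | RpcOut::IHave(_)
            | RpcOut::IWant(_)
            | RpcOut::IDontWant(_) => &self.non_priority_sender,
        };
        sender.try_send(rpc).map_err(|err| err.into_inner())
    }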

Version

libp2p = "0.56.0"

Would you like to work on fixing this bug?

Yes

sirandreww-starkware avatar Aug 05 '25 15:08 sirandreww-starkware

Hi, the backpressure system in gossipsub was designed to prevent slow peers from effectively DoS'ing a node by letting its message queues grow without bound. For network health, message priority is important; that's why publishing has higher priority than forwarding. You can read more about the design in #5595 and #4914.

When the log you presented happens, the failed_messages counter increases for that peer and the Behaviour returns an Event::SlowPeer on the next poll, so you do get the feedback you're looking for.
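
For example, something like this surfaces that feedback in the application (a minimal sketch, assuming a swarm built directly over gossipsub::Behaviour on libp2p 0.56; adjust to however your behaviour is composed):

    // Sketch: watching for gossipsub's SlowPeer events while driving the swarm.
    use futures::StreamExt;
    use libp2p::{gossipsub, swarm::SwarmEvent, Swarm};

    async fn drive(mut swarm: Swarm<gossipsub::Behaviour>) {
        loop {
            match swarm.select_next_some().await {
                SwarmEvent::Behaviour(gossipsub::Event::SlowPeer { peer_id, .. }) => {
                    // Backpressure signal: messages to this peer were dropped or
                    // timed out. Feed this into application-level throttling.
                    tracing::warn!(%peer_id, "gossipsub reported a slow peer");
                }
                _ => {}
            }
        }
    }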

The scenario you described sounds like network-wide congestion rather than a backpressure issue. If "the entire network is basically foregoing all gossip and all forwarding," it suggests the network is handling more messages than it can process effectively.

A few options to consider:

  • Increase the queue limit to handle bursts better (see the config sketch after this list)
  • Implement application-level rate limiting based on SlowPeer events
  • Consider if the message volume can be reduced at the application level
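
For the first option, raising the queue size is a config change along these lines (a sketch; connection_handler_queue_len is the option name I believe was added with the backpressure work, so please check the ConfigBuilder docs of your version for the exact name and default):

    // Sketch: building a gossipsub config with a larger per-connection queue.
    use libp2p::gossipsub;

    fn larger_queue_config() -> gossipsub::Config {
        gossipsub::ConfigBuilder::default()
            // ASSUMPTION: option added with the backpressure work; a larger
            // queue absorbs bursts at the cost of more memory per connection.
            .connection_handler_queue_len(10_000)
            .build()
            .expect("valid gossipsub config")
    }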

Meanwhile, we are also experimenting with a new queue design here if you want to take a look.

jxs avatar Aug 09 '25 09:08 jxs

I see. Is there a way to change the priority for specific applications?

sirandreww-starkware avatar Aug 10 '25 06:08 sirandreww-starkware

no, but what would be the advantage of that?

jxs avatar Aug 10 '25 23:08 jxs

It seems to me as if, when the network has a lot of messages going through it, we prefer adding new messages over taking care of the messages already there. Though I admit I might not fully understand why a DoS is possible if this prioritization is not applied (is it because I can silence a node by flooding it with messages? But it's not like forwarding would be higher priority than publishing...).

sirandreww-starkware avatar Aug 11 '25 05:08 sirandreww-starkware

Can you elaborate on what you mean by taking care of the messages already there? Yes, the vector is exactly that: if a node's send queue grows boundlessly, it eventually OOMs and goes offline; you can read more about it here. And more about the prioritization here and here.

jxs avatar Aug 11 '25 10:08 jxs

@jxs Isn't the entire idea of back-pressure to avoid endlessly growing queues by failing sends when throughput is too high? Assume I'm in a network where all peers on a certain topic have an infinite supply of messages to send; there are now two options:

  1. Prioritize publish over forward and gossip: when node X publishes a message on the topic, the send succeeds, essentially allowing the node to push as many messages as it likes. If node X has a lot of messages to send (assume an infinite supply), there is no clear mechanism to throttle its sending rate. The same is true for all nodes, so every node keeps publishing and flooding its peers; at that point no forwarding is going on and each node is just broadcasting to its directly grafted peers, which are not forwarding those messages. If node X gets a SlowPeer event, it is unclear whether that is the peer's fault for being slow or ours for flooding it, so it is unclear whether we should slow down our own throughput or penalize the peer.
  2. FIFO queue for all non-control messages (Publish, Forward, gossip): when node X publishes a message on the topic, the send will not always succeed, because the queue may already be full. If node X has an infinite supply of messages, the failure to send them is the standard indication that back-pressure is being applied.

What am I missing here? This is the behavior I'm noticing when stress testing gossipsub... If there is only one broadcaster, backpressure works as expected: broadcasts start failing, indicating that we must reduce our throughput.
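
For reference, this is roughly what I would expect the producer side to look like under option 2, with a failed publish as the throttle signal (a sketch; it assumes publish reports queue saturation via something like PublishError::AllQueuesFull, a tokio runtime, and an arbitrary backoff, and the swarm still has to be polled concurrently for the queues to drain):

    // Sketch: a producer loop that treats publish failures as backpressure.
    use std::time::Duration;
    use libp2p::gossipsub;

    async fn publish_with_backoff(
        gs: &mut gossipsub::Behaviour,
        topic: gossipsub::IdentTopic,
        payload: Vec<u8>,
    ) {
        loop {
            match gs.publish(topic.clone(), payload.clone()) {
                Ok(_message_id) => break,
                // ASSUMPTION: AllQueuesFull is the queue-saturation error;
                // adjust to whatever your gossipsub version exposes.
                Err(gossipsub::PublishError::AllQueuesFull(_)) => {
                    // Queues are saturated: back off instead of piling on.
                    tokio::time::sleep(Duration::from_millis(100)).await;
                }
                Err(other) => {
                    tracing::warn!(error = ?other, "publish failed for another reason");
                    break;
                }
            }
        }
    }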

sirandreww-starkware avatar Aug 11 '25 11:08 sirandreww-starkware