
Bug: rendezvous server panics in response to libp2p_request_response::Event::Message

scglenn opened this issue 8 months ago • 8 comments

Summary

We are currently running a rendezvous server that helps nodes bootstrap into a set of peers under a topic.

The rendezvous server handled up to about 1,000 connections with no issue, but we have now seen it panic from within the libp2p-rendezvous crate.

The number of connections has increased to about 3k-5k, and that is when this issue started to show up. We removed the pending-connection and established-connection limits and then started to see this behavior.
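
For context, the limits we removed were configured via libp2p's connection-limits behaviour, roughly along these lines (the numbers here are placeholders, not our exact values):

```rust
// Illustrative sketch only: roughly how the pending/established connection
// limits were set up (via libp2p::connection_limits) before we removed them.
// The numbers are placeholders, not our production values.
use libp2p::connection_limits::{Behaviour, ConnectionLimits};

fn connection_limits_behaviour() -> Behaviour {
    Behaviour::new(
        ConnectionLimits::default()
            .with_max_pending_incoming(Some(256))
            .with_max_pending_outgoing(Some(256))
            .with_max_established(Some(1_000)),
    )
}
```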

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:

This is where the panic occurs: https://github.com/libp2p/rust-libp2p/blob/74e315719cb4f934ba9b7052b8bc5e565dda3599/protocols/rendezvous/src/server.rs#L184

I'm unsure if the issue is related to the number of connections, but that's the only thing that has changed.

The doc comments suggest that self.inner.send_response will bubble the error up: "If the [ResponseChannel] is already closed due to a timeout or the connection being closed, the response is returned as an Err for further handling." https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/lib.rs#L484
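
To illustrate how I read that contract, here is a minimal sketch (not the rendezvous crate's internal code, and assuming the tracing crate for logging) of a request-response caller handling that Err instead of unwrapping it; the codec and response-builder closure are placeholders:

```rust
use libp2p::request_response::{self, Codec, Message};

// Minimal sketch: respond to an inbound request and treat a closed
// ResponseChannel as a recoverable condition rather than a panic.
fn handle_request_response_event<C>(
    behaviour: &mut request_response::Behaviour<C>,
    event: request_response::Event<C::Request, C::Response>,
    build_response: impl FnOnce(&C::Request) -> C::Response,
) where
    C: Codec + Send + Clone + 'static,
    C::Response: std::fmt::Debug,
{
    if let request_response::Event::Message {
        peer,
        message: Message::Request { request, channel, .. },
        ..
    } = event
    {
        let response = build_response(&request);
        // Per the docs: Err(response) means the channel was already closed
        // (timeout or connection closed), so log and carry on instead of
        // taking down the whole swarm.
        if let Err(unsent) = behaviour.send_response(channel, response) {
            tracing::warn!(%peer, ?unsent, "response channel already closed");
        }
    }
}
```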

Questions:

  • Is it correct behavior for self.inner.send_response to panic if a connection timed out or was closed?
  • How are we supposed to handle this?
  • Any ideas as to why this could be happening?

Additional Note:

  • I would expect the system to be able to handle many more connections than this, given that I have seen others share data where nodes handle 10k-40k connections.

Expected behavior

I would expect the response to fail for that specific connection and the rendezvous server to continue handling all the other connections, rather than crashing.

Actual behavior

It panics frequently, after a couple of minutes of handling connections.

Relevant log output

`thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:`

Possible Solution

Perhaps the error should just be handled in a way that doesn't panic the entire swarm?

Version

libp2p = "0.53.2"

Would you like to work on fixing this bug?

Maybe

scglenn avatar Apr 19 '25 18:04 scglenn

Additionally, I saw an interesting note from a previous issue: "However, it is true that the peer could close the connection before it has a chance to respond to the request, though in this case it happened after the request is received while polling the request-response behaviour."

https://github.com/libp2p/rust-libp2p/issues/5570#issuecomment-2322769659

@dariusc93 Perhaps this is similar?

scglenn avatar Apr 19 '25 20:04 scglenn

It's very possible. Could you provide any logs and your swarm configuration?

dariusc93 avatar Apr 20 '25 04:04 dariusc93

p2psetup.txt

Perhaps this helps @dariusc93

The logs I have at this moment don't show much more insight beyond the panic itself; I can enable debug logging and see if there is anything notable.

Note that I did test a deployment where I set the connection limits to the default, which I believe is just a max u32, so I don't think that is causing the issue.

Additionally, does it make sense for a panic to occur here? I would expect connections to potentially expire or close at any time, but I wouldn't expect that to crash the swarm.

scglenn avatar Apr 20 '25 21:04 scglenn

Thanks. Though it should not be the issue, you might want to move the connection-limits behaviour to the top so it is handled first when the behaviours are polled by the swarm; polling it later in the tree can cause an inconsistent state if the connection is later denied while other behaviours have already accepted it. See https://github.com/libp2p/rust-libp2p/issues/4870. Besides that, I don't see an exact cause in the code you provided, so it may just be some inconsistency in the state of the behaviours.
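
Roughly what I mean — as far as I know, the field order in the derived behaviour is the order in which they are polled, so something like this (the field names and the extra behaviours are just placeholders for your actual setup):

```rust
use libp2p::{connection_limits, identify, ping, rendezvous, swarm::NetworkBehaviour};

// Placeholder behaviour struct: the point is only that `limits` is declared
// first, so it can deny a connection before the rendezvous server and the
// other behaviours see it.
#[derive(NetworkBehaviour)]
struct ServerBehaviour {
    limits: connection_limits::Behaviour,
    rendezvous: rendezvous::server::Behaviour,
    identify: identify::Behaviour,
    ping: ping::Behaviour,
}
```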

dariusc93 avatar Apr 20 '25 22:04 dariusc93

Perhaps I should also move the rendezvous server behaviour to the top, with the connection-limits behaviour above it, so that rendezvous server events get handled first?

scglenn avatar Apr 20 '25 22:04 scglenn

Moving the connection-limits and rendezvous server behaviours to the top has not fixed the panic so far.

scglenn avatar Apr 21 '25 17:04 scglenn

Could you provide a log with RUST_LOG=debug used for when it panics?
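
In case it helps, a minimal sketch of a subscriber setup that honours RUST_LOG (assuming the tracing-subscriber crate with its env-filter feature):

```rust
use tracing_subscriber::EnvFilter;

// Initialize logging so RUST_LOG=debug takes effect for the whole process.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();
}
```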

dariusc93 avatar Apr 21 '25 20:04 dariusc93

This is from the panic before we forked libp2p:

Apr 17 00:48:22 ip-10-0-1-103 node[4634]: thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:
Apr 17 00:48:22 ip-10-0-1-103 node[4634]: Send response: DiscoverResponse(Ok(([Registration { namespace: Namespace(<redacted>), record: PeerRecord { peer_id: PeerId(<redacted>), seq: 1744850900, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): e153a5b4ffd298f7c934dbc11c91e5b6e97e1ad6984117a6d147d4a644a7b7a1) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 225, 83, 165, 180, 255, 210, 152, 247, 201, 52, 219, 193, 28, 145, 229, 182, 233, 126, 26, 214, 152, 65, 23, 166, 209, 71, 212, 166, 68, 167, 183, 161, 16, 212, 159, 129, 192, 6, 26, 10, 10, 8, 4, 62, 169, 24, 179, 6, 122, 81], signature: [168, 181, 223, 55, 194, 86, 7, 237, 139, 113, 177, 178, 114, 139, 221, 17, 56, 243, 111, 91, 190, 249, 44, 38, 128, 146, 6, 61, 50, 236, 64, 221, 174, 142, 131, 40, 183, 86, 182, 102, 248, 208, 93, 21, 137, 51, 50, 77, 254, 110, 129, 105, 252, 42, 36, 77, 174, 241, 98, 215, 114, 56, 15, 13] } }, ttl: 7200 }], Cookie { id: 4808417751412047776, namespace: Some(Namespace(<redacted>)) })))

We updated the rendezvous logic to not panic if a connection dropped, and now we see the following (we added our own log that says "Failed to send response" rather than panicking):

$ sudo journalctl -u service --since "2025-04-21 20:09:00" -f | grep -i "failed to send response"
Apr 21 20:30:05 ip-10-0-1-103 drosera-seed-node[42328]: 2025-04-21T20:30:05.712449Z  WARN Swarm::poll: libp2p_rendezvous::server: Failed to send response: DiscoverResponse(Ok(([Registration { namespace: Namespace("<redacted>"), record: PeerRecord { peer_id: PeerId("<redacted>"), seq: 1745267404, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): 83ed6a1568123621ae8d21d84a9f63c23fb31872a47b53660ef8c8682f8ed31) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 131, 237, 106, 21, 104, 18, 54, 33, 174, 141, 33, 216, 74, 159, 99, 194, 63, 179, 24, 7, 42, 71, 181, 54, 96, 239, 140, 134, 130, 248, 237, 49, 16, 204, 213, 154, 192, 6, 26, 10, 10, 8, 4, 167, 86, 80, 63, 6, 122, 81], signature: [179, 153, 53, 19, 123, 37, 166, 35, 71, 14, 5, 160, 71, 9, 163, 78, 109, 152, 252, 79, 9, 82, 162, 7, 24, 52, 21, 242, 51, 55, 6, 127, 62, 254, 152, 205, 97, 170, 69, 28, 16, 37, 55, 34, 98, 76, 73, 107, 91, 240, 32, 245, 148, 55, 251, 229, 231, 241, 94, 0, 122, 249, 186, 3] } }, ttl: 7200 }], Cookie { id: 8618633838550913715, namespace: Some(Namespace("<redacted>")) }))) peer_id=<same as previous redacted>
Apr 21 20:35:05 ip-10-0-1-103 drosera-seed-node[42328]: 2025-04-21T20:35:05.775727Z  WARN Swarm::poll: libp2p_rendezvous::server: Failed to send response: DiscoverResponse(Ok(([Registration { namespace: Namespace("<redacted>"), record: PeerRecord { peer_id: PeerId("<redacted>"), seq: 1745267704, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): 83ed6a1568123621ae8d21d84a9f63c23fb31872a47b53660ef8c8682f8ed31) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 131, 237, 106, 21, 104, 18, 54, 33, 174, 141, 33, 216, 74, 159, 99, 194, 63, 179, 24, 7, 42, 71, 181, 54, 96, 239, 140, 134, 130, 248, 237, 49, 16, 248, 215, 154, 192, 6, 26, 10, 10, 8, 4, 167, 86, 80, 63, 6, 122, 81], signature: [6, 199, 249, 112, 29, 188, 211, 73, 96, 55, 142, 133, 204, 49, 187, 68, 237, 66, 216, 79, 248, 74, 169, 151, 51, 34, 73, 8, 28, 21, 101, 81, 194, 165, 165, 204, 226, 134, 123, 0, 143, 162, 156, 30, 251, 153, 248, 218, 8, 197, 175, 60, 238, 214, 10, 182, 160, 65, 44, 133, 228, 166, 6, 3] } }, ttl: 7200 }], Cookie { id: 6970538189100493197, namespace: Some(Namespace("<redacted>")) }))) peer_id=<redacted>

I redacted a couple of things, but it's all the same peer and the same IP.

Here is the change we made as a PR: https://github.com/libp2p/rust-libp2p/pull/6002

With these changes, we are no longer panicking.

scglenn avatar Apr 21 '25 21:04 scglenn