rust-libp2p
rust-libp2p copied to clipboard
Failing debug_assert in Request-Response protocol
Summary
I am hitting panics in debug mode at this debug_assert.
While the debug_assert is panic in request-response code, I do NOT think is a bug in request-response. I took a look at it and it seems to be related to a combination of factors.
- Multiple inflight dial requests must be happening
- Some of those inflight dials must get denied
I made a smallish reproducer by hacking the file-sharing
example on this branch. I added the mdns
behaviour so that I can easily get multiple inflight dial requests. I added the libp2p_connection_limits
behaviour so that its easy to deny some of those requests. I set the max per peer limit to 2.
Finally to reproduce I run these commands from the example/file-sharing
directory:
Provider
RUST_LOG=info,libp2p_swarm=debug cargo run -- \
--listen-addresses /ip4/0.0.0.0/tcp/40837 \
--secret-key-seed 1 provide \
--path README.md \
--name README.md
Note using /ip4/0.0.0.0/
binds to multiple listen addresses. I have 4-5 local interfaces that mean I listen on that many addresses. This means that when mdns discovers a peer it attempts to dial each address separately and we attempt to dial them as separate concurrent dial requests.
Retriever
RUST_LOG=info,libp2p_swarm=debug cargo run -- get --name README.md
It seems that the request-response behaviour is receiving an inconsistent set of events. It receives both a call to handle_established_outbound_connection
and a swarm event DialFailure
for the same connection id. This breaks the assumptions of the request-response internal state management and triggers the debug_assert failure.
Expected behavior
I expect debug_asserts to not panic. Additionally I expect that for a single connection id it is either considered failed or established but not both.
Actual behavior
The debug_asserts panics because the request-response logic receives conflicting information about a connection.
Relevant log output
Here are logs from the modified `file-sharing` example run.
Of note is that the `request-response` behaviour receives both a `handle_established_outbound_connection` event from the swarm as well as a `DialFailure` event from the swarm for the same connection id. This confuses the request-response internal state can causes the debug_assert failure.
[2023-10-31T22:09:57Z INFO libp2p_request_response] handle_established_outbound_connection 1
[2023-10-31T22:09:57Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/192.168.0.136/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 1.
[2023-10-31T22:09:57Z INFO libp2p_request_response] handle_established_outbound_connection 3
[2023-10-31T22:09:57Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/192.168.12.189/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 2.
[2023-10-31T22:09:57Z INFO libp2p_request_response] handle_established_outbound_connection 4
[2023-10-31T22:09:57Z INFO libp2p_request_response] on_dial_failure 4 Denied { cause: ConnectionDenied { inner: Exceeded { limit: 2, kind: EstablishedPerPeer } } }
[2023-10-31T22:09:57Z INFO libp2p_request_response] handle_established_outbound_connection 2
[2023-10-31T22:09:57Z INFO libp2p_request_response] on_dial_failure 2 Denied { cause: ConnectionDenied { inner: Exceeded { limit: 2, kind: EstablishedPerPeer } } }
[2023-10-31T22:10:17Z DEBUG libp2p_swarm] Connection closed with error KeepAliveTimeout: Connected { endpoint: Dialer { address: "/ip4/192.168.0.136/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }, peer_id: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") }; Total (peer): 1.
[2023-10-31T22:10:17Z INFO libp2p_request_response] on_connection_closed 1 3 1
[2023-10-31T22:10:17Z DEBUG libp2p_swarm] Connection closed with error KeepAliveTimeout: Connected { endpoint: Dialer { address: "/ip4/192.168.12.189/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }, peer_id: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") }; Total (peer): 0.
[2023-10-31T22:10:17Z INFO libp2p_request_response] on_connection_closed 3 2 0
thread 'async-std/runtime' panicked at /home/nathanielc/projects/rust-libp2p/protocols/request-response/src/lib.rs:640:9:
assertion `left == right` failed
left: false
right: true
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at examples/file-sharing/src/network.rs:125:24:
Sender not to be dropped.: Canceled
Possible Solution
One solution is to make request-response more robust to these inconsistent events from the swarm but that seems like a fix that only hides the real problem.
Version
v0.52.4
The reproducer branch is based off the libp2p-v0.52.4
tag.
Would you like to work on fixing this bug ?
Yes
We've landed some big internal changes to request-response
recently: https://github.com/libp2p/rust-libp2p/pull/3914.
Can you retest with latest master whether or not that still happens?
I am still able to reproduce the error on master.
Here is a branch with a reproducer against master.
To repro run from within the examples/file-sharing
directory:
RUST_LOG=info,libp2p_swarm=debug cargo run -- \
--listen-address /ip4/0.0.0.0/tcp/40837 \
--secret-key-seed 1 provide \
--path README.md \
--name README.md
and
RUST_LOG=info,libp2p_swarm=debug cargo run -- get --name README.md
Logs:
[2023-11-01T20:37:59Z INFO libp2p_request_response] handle_established_outbound_connection 3
[2023-11-01T20:37:59Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/172.22.0.1/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 1.
[2023-11-01T20:37:59Z INFO libp2p_request_response] handle_established_outbound_connection 1
[2023-11-01T20:37:59Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/172.17.0.1/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 2.
[2023-11-01T20:37:59Z INFO libp2p_request_response] handle_established_outbound_connection 2
[2023-11-01T20:37:59Z INFO libp2p_request_response] on_dial_failure 2
[2023-11-01T20:37:59Z INFO libp2p_request_response] handle_established_outbound_connection 4
[2023-11-01T20:37:59Z INFO libp2p_request_response] on_dial_failure 4
I took a stab at a fix here https://github.com/libp2p/rust-libp2p/pull/4777 open to other ideas for how to fix this.
Hi, I am facing the same errors when running our Filecoin Rust client in debug mode (using libp2p-request-response-0.26.1
if that helps).
Hi, I am facing the same errors when running our Filecoin Rust client in debug mode (using
libp2p-request-response-0.26.1
if that helps).
Have you tried re-ordering your behaviours to have all connection management ones come first? See https://github.com/libp2p/rust-libp2p/pull/4777#issuecomment-1813482146.