rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

Failing debug_assert in Request-Response protocol

Open nathanielc opened this issue 1 year ago • 5 comments

Summary

I am hitting panics in debug mode at this debug_assert.

While the debug_assert is panic in request-response code, I do NOT think is a bug in request-response. I took a look at it and it seems to be related to a combination of factors.

  • Multiple inflight dial requests must be happening
  • Some of those inflight dials must get denied

I made a smallish reproducer by hacking the file-sharing example on this branch. I added the mdns behaviour so that I can easily get multiple inflight dial requests. I added the libp2p_connection_limits behaviour so that its easy to deny some of those requests. I set the max per peer limit to 2.

Finally to reproduce I run these commands from the example/file-sharing directory:

Provider

RUST_LOG=info,libp2p_swarm=debug cargo run -- \
    --listen-addresses /ip4/0.0.0.0/tcp/40837 \
    --secret-key-seed 1 provide \
    --path README.md \
    --name README.md

Note using /ip4/0.0.0.0/ binds to multiple listen addresses. I have 4-5 local interfaces that mean I listen on that many addresses. This means that when mdns discovers a peer it attempts to dial each address separately and we attempt to dial them as separate concurrent dial requests.

Retriever

RUST_LOG=info,libp2p_swarm=debug cargo run --  get --name README.md

It seems that the request-response behaviour is receiving an inconsistent set of events. It receives both a call to handle_established_outbound_connection and a swarm event DialFailure for the same connection id. This breaks the assumptions of the request-response internal state management and triggers the debug_assert failure.

Expected behavior

I expect debug_asserts to not panic. Additionally I expect that for a single connection id it is either considered failed or established but not both.

Actual behavior

The debug_asserts panics because the request-response logic receives conflicting information about a connection.

Relevant log output

Here are logs from the modified `file-sharing` example run.

Of note is that the `request-response` behaviour receives both a `handle_established_outbound_connection` event from the swarm as well as a `DialFailure` event from the swarm for the same connection id. This confuses the request-response internal state can causes the debug_assert failure.


[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 1
[2023-10-31T22:09:57Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/192.168.0.136/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 1.
[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 3
[2023-10-31T22:09:57Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/192.168.12.189/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 2.
[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 4
[2023-10-31T22:09:57Z INFO  libp2p_request_response] on_dial_failure 4 Denied { cause: ConnectionDenied { inner: Exceeded { limit: 2, kind: EstablishedPerPeer } } }
[2023-10-31T22:09:57Z INFO  libp2p_request_response] handle_established_outbound_connection 2
[2023-10-31T22:09:57Z INFO  libp2p_request_response] on_dial_failure 2 Denied { cause: ConnectionDenied { inner: Exceeded { limit: 2, kind: EstablishedPerPeer } } }
[2023-10-31T22:10:17Z DEBUG libp2p_swarm] Connection closed with error KeepAliveTimeout: Connected { endpoint: Dialer { address: "/ip4/192.168.0.136/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }, peer_id: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") }; Total (peer): 1.
[2023-10-31T22:10:17Z INFO  libp2p_request_response] on_connection_closed 1 3 1
[2023-10-31T22:10:17Z DEBUG libp2p_swarm] Connection closed with error KeepAliveTimeout: Connected { endpoint: Dialer { address: "/ip4/192.168.12.189/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }, peer_id: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") }; Total (peer): 0.
[2023-10-31T22:10:17Z INFO  libp2p_request_response] on_connection_closed 3 2 0
thread 'async-std/runtime' panicked at /home/nathanielc/projects/rust-libp2p/protocols/request-response/src/lib.rs:640:9:
assertion `left == right` failed
  left: false
 right: true
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at examples/file-sharing/src/network.rs:125:24:
Sender not to be dropped.: Canceled

Possible Solution

One solution is to make request-response more robust to these inconsistent events from the swarm but that seems like a fix that only hides the real problem.

Version

v0.52.4

The reproducer branch is based off the libp2p-v0.52.4 tag.

Would you like to work on fixing this bug ?

Yes

nathanielc avatar Oct 31 '23 22:10 nathanielc

We've landed some big internal changes to request-response recently: https://github.com/libp2p/rust-libp2p/pull/3914.

Can you retest with latest master whether or not that still happens?

thomaseizinger avatar Nov 01 '23 01:11 thomaseizinger

I am still able to reproduce the error on master.

Here is a branch with a reproducer against master.

To repro run from within the examples/file-sharing directory:

RUST_LOG=info,libp2p_swarm=debug cargo run -- \
    --listen-address /ip4/0.0.0.0/tcp/40837 \
    --secret-key-seed 1 provide \
    --path README.md \
    --name README.md

and

RUST_LOG=info,libp2p_swarm=debug cargo run --  get --name README.md

Logs:

[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 3
[2023-11-01T20:37:59Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/172.22.0.1/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 1.
[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 1
[2023-11-01T20:37:59Z DEBUG libp2p_swarm] Connection established: PeerId("12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X") Dialer { address: "/ip4/172.17.0.1/tcp/40837/p2p/12D3KooWPjceQrSwdWXPyLLeABRXmuqt69Rg3sBYbU1Nft9HyQ6X", role_override: Dialer }; Total (peer): 2.
[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 2
[2023-11-01T20:37:59Z INFO  libp2p_request_response] on_dial_failure 2
[2023-11-01T20:37:59Z INFO  libp2p_request_response] handle_established_outbound_connection 4
[2023-11-01T20:37:59Z INFO  libp2p_request_response] on_dial_failure 4

nathanielc avatar Nov 01 '23 20:11 nathanielc

I took a stab at a fix here https://github.com/libp2p/rust-libp2p/pull/4777 open to other ideas for how to fix this.

nathanielc avatar Nov 01 '23 21:11 nathanielc

Hi, I am facing the same errors when running our Filecoin Rust client in debug mode (using libp2p-request-response-0.26.1 if that helps).

elmattic avatar Apr 02 '24 12:04 elmattic

Hi, I am facing the same errors when running our Filecoin Rust client in debug mode (using libp2p-request-response-0.26.1 if that helps).

Have you tried re-ordering your behaviours to have all connection management ones come first? See https://github.com/libp2p/rust-libp2p/pull/4777#issuecomment-1813482146.

thomaseizinger avatar Apr 08 '24 12:04 thomaseizinger