tests(multipath): add failing test
Description
This adds a simple test: one server that accepts connections, and then 3 clients that are created consecutively and each connect to the server, exchange data, close the connection, and close their endpoint. After a client is finished, the next client is created (new endpoint, new secret key).
The test comes in three variations that share their implementation:
- `test_one_server_two_clients_local_relay`: uses a local relay
- `test_one_server_two_clients_public_relay`: uses the public relay
- `test_one_server_two_clients_no_relay`: doesn't use relays at all
The no-relay and local-relay variants pass for me. The public-relay one, however, times out, but only for the second client: the first client succeeds, the second connect attempt times out at the client.
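For reference, the shared implementation is roughly shaped like the following client-side sketch. This is a hedged illustration against iroh's public `Endpoint` API, not the actual test code: the ALPN, the tiny echo exchange, and the exact builder/close method names are assumptions and may differ between iroh versions.

```rust
use anyhow::Result;
use iroh::{Endpoint, NodeAddr};

// Illustrative ALPN; the real test defines its own.
const ALPN: &[u8] = b"/iroh/test/multipath";

// One client iteration: fresh endpoint (and thus fresh secret key),
// connect, exchange a little data, close the connection, close the endpoint.
async fn run_client(server_addr: NodeAddr) -> Result<()> {
    let ep = Endpoint::builder().bind().await?;
    let conn = ep.connect(server_addr, ALPN).await?;
    let (mut send, mut recv) = conn.open_bi().await?;
    send.write_all(b"hello").await?;
    send.finish()?;
    let _reply = recv.read_to_end(64).await?;
    conn.close(0u32.into(), b"done");
    ep.close().await;
    Ok(())
}
```

The server side accepts the clients one after the other; that sequential accept becomes relevant in the public-relay failure analyzed below.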
Breaking Changes
Notes & open questions
Change checklist
- [ ] Self-review.
- [ ] Documentation updates following the style guide, if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.
- [ ] List all breaking changes in the above "Breaking Changes" section.
- [ ] Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:
  - [ ] `quic-rpc`
  - [ ] `iroh-gossip`
  - [ ] `iroh-blobs`
  - [ ] `dumbpipe`
  - [ ] `sendme`
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3689/docs/iroh/
Last updated: 2025-11-25T13:44:56Z
Netsim report & logs for this PR have been generated and are available at: LOGS. This report will remain available for 3 days.
Last updated for commit: 2c0802e
Initial investigation for `test_one_server_two_clients_public_relay`:
Once the second client starts to connect, I do get the logs for it sending datagrams via the `RemoteStateActor` on all paths (IP and relay). On the server, however, browsing the logs from that send timestamp forward, I only get these:
```
2025-11-20T20:54:02.226494Z TRACE server: netwatch::udp: UDP recv src=192.168.18.55:32827 len=1200 count=1 dst=192.168.18.55
2025-11-20T20:54:02.226522Z TRACE server: iroh::magicsock: UDP recv: 1 packets
2025-11-20T20:54:02.554427Z TRACE server:relay-actor:active-relay{url=https://euc1-1.relay.n0.iroh-canary.iroh.link./}:connected: iroh::magicsock::transports::relay::actor: received msg len=1200
2025-11-20T20:54:02.554634Z TRACE server: iroh::magicsock::mapped_addrs: generated new addr addr=RelayMappedAddr(fd15:70a:510b:1::3) key=(RelayUrl("https://euc1-1.relay.n0.iroh-canary.iroh.link./"), PublicKey(5965936c12b5b298af175fed203e78ef0f173a60786efb89575464463f9cdea7))
```
and .. that's it? I do get some further logs, but for a connection that can be identified as the older (closed!) connection from the first client.
I don't see any `quinn_proto` recv lines for the datagram received from the second client, which is weird: it seems these initial packets are received by the server and get up to the point where "generated new addr" is logged, but never make their way into the quinn endpoint.
Interestingly, CI fails only for ubuntu-stable with features `none` or `default`; all others pass! I can reproduce the failure reliably locally, though.
Some findings, looking at the logs of the test with the public relay. The first client works fine. Then the second client starts. Looking at the server's logs for the second client:
- The client's Initial packet arrives over the IP path: `quinn_proto::Endpoint::handle` emits `Some(DatagramEvent::NewConnection)`.
- The client's Initial packet arrives again over the relay path: `Endpoint::handle` emits `None`; internally, `RouteTo::Initial(0)` is the case that leads to `handle` returning `None`.
- The next client packet arrives over IP. Strangely, `quinn_proto::Endpoint::handle` now returns `None` instead of a `DatagramEvent::ConnectionEvent`; in `handle` it again hits the `RouteTo::Initial(0)` case.

Comparing to the first, successful flow, in the first client's attempt we have:
- First packet over IP arrives -> `DatagramEvent::NewConnection`
- Second packet over IP arrives -> `DatagramEvent::ConnectionEvent`
So in the working case the second packet over IP also has a different `dst_cid` than the first packet, whereas for the second (broken) client the second packet over IP has the same `dst_cid` as the first packet.
Attaching logs here (with additional debug log statements I inserted in `iroh_quinn`): log-public.txt
Ooh, I think I found the reason! What happens is this (thanks qlog!!):
1. Server and client create their endpoints and connect to the public relay.
2. Client connects to the server; it has the server's relay and IP addr.
3. Client sends its Initial over all paths (IP and relay).
4. Server receives the Initial over the IP path, a new connection is created, and the flow continues as it should.
5. Server receives the Initial again over the relay path quite a bit later. It creates another Connection.
6. The test hangs because it sequentially accepts incomings, and this second connection never proceeds because no further packets arrive via that path.
So step 5 seems wrong: it should discard the second Initial instead of initiating a new connection.
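To make step 6 concrete, here is a hedged sketch of the sequential accept pattern the test uses (the echo body and error handling are illustrative, not the actual test code). The bogus incoming created from the duplicated Initial is awaited inline, so every later client is stuck behind a handshake that never completes:

```rust
use anyhow::Result;
use iroh::Endpoint;

// Sequential accept: each incoming handshake is awaited inline. If an
// incoming is the bogus connection created from the duplicated Initial on
// the relay path, `incoming.await` does not resolve until the handshake
// timeout, and the loop cannot accept anything else in the meantime.
async fn serve_sequential(ep: Endpoint) -> Result<()> {
    while let Some(incoming) = ep.accept().await {
        let conn = incoming.await?; // blocks the whole loop on a dead handshake
        let (mut send, mut recv) = conn.accept_bi().await?;
        let msg = recv.read_to_end(64).await?;
        send.write_all(&msg).await?;
        send.finish()?;
        conn.closed().await;
    }
    Ok(())
}
```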
> So step 5 seems wrong: it should discard the second Initial instead of initiating a new connection.
Is the original connection closed before it receives the second initial, or are there two simultaneous connections?
> Is the original connection closed before it receives the second initial, or are there two simultaneous connections?
not sure yet, rerunning with more logs now.
> Is the original connection closed before it receives the second initial, or are there two simultaneous connections?
The original connection is already closed at the server when the server receives the second, duplicate Initial. So maybe there's not much we can do about this, apart from advising to always, even in tests, spawn new tasks for awaiting incomings, and not to care about them timing out. It's still a bit unfortunate, because the server will keep sending its Initial until the 30s timeout hits on this bogus connection.
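A hedged sketch of that spawn-per-incoming pattern with tokio (illustrative, not a proposed change to iroh itself): the bogus incoming then only ties up its own task until the handshake timeout, while later clients are accepted normally.

```rust
use anyhow::Result;
use iroh::{endpoint::Connection, Endpoint};

// Spawn a task per incoming: a handshake that never completes (like the
// bogus connection from the duplicated Initial) times out in its own task
// instead of blocking the accept loop.
async fn serve_spawning(ep: Endpoint) -> Result<()> {
    while let Some(incoming) = ep.accept().await {
        tokio::spawn(async move {
            let conn = match incoming.await {
                Ok(conn) => conn,
                // The bogus connection ends up here after the timeout.
                Err(err) => {
                    tracing::debug!("incoming failed: {err}");
                    return;
                }
            };
            if let Err(err) = handle(conn).await {
                tracing::debug!("connection failed: {err:#}");
            }
        });
    }
    Ok(())
}

// Same illustrative echo handler as in the sketch above.
async fn handle(conn: Connection) -> Result<()> {
    let (mut send, mut recv) = conn.accept_bi().await?;
    let msg = recv.read_to_end(64).await?;
    send.write_all(&msg).await?;
    send.finish()?;
    conn.closed().await;
    Ok(())
}
```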
@Frando can you update these and see if they're still relevant?