tests(multipath): add failing test
Description
This adds a simple test: one server that accepts connections, and then 3 clients that are created consecutively and each connect to the server, exchange data, close the connection, and close their endpoint. After a client is finished, the next client is created (new endpoint, new secret key).
The test comes in three variations that share their implementation:
- `test_one_server_two_clients_local_relay`: uses a local relay
- `test_one_server_two_clients_public_relay`: uses the public relay
- `test_one_server_two_clients_no_relay`: doesn't use relays at all
The no-relay and local-relay variants pass for me. The public-relay one, however, times out, but only for the second client: the first client succeeds, the second connect attempt times out at the client.
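For reference, the shared implementation is roughly shaped like the following client-side sketch. This is a hedged illustration against iroh's public `Endpoint` API, not the actual test code: the ALPN, the tiny echo exchange, and the exact builder/close method names are assumptions and may differ between iroh versions.

```rust
use anyhow::Result;
use iroh::{Endpoint, NodeAddr};

// Illustrative ALPN; the real test defines its own.
const ALPN: &[u8] = b"/iroh/test/multipath";

// One client iteration: fresh endpoint (and thus fresh secret key),
// connect, exchange a little data, close the connection, close the endpoint.
async fn run_client(server_addr: NodeAddr) -> Result<()> {
    let ep = Endpoint::builder().bind().await?;
    let conn = ep.connect(server_addr, ALPN).await?;
    let (mut send, mut recv) = conn.open_bi().await?;
    send.write_all(b"hello").await?;
    send.finish()?;
    let _reply = recv.read_to_end(64).await?;
    conn.close(0u32.into(), b"done");
    ep.close().await;
    Ok(())
}
```

The server side accepts the clients one after the other; that sequential accept becomes relevant in the public-relay failure analyzed below.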
Breaking Changes
Notes & open questions
Change checklist
- [ ] Self-review.
- [ ] Documentation updates following the style guide, if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.
- [ ] List all breaking changes in the above "Breaking Changes" section.
- [ ] Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:
  - [ ] `quic-rpc`
  - [ ] `iroh-gossip`
  - [ ] `iroh-blobs`
  - [ ] `dumbpipe`
  - [ ] `sendme`
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3689/docs/iroh/
Last updated: 2025-11-25T13:44:56Z
Netsim report & logs for this PR have been generated and are available at: LOGS. This report will remain available for 3 days.
Last updated for commit: 2c0802e
Initial investigation for `test_one_server_two_clients_public_relay`:
Once the second client starts to connect, I do get the logs for it sending datagrams via the `RemoteStateActor` on all paths (IP and relay). On the server, however, browsing the logs from that send timestamp forward, I only get these:
```
2025-11-20T20:54:02.226494Z TRACE server: netwatch::udp: UDP recv src=192.168.18.55:32827 len=1200 count=1 dst=192.168.18.55
2025-11-20T20:54:02.226522Z TRACE server: iroh::magicsock: UDP recv: 1 packets
2025-11-20T20:54:02.554427Z TRACE server:relay-actor:active-relay{url=https://euc1-1.relay.n0.iroh-canary.iroh.link./}:connected: iroh::magicsock::transports::relay::actor: received msg len=1200
2025-11-20T20:54:02.554634Z TRACE server: iroh::magicsock::mapped_addrs: generated new addr addr=RelayMappedAddr(fd15:70a:510b:1::3) key=(RelayUrl("https://euc1-1.relay.n0.iroh-canary.iroh.link./"), PublicKey(5965936c12b5b298af175fed203e78ef0f173a60786efb89575464463f9cdea7))
```
and .. that's it? I do get some further logs, but for a connection that can be identified as the older (closed!) connection from the first client.
I don't see any `quinn_proto` recv lines for the datagram received from the second client, which is weird: it seems these initial packets are received by the server and get up to the point where "generated new addr" is logged, but never make their way into the quinn endpoint.
Interestingly, CI fails only for ubuntu-stable with features `none` or `default`; all others pass! I can reproduce the failure reliably locally, though.
Some findings, looking at the logs of the test with the public relay. The first client works fine. Then the second client starts. Looking at the server's logs for the second client:
- The client's Initial packet arrives over the IP path: `quinn_proto::Endpoint::handle` emits `Some(DatagramEvent::NewConnection)`.
- The client's Initial packet arrives again over the relay path: `Endpoint::handle` emits `None`; internally, `RouteTo::Initial(0)` is the case that leads to `handle` returning `None`.
- The next client packet arrives over IP. Strangely, `quinn_proto::Endpoint::handle` now returns `None` instead of a `DatagramEvent::ConnectionEvent`; in `handle` it again hits the `RouteTo::Initial(0)` case.

Comparing to the first, successful flow, in the first client's attempt we have:
- First packet over IP arrives -> `DatagramEvent::NewConnection`
- Second packet over IP arrives -> `DatagramEvent::ConnectionEvent`
So in the working case the second packet over IP also has a different `dst_cid` than the first packet, whereas for the second (broken) client the second packet over IP has the same `dst_cid` as the first packet.
Attaching logs here (with additional debug log statements I inserted in `iroh_quinn`): log-public.txt
Ooh, I think I found the reason! What happens is this (thanks qlog!!):
1. Server and client create their endpoints and connect to the public relay.
2. Client connects to the server; it has the server's relay and IP addr.
3. Client sends its Initial over all paths (IP and relay).
4. Server receives the Initial over the IP path, a new connection is created, and the flow continues as it should.
5. Server receives the Initial again over the relay path quite a bit later. It creates another Connection.
6. The test hangs because it sequentially accepts incomings, and this second connection never proceeds because no further packets arrive via that path.
So step 5 seems wrong: it should discard the second Initial instead of initiating a new connection.
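To make step 6 concrete, here is a hedged sketch of the sequential accept pattern the test uses (the echo body and error handling are illustrative, not the actual test code). The bogus incoming created from the duplicated Initial is awaited inline, so every later client is stuck behind a handshake that never completes:

```rust
use anyhow::Result;
use iroh::Endpoint;

// Sequential accept: each incoming handshake is awaited inline. If an
// incoming is the bogus connection created from the duplicated Initial on
// the relay path, `incoming.await` does not resolve until the handshake
// timeout, and the loop cannot accept anything else in the meantime.
async fn serve_sequential(ep: Endpoint) -> Result<()> {
    while let Some(incoming) = ep.accept().await {
        let conn = incoming.await?; // blocks the whole loop on a dead handshake
        let (mut send, mut recv) = conn.accept_bi().await?;
        let msg = recv.read_to_end(64).await?;
        send.write_all(&msg).await?;
        send.finish()?;
        conn.closed().await;
    }
    Ok(())
}
```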
> So step 5 seems wrong: it should discard the second Initial instead of initiating a new connection.
Is the original connection closed before it receives the second initial, or are there two simultaneous connections?
> Is the original connection closed before it receives the second initial, or are there two simultaneous connections?
not sure yet, rerunning with more logs now.
> Is the original connection closed before it receives the second initial, or are there two simultaneous connections?
The original connection is already closed at the server when the server receives the second, duplicate Initial. So maybe there's not much we can do about this, apart from advising to always, even in tests, spawn new tasks for awaiting incomings, and not to care about them timing out. It's still a bit unfortunate, because the server will keep sending its Initial until the 30s timeout hits on this bogus connection.
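A hedged sketch of that spawn-per-incoming pattern with tokio (illustrative, not a proposed change to iroh itself): the bogus incoming then only ties up its own task until the handshake timeout, while later clients are accepted normally.

```rust
use anyhow::Result;
use iroh::{endpoint::Connection, Endpoint};

// Spawn a task per incoming: a handshake that never completes (like the
// bogus connection from the duplicated Initial) times out in its own task
// instead of blocking the accept loop.
async fn serve_spawning(ep: Endpoint) -> Result<()> {
    while let Some(incoming) = ep.accept().await {
        tokio::spawn(async move {
            let conn = match incoming.await {
                Ok(conn) => conn,
                // The bogus connection ends up here after the timeout.
                Err(err) => {
                    tracing::debug!("incoming failed: {err}");
                    return;
                }
            };
            if let Err(err) = handle(conn).await {
                tracing::debug!("connection failed: {err:#}");
            }
        });
    }
    Ok(())
}

// Same illustrative echo handler as in the sketch above.
async fn handle(conn: Connection) -> Result<()> {
    let (mut send, mut recv) = conn.accept_bi().await?;
    let msg = recv.read_to_end(64).await?;
    send.write_all(&msg).await?;
    send.finish()?;
    conn.closed().await;
    Ok(())
}
```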
@Frando can you update these and see if they're still relevant?