CI Timeout issues
Last ~20 days or so we have started seeing CI jobs fail due to stuff seemingly timing out. The ones affected I've seen so far are: netsim, cross tests, docker. It's concerning because netsim and docker tests run a plain binary and exhibit this behavior. It's become more prominent in the last 2 weeks and presumably is also affecting users.
Sample links: https://github.com/n0-computer/iroh/actions/runs/11342921218/job/31544406620?pr=2804 https://github.com/n0-computer/iroh/actions/runs/11354934395/job/31583241218?pr=2807 https://github.com/n0-computer/iroh/actions/runs/11370155916/job/31629269957?pr=2808
Some logs from netsim:
cmd: ./bins/iroh start --add data/1G.bin
2024-10-15T09:52:54.736754Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:52:56.737453Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:53:24.303831Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:53:26.304470Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:53:54.739896Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:53:56.740046Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:54:20.265772Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:54:22.265803Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:54:50.253689Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:54:52.253677Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:55:16.653801Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:55:18.654037Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:55:46.172940Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:55:48.172981Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:56:14.136309Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:56:16.136141Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:56:41.597054Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:56:43.597030Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Caused by:
report timed out
2024-10-15T09:57:06.907840Z WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:57:08.907963Z WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted
Some context from discord:
Those are the logs from the failing node running in netsim
Ie the chain of events is:
- flaky start with ./bins/iroh start --add data/1G.bin
- gets in an infinite loop of tick: probes timed out
- netsim waits for 10 sec and keeps checking output to see if it has generated a ticket
- ticket never comes, netsim nukes the process and exits with 1
- netsim job fails
in the past 3 days this has happened:
- many times in netsim
- couple times in docker tests
- now in cross
- unsure but think I've seen one or two elsewhere but at the time didn't document the links
netsim is the most likely to hang as it runs many nodes & sims with the full binary
docker less likely but also happened enough times I caught wind of it
cross is the first time I've seen it but just tells me it might be worse than expected / deeper than just cli hangs
Seems like this is also one of those shrodinger bugs, if you look at it, you cant reproduce. Running like 15 tests with debug logging and suddenly it just works.
Seems like its a timing/race somewhere. Or the setup just got significantly slower somehow.
@Arqu where do i find those logs that you print extracts from? Following the links just gives me opaque failure messages without a debug log of iroh.
I've been trying to get a fresh set of these, when you run the netsim job with debug logs you get that output but I had trouble getting it to fail with debug logs enabled again. The original ones I got by sshing into the box directly.
One contributing bug was https://github.com/n0-computer/iroh/issues/2876
Stale