iroh icon indicating copy to clipboard operation
iroh copied to clipboard

CI Timeout issues

Open Arqu opened this issue 1 year ago • 3 comments

Last ~20 days or so we have started seeing CI jobs fail due to stuff seemingly timing out. The ones affected I've seen so far are: netsim, cross tests, docker. It's concerning because netsim and docker tests run a plain binary and exhibit this behavior. It's become more prominent in the last 2 weeks and presumably is also affecting users.

Sample links: https://github.com/n0-computer/iroh/actions/runs/11342921218/job/31544406620?pr=2804 https://github.com/n0-computer/iroh/actions/runs/11354934395/job/31583241218?pr=2807 https://github.com/n0-computer/iroh/actions/runs/11370155916/job/31629269957?pr=2808

Some logs from netsim:

cmd: ./bins/iroh start --add data/1G.bin

2024-10-15T09:52:54.736754Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:52:56.737453Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:53:24.303831Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:53:26.304470Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:53:54.739896Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:53:56.740046Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:54:20.265772Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:54:22.265803Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:54:50.253689Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:54:52.253677Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:55:16.653801Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:55:18.654037Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:55:46.172940Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:55:48.172981Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:56:14.136309Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:56:16.136141Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:56:41.597054Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:56:43.597030Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Caused by:
    report timed out
2024-10-15T09:57:06.907840Z  WARN iroh_net::netcheck::reportgen: tick: probes timed out
2024-10-15T09:57:08.907963Z  WARN iroh_net::magicsock: failed to generate netcheck report for: periodic: report aborted

Some context from discord:

Those are the logs from the failing node running in netsim
Ie the chain of events is:
- flaky start with ./bins/iroh start --add data/1G.bin
- gets in an infinite loop of tick: probes timed out
- netsim waits for 10 sec and keeps checking output to see if it has generated a ticket
- ticket never comes, netsim nukes the process and exits with 1
- netsim job fails


in the past 3 days this has happened:
- many times in netsim
- couple times in docker tests
- now in cross
- unsure but think I've seen one or two elsewhere but at the time didn't document the links

netsim is the most likely to hang as it runs many nodes & sims with the full binary
docker less likely but also happened enough times I caught wind of it
cross is the first time I've seen it but just tells me it might be worse than expected / deeper than just cli hangs

Arqu avatar Oct 28 '24 14:10 Arqu

Seems like this is also one of those shrodinger bugs, if you look at it, you cant reproduce. Running like 15 tests with debug logging and suddenly it just works.

Seems like its a timing/race somewhere. Or the setup just got significantly slower somehow.

Arqu avatar Oct 29 '24 10:10 Arqu

@Arqu where do i find those logs that you print extracts from? Following the links just gives me opaque failure messages without a debug log of iroh.

flub avatar Oct 29 '24 11:10 flub

I've been trying to get a fresh set of these, when you run the netsim job with debug logs you get that output but I had trouble getting it to fail with debug logs enabled again. The original ones I got by sshing into the box directly.

Arqu avatar Oct 29 '24 11:10 Arqu

One contributing bug was https://github.com/n0-computer/iroh/issues/2876

dignifiedquire avatar Nov 04 '24 13:11 dignifiedquire

Stale

Arqu avatar Sep 24 '25 20:09 Arqu