fix(test): flaky t0182-circuit-relay.sh
This PR removes fixup introduced in https://github.com/ipfs/go-ipfs/pull/8868#discussion_r860219918 so we can dig into the underlying cause.
About the flaky test
It started failing every 3-10 run after we switched to go-libp2p 0.19.0 in https://github.com/ipfs/go-ipfs/pull/8868 – tests in go-ipfs/test/sharness/t0182-circuit-relay.sh became flaky.
ipfsi 2 swarm connect /p2p/$PEERID_1/p2p-circuit/p2p/$PEERID_0 sometimes fails with:
NO_RESERVATION (204)
Error: connect 12D3KooWLuY1Q13o6Q91NoJn4Qv7RSq5rYS6uC4YCzKCJ3S8GnwQ failure: failed to dial 12D3KooWLuY1Q13o6Q91NoJn4Qv7RSq5rYS6uC4YCzKCJ3S8GnwQ:
* [/p2p/12D3KooWNfs8uFpQf6NnseZKLyS3c8EFppGtUGejHpViBeg8f7Vt/p2p-circuit] error opening relay circuit: NO_RESERVATION (204)`
- it happens randomly, bumping it to 10 seconds does not help much, it still fails sometimes.
- shutting down nodes and starting them again fixes the problem (what is what removed code did)
- ...but it feels like covering up some underlying racy bug: once
NO_RESERVATIONerror is returned, the relay will never work, no matter how long we wait and retry again (only way to fix it is to reboot of the relay node / testbed).
- ...but it feels like covering up some underlying racy bug: once
Repro (keep running t0182-circuit-relay.sh until the error):
$ killall ipfs ; i=0; while ./t0182-circuit-relay.sh -v; do echo -n " -----> $i <----\n\n\n"; sleep 1; ((i=i+1)) ; done
cc @aschmahmann @marten-seemann @schomatis – any ideas how to debug this?
Would be interesting to see why we don't get a reservation: GOLOG_LOG_LEVEL="autorelay=debug".
2022-05-12 conversation: this may be relevant for the release if it means a new version of go-libp2p is needed. It may just be a problem with the tests though.
Who's maintaining https://github.com/ipfs/iptb? cc @BigLep
@schomatis afaik nobody – it was not touched since I remember (2018 or later)
2022-05-26 conversation: we're not viewing this as release blocking for 0.13 because even if there are some issues with circuit relay v2, we at least have mostly working hole punching. 0.13 is better than before.
The ideal quick thing to do is to see if this is a new regression or long standing. We could use an older version and run Lidel's loop command to see if the flakiness manifests.
The ideal quick thing to do is to see if this is a new regression or long standing. We could use an older version and run Lidel's loop command to see if the flakiness manifests.
@lidel seems to have already pinpoint this to https://github.com/ipfs/go-ipfs/pull/8868 per OP.
I can get the detailed logs for @marten-seemann early next week but I'm going to push for either us really owning IPTB or replacing it altogether (with testground or whatever we use these days).
Sounds upstream tests are flaky as well https://github.com/libp2p/go-libp2p/pull/1576#issuecomment-1140254214
This one-liner hack was the easiest I've found to get the logs of IPTB nodes.
After the failed sharness test the logs are in
NID=0
cat "./trash directory.t0182-circuit-relay.sh/.iptb/testbeds/default/$NID/daemon.stderr"
Note that a successful sharness test run removes this directory (so it's left only after the failed one, copy the log before re-running) and restarting the node (iptb start) within the sharness test also truncates the log file.
I'm seeing an error in node 0, but you'd likely need to extend the log subsystem to get more details:
DEBUG autorelay autorelay/relay_finder.go:224 node 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T not accepted as a candidate: error connecting to relay 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T: failed to dial 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T:
* [/ip4/127.0.0.1/tcp/38569] dial tcp4 127.0.0.1:38569: connect: connection refused
cc @lidel @marten-seemann
2022-06-14 conversation: we're going to wait until the next go-libp2p release before investigating this further since there are changes coming.
go-libp2p v0.21 landed in https://github.com/ipfs/kubo/pull/9074, rebasing this PR to see if the problem is still present