kubo icon indicating copy to clipboard operation
kubo copied to clipboard

fix(test): flaky t0182-circuit-relay.sh

Open lidel opened this issue 3 years ago • 11 comments

This PR removes fixup introduced in https://github.com/ipfs/go-ipfs/pull/8868#discussion_r860219918 so we can dig into the underlying cause.

About the flaky test

It started failing every 3-10 run after we switched to go-libp2p 0.19.0 in https://github.com/ipfs/go-ipfs/pull/8868 – tests in go-ipfs/test/sharness/t0182-circuit-relay.sh became flaky.

ipfsi 2 swarm connect /p2p/$PEERID_1/p2p-circuit/p2p/$PEERID_0 sometimes fails with:

NO_RESERVATION (204)
Error: connect 12D3KooWLuY1Q13o6Q91NoJn4Qv7RSq5rYS6uC4YCzKCJ3S8GnwQ failure: failed to dial 12D3KooWLuY1Q13o6Q91NoJn4Qv7RSq5rYS6uC4YCzKCJ3S8GnwQ:
  * [/p2p/12D3KooWNfs8uFpQf6NnseZKLyS3c8EFppGtUGejHpViBeg8f7Vt/p2p-circuit] error opening relay circuit: NO_RESERVATION (204)`
  • it happens randomly, bumping it to 10 seconds does not help much, it still fails sometimes.
  • shutting down nodes and starting them again fixes the problem (what is what removed code did)
    • ...but it feels like covering up some underlying racy bug: once NO_RESERVATION error is returned, the relay will never work, no matter how long we wait and retry again (only way to fix it is to reboot of the relay node / testbed).

Repro (keep running t0182-circuit-relay.sh until the error):

$ killall ipfs ; i=0; while ./t0182-circuit-relay.sh -v; do echo -n " -----> $i <----\n\n\n"; sleep 1; ((i=i+1)) ; done

cc @aschmahmann @marten-seemann @schomatis – any ideas how to debug this?

lidel avatar May 12 '22 12:05 lidel

Would be interesting to see why we don't get a reservation: GOLOG_LOG_LEVEL="autorelay=debug".

marten-seemann avatar May 12 '22 16:05 marten-seemann

2022-05-12 conversation: this may be relevant for the release if it means a new version of go-libp2p is needed. It may just be a problem with the tests though.

BigLep avatar May 12 '22 17:05 BigLep

Who's maintaining https://github.com/ipfs/iptb? cc @BigLep

schomatis avatar May 12 '22 23:05 schomatis

@schomatis afaik nobody – it was not touched since I remember (2018 or later)

lidel avatar May 16 '22 11:05 lidel

2022-05-26 conversation: we're not viewing this as release blocking for 0.13 because even if there are some issues with circuit relay v2, we at least have mostly working hole punching. 0.13 is better than before.

The ideal quick thing to do is to see if this is a new regression or long standing. We could use an older version and run Lidel's loop command to see if the flakiness manifests.

BigLep avatar May 26 '22 14:05 BigLep

The ideal quick thing to do is to see if this is a new regression or long standing. We could use an older version and run Lidel's loop command to see if the flakiness manifests.

@lidel seems to have already pinpoint this to https://github.com/ipfs/go-ipfs/pull/8868 per OP.

I can get the detailed logs for @marten-seemann early next week but I'm going to push for either us really owning IPTB or replacing it altogether (with testground or whatever we use these days).

schomatis avatar May 26 '22 23:05 schomatis

Sounds upstream tests are flaky as well https://github.com/libp2p/go-libp2p/pull/1576#issuecomment-1140254214

lidel avatar May 31 '22 15:05 lidel

This one-liner hack was the easiest I've found to get the logs of IPTB nodes.

After the failed sharness test the logs are in

NID=0
cat "./trash directory.t0182-circuit-relay.sh/.iptb/testbeds/default/$NID/daemon.stderr"

Note that a successful sharness test run removes this directory (so it's left only after the failed one, copy the log before re-running) and restarting the node (iptb start) within the sharness test also truncates the log file.

I'm seeing an error in node 0, but you'd likely need to extend the log subsystem to get more details:

DEBUG	autorelay	autorelay/relay_finder.go:224	node 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T not accepted as a candidate: error connecting to relay 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T: failed to dial 12D3KooWFMQMKKrDVsQ3Qwks615pZHiSsDkg57Pafsav53BxxX7T:
  * [/ip4/127.0.0.1/tcp/38569] dial tcp4 127.0.0.1:38569: connect: connection refused

schomatis avatar Jun 01 '22 00:06 schomatis

cc @lidel @marten-seemann

schomatis avatar Jun 01 '22 00:06 schomatis

2022-06-14 conversation: we're going to wait until the next go-libp2p release before investigating this further since there are changes coming.

BigLep avatar Jun 14 '22 16:06 BigLep

go-libp2p v0.21 landed in https://github.com/ipfs/kubo/pull/9074, rebasing this PR to see if the problem is still present

lidel avatar Sep 01 '22 13:09 lidel