go-libp2p

swarm: flaky TestDialSimultaneousJoin

Open marten-seemann opened this issue 3 years ago • 11 comments

=== RUN   TestDialSimultaneousJoin
      dial_test.go:578: third dial succedded; conn: <swarm.Conn[*tcp.TcpTransport] /ip4/127.0.0.1/tcp/50014 (12D3KooWRuYVGEsecrJJhZsSoKf1UNdBVYKFCmFLNj9ucZiSQCYj) <-> /ip4/127.0.0.1/tcp/50015 (12D3KooWGEcD5sW5osB6LajkHGqiGc3W8eKfYwnJVVqfujkpLWX2)>
      dial_test.go:560: second dial succedded; conn: <swarm.Conn[*tcp.TcpTransport] /ip4/127.0.0.1/tcp/50014 (12D3KooWRuYVGEsecrJJhZsSoKf1UNdBVYKFCmFLNj9ucZiSQCYj) <-> /ip4/127.0.0.1/tcp/50015 (12D3KooWGEcD5sW5osB6LajkHGqiGc3W8eKfYwnJVVqfujkpLWX2)>
      dial_test.go:[588](https://github.com/libp2p/go-libp2p/runs/6129949662?check_suite_focus=true#step:7:588): 
          	Error Trace:	dial_test.go:588
          	Error:      	Received unexpected error:
          	            	failed to dial 12D3KooWGEcD5sW5osB6LajkHGqiGc3W8eKfYwnJVVqfujkpLWX2:
          	            	  * [/ip4/127.0.0.1/tcp/50016] failed to negotiate security protocol: context deadline exceeded
          	Test:       	TestDialSimultaneousJoin
  --- FAIL: TestDialSimultaneousJoin (0.26s)

marten-seemann, Apr 22 '22 14:04

Assigning myself.

schomatis, Jun 23 '22 23:06

@vyzo Looking at the code related to TestDialSimultaneousJoin, is it correct that the line we're trying to trigger is:

https://github.com/libp2p/go-libp2p/blob/5eaa48fbab3bf4c669f747437ace19d0311b4c8e/p2p/net/swarm/dial_worker.go#L256-L258

schomatis, Jul 12 '22 15:07

I don't recall targeting a specific line, just making sure we have a test for joined dials.

vyzo, Jul 12 '22 15:07

@vyzo Ok, could you point me to 'joined dials' in the code so I can better understand what we are trying to test, please?

schomatis, Jul 12 '22 15:07

And particularly how we are enforcing (or approaching) the "simultaneous" part of the test.

schomatis, Jul 12 '22 15:07

It's the invariant that two concurrent dials to the same addresses are joined.

vyzo, Jul 12 '22 15:07
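
To make the invariant concrete, here is a minimal sketch of dial joining. It assumes a hypothetical `dialJoiner` type that keeps at most one in-flight dial per peer; it is not the actual go-libp2p swarm implementation, and the names `pendingDial`, `Joined` and `countUnderlyingDials` are invented for illustration. Connections are represented by plain strings.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// pendingDial is one in-flight dial that later callers can join.
type pendingDial struct {
	done chan struct{} // closed once conn and err are set
	conn string        // stand-in for a real connection
	err  error
}

// dialJoiner is a hypothetical illustration, not the go-libp2p swarm API.
// It keeps at most one in-flight dial per peer.
type dialJoiner struct {
	mu      sync.Mutex
	pending map[string]*pendingDial
	joined  int // number of calls that joined an existing dial
	dial    func(peer string) (string, error)
}

// Dial joins the in-flight dial for peer if there is one, else starts one.
func (j *dialJoiner) Dial(peer string) (string, error) {
	j.mu.Lock()
	if p, ok := j.pending[peer]; ok {
		j.joined++
		j.mu.Unlock()
		<-p.done // share the outcome of the dial already in flight
		return p.conn, p.err
	}
	p := &pendingDial{done: make(chan struct{})}
	j.pending[peer] = p
	j.mu.Unlock()

	p.conn, p.err = j.dial(peer)

	j.mu.Lock()
	delete(j.pending, peer)
	j.mu.Unlock()
	close(p.done)
	return p.conn, p.err
}

// Joined reports how many calls have joined an existing dial so far.
func (j *dialJoiner) Joined() int {
	j.mu.Lock()
	defer j.mu.Unlock()
	return j.joined
}

// countUnderlyingDials issues n concurrent Dial calls for one peer and
// reports how many times the underlying dial function actually ran.
func countUnderlyingDials(n int) int {
	started := make(chan struct{}, n)
	release := make(chan struct{})
	var calls int32
	j := &dialJoiner{pending: make(map[string]*pendingDial)}
	j.dial = func(peer string) (string, error) {
		atomic.AddInt32(&calls, 1)
		started <- struct{}{}
		<-release // hold the dial open so the other callers can join it
		return "conn-to-" + peer, nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			j.Dial("peerA")
		}()
		if i == 0 {
			<-started // make sure the first dial is in flight before the rest
		}
	}
	for j.Joined() < n-1 {
		time.Sleep(time.Millisecond) // wait until every later call has joined
	}
	close(release)
	wg.Wait()
	return int(atomic.LoadInt32(&calls))
}

func main() {
	fmt.Println("underlying dials:", countUnderlyingDials(3))
}
```

The sketch only counts as "concurrent" those calls that arrive while the first dial is still in flight, which is why the demo holds the first dial open until the others have joined.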

Ok, but how do you define concurrent in practice?

schomatis, Jul 12 '22 15:07

What I'm seeing here is the first dial timing out before the second one has a chance to hit, and I'm trying to figure out how to better guarantee that simultaneity.

schomatis, Jul 12 '22 15:07

> It's the invariant that two concurrent dials to the same addresses are joined.

This also extends to 'same peer', right? (This might be implicit in what you just stated; just double-checking because I'm new to libp2p.)

schomatis, Jul 12 '22 15:07

> This also extends to 'same peer', right? (This might be implicit in what you just stated; just double-checking because I'm new to libp2p.)

Yes, of course; the dials are peer-specific.

vyzo, Jul 12 '22 15:07

> What I'm seeing here is the first dial timing out before the second one has a chance to hit, and I'm trying to figure out how to better guarantee that simultaneity.

Uhm, maybe somehow delay the first dial until the second one happens (with a channel probably). Might need to add some test scaffolding in the code.

vyzo, Jul 12 '22 15:07
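
The channel idea suggested above could look roughly like the following sketch. It assumes a hypothetical `dialGate` helper that a test would wire into the dial path; nothing like it is confirmed to exist in go-libp2p, and `runGatedDials` with its fake dials is invented for illustration. Each dial announces itself on one channel and then blocks until the test closes a second channel or the dial's context ends.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// dialGate is hypothetical test scaffolding, not part of go-libp2p.
// Every dial that reaches the gate announces itself on entered and then
// blocks until the test closes release or the dial's context ends.
type dialGate struct {
	entered chan struct{}
	release chan struct{}
}

func newDialGate(maxDials int) *dialGate {
	return &dialGate{
		entered: make(chan struct{}, maxDials), // buffered so announcing never blocks
		release: make(chan struct{}),
	}
}

// wait parks the calling dial at the gate.
func (g *dialGate) wait(ctx context.Context) error {
	g.entered <- struct{}{}
	select {
	case <-g.release:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runGatedDials starts n fake dials that all pass through one gate. With
// open set, the gate is released only after all n dials have reached it, so
// they are guaranteed to overlap. Without it, every dial hits its deadline.
func runGatedDials(n int, open bool) (succeeded, timedOut int) {
	timeout := 5 * time.Second // generous, so a slow machine does not expire early
	if !open {
		timeout = 20 * time.Millisecond
	}
	gate := newDialGate(n)
	errs := make(chan error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			errs <- gate.wait(ctx)
		}()
	}
	if open {
		for i := 0; i < n; i++ {
			<-gate.entered // wait until every dial is parked at the gate
		}
		close(gate.release)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		switch {
		case err == nil:
			succeeded++
		case errors.Is(err, context.DeadlineExceeded):
			timedOut++
		}
	}
	return succeeded, timedOut
}

func main() {
	ok, late := runGatedDials(2, true)
	fmt.Println("released gate:", ok, "succeeded,", late, "timed out")
	ok, late = runGatedDials(2, false)
	fmt.Println("closed gate:", ok, "succeeded,", late, "timed out")
}
```

Waiting on explicit signals instead of sleeping removes the race between the first dial's deadline and the arrival of the second dial, at the cost of adding a test-only hook to the dial path.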