flaky TestGlobalPreferenceV4 on Ubuntu

Open marten-seemann opened this issue 3 years ago • 5 comments

https://github.com/libp2p/go-libp2p/runs/7028901565?check_suite_focus=true

  === RUN   TestGlobalPreferenceV4
      transport_test.go:140: when listening on /ip4/127.0.0.1/tcp/0, should prefer /ip4/127.0.0.1/tcp/0 over /ip4/10.1.1.88/tcp/0
      transport_test.go:142: when listening on /ip4/127.0.0.1/tcp/0, should prefer /ip4/0.0.0.0/tcp/0 over /ip4/10.1.1.88/tcp/0
      transport_test.go:180: dialed /ip4/127.0.0.1/tcp/46587 from 127.0.0.1:60340. expected to dial from port [45733]
      transport_test.go:193: dialed /ip4/127.0.0.1/tcp/46587 from 127.0.0.1:60344. expected to dial from port [45733]
      transport_test.go:145: when listening on /ip4/10.1.1.88/tcp/0, should prefer /ip4/0.0.0.0/tcp/0 over /ip4/127.0.0.1/tcp/0

marten-seemann avatar Jun 23 '22 18:06 marten-seemann

Assigning myself.

schomatis avatar Jun 23 '22 23:06 schomatis

Working on this.

schomatis avatar Jul 14 '22 23:07 schomatis

We're hitting the global/fallback dialer

https://github.com/libp2p/go-libp2p/blob/7facd81bba8889ee33ac6b4235cb569e652e32eb/p2p/net/reuseport/reuseport.go#L32

which doesn't have the reuseport settings that the local dialer variable does:

https://github.com/libp2p/go-libp2p/blob/7facd81bba8889ee33ac6b4235cb569e652e32eb/p2p/net/reuseport/reuseport.go#L18-L21

I'm thinking we can either:

  1. Retry using another local/ephemeral Dialer.
  2. Keep using the fallbackDialer but with the Control set (and maybe unset it later if needed; I'm not sure how concurrency is handled here).

@Stebalien Could you expand on the rationale for the fallbackDialer added in https://github.com/libp2p/go-libp2p/commit/0b6d56b5e46bb74c8a8c63e9f36f295c3474ddcf, please? That would give me more context to know which way to go.

schomatis avatar Jul 19 '22 14:07 schomatis

We hit the fallback dialer if dialing with reuseport fails for some reason. We do this because reuseport doesn't always work:

  • It has some issues on some operating systems.
  • If there was a previous connection that's now in the TIME_WAIT state, re-creating the same connection with the same source port will fail (so we'll use a random source port with the fallback dialer).
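
For readers following along, the retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in reuseport.go: the names dialWithFallback and demo are made up for the example, and the sketch leaves out the reuseport Control hook, so binding to a port that already has a listener is expected to fail and trigger the fallback.

```go
package main

import (
	"context"
	"fmt"
	"net"
)

// dialWithFallback (hypothetical name) first tries to dial from laddr.
// If that attempt fails and the context is still live, it retries with no
// local address so the OS picks a random source port. The bool result
// reports whether the fallback path was taken.
func dialWithFallback(ctx context.Context, laddr *net.TCPAddr, raddr string) (net.Conn, bool, error) {
	if laddr != nil {
		// Assumption: the real dialer also sets a Control hook that enables
		// port reuse; it is omitted here to keep the sketch stdlib-only.
		d := net.Dialer{LocalAddr: laddr}
		conn, err := d.DialContext(ctx, "tcp4", raddr)
		if err == nil {
			return conn, false, nil
		}
		if ctx.Err() != nil {
			// The caller gave up; do not retry.
			return nil, false, err
		}
	}
	// Fallback: no local address, no reuse settings.
	var fallback net.Dialer
	conn, err := fallback.DialContext(ctx, "tcp4", raddr)
	return conn, true, err
}

// demo occupies a port with a plain listener, then asks dialWithFallback to
// dial from that same port. It returns whether the fallback was used and
// whether the connection still came from the requested port.
func demo() (bool, bool, error) {
	target, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	defer target.Close()

	busy, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	defer busy.Close()
	laddr := busy.Addr().(*net.TCPAddr)

	conn, usedFallback, err := dialWithFallback(context.Background(), laddr, target.Addr().String())
	if err != nil {
		return usedFallback, false, err
	}
	defer conn.Close()

	src := conn.LocalAddr().(*net.TCPAddr)
	return usedFallback, src.Port == laddr.Port, nil
}

func main() {
	used, same, err := demo()
	fmt.Println("fallback used:", used, "same port:", same, "err:", err)
}
```

Run as-is, this should report that the fallback was used and that the source port differs from the listen port, which is the same symptom as the "expected to dial from port" lines in the failing test.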

The first thing to investigate would be why the first dial is failing.

Stebalien avatar Aug 01 '22 17:08 Stebalien

The first thing to investigate would be why the first dial is failing.

Maybe the other side is dialing us automatically for some reason? Or it could be some form of spurious failure.

Stebalien avatar Aug 01 '22 17:08 Stebalien

https://github.com/libp2p/go-libp2p/blob/7facd81bba8889ee33ac6b4235cb569e652e32eb/p2p/net/reuseport/transport_test.go#L165-L180

What I'm seeing locally on my 5.4.0-99-generic kernel (which may not be the same issue as on the CI server) is that the random port assigned to listener B, which in this test is bound to the interface address, can be a port already used by the system, for example, on my host, the containerd (root) process port: 192.168.26.140:42315. When we then dial with that transport as the source, we try to bind to the same port but with the all-zeros address (0.0.0.0:42315), and the bind fails with EADDRINUSE.

From my limited understanding of address/port reuse rules this isn't what I would have expected, so I might be misinterpreting some part of the failed test. Will double check to confirm the above.

@Stebalien Does any of this make sense to you?

schomatis avatar Aug 11 '22 01:08 schomatis

There shouldn't be a conflict between the host and the container. However, there may be a conflict between 127.0.0.1 and the external address?

Are you sure the container isn't just auto-forwarding ports?

Stebalien avatar Aug 11 '22 01:08 Stebalien

@Stebalien Sorry, containerd was just a random example of an open port I'm hitting in my local setup, but this can happen with any open port (in a certain range, I imagine). Forgetting containers altogether, this is what I'm seeing (for a random port in the normal range, say, 41071):

Everything works if the port is not bound by the root process.

If the root process is bound to the wildcard address 0.0.0.0:41071 (sudo nc -l 41071), the listen just fails:

listen tcp4 192.168.26.140:41071: bind: address already in use

If the root process is bound to localhost 127.0.0.1:41071 (sudo nc -l 127.0.0.1 41071), the listen succeeds, but the dial from that port (which is done from the 0.0.0.0 address) then fails to bind:

tcp4 0.0.0.0:41071->127.0.0.1:46327: bind: address already in use

I tested this by forcing the listener onto port 41071, but if I leave the port as 0 (random), with enough tries I can see that it gets assigned 41071. (The listener in this test uses the interface address, here 192.168.26.140.)
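
The second case can be approximated in a few lines of Go. This is only a sketch under assumptions: a Go listener on 127.0.0.1 stands in for the nc process, the reuseport socket options are left out entirely, and the function names are invented for the example. It therefore shows just the conflict between a wildcard bind and an existing specific-address socket on the same port, which may or may not be what the CI run hit.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// isAddrInUse reports whether err wraps EADDRINUSE.
func isAddrInUse(err error) bool {
	return errors.Is(err, syscall.EADDRINUSE)
}

// reproduce (hypothetical name) listens on 127.0.0.1:<random port>, standing
// in for the other process, then dials a second listener from
// 0.0.0.0:<same port>. It returns the dial error, if any.
func reproduce() error {
	other, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer other.Close()
	port := other.Addr().(*net.TCPAddr).Port

	target, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer target.Close()

	// Bind the outgoing socket to the all-zeros address with the port that
	// is already taken on 127.0.0.1.
	d := net.Dialer{LocalAddr: &net.TCPAddr{IP: net.IPv4zero, Port: port}}
	conn, err := d.Dial("tcp4", target.Addr().String())
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	err := reproduce()
	fmt.Println("dial error:", err)
	fmt.Println("address already in use:", isAddrInUse(err))
}
```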

schomatis avatar Aug 12 '22 00:08 schomatis

Yeah, that makes sense. So this looks like an actual bug. Basically, we shouldn't do:

https://github.com/libp2p/go-libp2p/blob/68722aa1e9c31d61cf3e4e1dc9fdfa2e578e9ae4/p2p/net/reuseport/dial.go#L106-L112

or:

https://github.com/libp2p/go-libp2p/blob/68722aa1e9c31d61cf3e4e1dc9fdfa2e578e9ae4/p2p/net/reuseport/multidialer.go#L63

Instead, if we "don't know", we should just not set a local address.
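
To make the idea concrete, here is a rough sketch of a source-address picker that returns nil when there is no clear match. It is not the code from the patch: pickLocalAddr, chooseSource, and the matching rules (a loopback listener for loopback remotes, otherwise a wildcard listener, otherwise nothing) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// pickLocalAddr (hypothetical) returns the listen address to dial from, or
// nil when no listener is an obvious match for the remote. A nil result
// means "don't set a local address" and lets the OS choose.
func pickLocalAddr(listeners []*net.TCPAddr, remote net.IP) *net.TCPAddr {
	// Loopback remotes are best served by a loopback listener.
	if remote.IsLoopback() {
		for _, l := range listeners {
			if l.IP.IsLoopback() {
				return l
			}
		}
	}
	// A listener that is itself bound to the wildcard address is safe to
	// reuse, since the dial binds to exactly the same address and port.
	for _, l := range listeners {
		if l.IP.IsUnspecified() {
			return l
		}
	}
	// "Don't know": do not synthesize 0.0.0.0:<port> from a listener that
	// is bound to a specific address.
	return nil
}

// mustTCP parses "ip:port" and panics on malformed input (example only).
func mustTCP(s string) *net.TCPAddr {
	a, err := net.ResolveTCPAddr("tcp4", s)
	if err != nil {
		panic(err)
	}
	return a
}

// chooseSource is a string-based wrapper around pickLocalAddr. It returns
// "" when no local address should be set.
func chooseSource(listen []string, remote string) string {
	var addrs []*net.TCPAddr
	for _, s := range listen {
		addrs = append(addrs, mustTCP(s))
	}
	if l := pickLocalAddr(addrs, net.ParseIP(remote)); l != nil {
		return l.String()
	}
	return ""
}

func main() {
	// Listener bound only to the interface address, remote on loopback:
	// no local address is chosen.
	fmt.Printf("%q\n", chooseSource([]string{"192.168.26.140:41071"}, "127.0.0.1"))
	// A loopback listener is preferred for a loopback remote.
	fmt.Printf("%q\n", chooseSource([]string{"127.0.0.1:4001", "0.0.0.0:4002"}, "127.0.0.1"))
}
```

With a rule like this, the 192.168.26.140:41071 listener from the reproduction would never lead to a bind on 0.0.0.0:41071.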

I've filed a patch here because, well, I wrote this code and it was... overcomplicated.

Stebalien avatar Aug 12 '22 01:08 Stebalien

https://github.com/libp2p/go-libp2p/pull/1673

Stebalien avatar Aug 12 '22 01:08 Stebalien