go-libp2p
go-libp2p copied to clipboard
flaky TestGlobalPreferenceV4 on Ubuntu
https://github.com/libp2p/go-libp2p/runs/7028901565?check_suite_focus=true
=== RUN TestGlobalPreferenceV4
transport_test.go:140: when listening on /ip4/127.0.0.1/tcp/0, should prefer /ip4/127.0.0.1/tcp/0 over /ip4/10.1.1.88/tcp/0
transport_test.go:142: when listening on /ip4/127.0.0.1/tcp/0, should prefer /ip4/0.0.0.0/tcp/0 over /ip4/10.1.1.88/tcp/0
transport_test.go:180: dialed /ip4/127.0.0.1/tcp/46587 from 127.0.0.1:60340. expected to dial from port [45733]
transport_test.go:193: dialed /ip4/127.0.0.1/tcp/46587 from 127.0.0.1:60344. expected to dial from port [45733]
transport_test.go:145: when listening on /ip4/10.1.1.88/tcp/0, should prefer /ip4/0.0.0.0/tcp/0 over /ip4/127.0.0.1/tcp/0
Assigning myself.
Working on this.
We're hitting the global/fallback dialer
https://github.com/libp2p/go-libp2p/blob/7facd81bba8889ee33ac6b4235cb569e652e32eb/p2p/net/reuseport/reuseport.go#L32
which doesn't have the reuse-port settings like the local variable
https://github.com/libp2p/go-libp2p/blob/7facd81bba8889ee33ac6b4235cb569e652e32eb/p2p/net/reuseport/reuseport.go#L18-L21
I'm thinking we can either:
- Retry using another local/ephemeral
Dialer. - Keep using the
fallbackDialerbut with theControlset (and maybe unset later if needed; not sure how's the concurrency here).
@Stebalien Could you expand on the rationale of the fallbackDialer added in https://github.com/libp2p/go-libp2p/commit/0b6d56b5e46bb74c8a8c63e9f36f295c3474ddcf, please? That would give me more context to know which way to go.
We hit the fallback dialer if dialing with reuseport fails for some reason. We do this because reuseport doesn't always work:
- It has some issues on some operating systems.
- If there was a previous connection that's now in the TIME_WAIT state, re-creating the same connection with the same source port will fail (so we'll use a random source port with the fallback dialer).
The first thing to investigate would be why the first dial is failing.
The first thing to investigate would be why the first dial is failing.
Maybe the other side is dialing us automatically for some reason? Or it could be some form of spurious failure.
https://github.com/libp2p/go-libp2p/blob/7facd81bba8889ee33ac6b4235cb569e652e32eb/p2p/net/reuseport/transport_test.go#L165-L180
What I'm seeing locally on my 5.4.0-99-generic (which may not be the same issue as on the CI server) is the random port assigned to listener B with (in the case of this test) the interface address can be a port already used by the system, say, in my host, the containerd (root) process port: 192.168.26.140:42315. Then when dialing from that transport as the source, trying to bind to the same port, but now with the all-zeros address (0.0.0.0:42315), it fails with EADDRINUSE.
From my limited understanding of address/port reuse rules this isn't what I would have expected, so I might be misinterpreting some part of the failed test. Will double check to confirm the above.
@Stebalien Does any of this make sense to you?
There shouldn't be a conflict between the host and the container. However, there may be a conflict between 127.0.0.1 and the external address?
Are you sure the container isn't just auto-forwarding ports?
@Stebalien Sorry, the containerd was just a random example of an open port I'm hitting in my local setting, but this can happen with any open port (in a certain range I imagine). Forgetting containers altogether, this is what I'm seeing (for a random port in the normal range, say, 41071):
Everything works if the port is not bound by the root process.
Bind to wildcard 0.0.0.0:41071 (sudo nc -l 41071) and the listen just fails:
listen tcp4 192.168.26.140:41071: bind: address already in use
Bind to localhost 127.0.0.1:41071 (sudo nc -l 127.0.0.1 41071) and the listen succeeds but then the dial from that port (which is done from the 0.0.0.0 address) will fail to bind:
tcp4 0.0.0.0:41071->127.0.0.1:46327: bind: address already in use
This was tested forcing the listener on the 41071 port, but if I leave the random 0, with enough tries, I can see that it gets assigned 41071. (The listener in this test uses the interface address, here 192.168.26.140.)
Yeah, that makes sense. So this looks like an actual bug. Basically, we shouldn't do:
https://github.com/libp2p/go-libp2p/blob/68722aa1e9c31d61cf3e4e1dc9fdfa2e578e9ae4/p2p/net/reuseport/dial.go#L106-L112
or:
https://github.com/libp2p/go-libp2p/blob/68722aa1e9c31d61cf3e4e1dc9fdfa2e578e9ae4/p2p/net/reuseport/multidialer.go#L63
Instead, if we "don't know", we should just not set a local address.
I've filed a patch here because, well, I wrote this code and it was... overcomplicated.
https://github.com/libp2p/go-libp2p/pull/1673