fix(quic): improve listener socket selection when dialing
Problem
Say the local node listens on /ip4/0.0.0.0/udp/0/quic-v1. One would expect a consecutive dial to /ip4/127.0.0.1/udp/42/quic-v1 to use one of the listen addresses as the source address of the dial.
That is not the case today. Instead libp2p-quic will create a new quinn endpoint and thus a new UDP socket for the outgoing dial.
Why?
In TCP we track the specific listen addresses, even if the user calls listen_on with an unspecified address (e.g. 0.0.0.0). If the user calls listen_on with a specific IP address i.e. not a wildcard like 0.0.0.0 but e.g. 127.0.0.1 we track (here register) the specific address right away.
https://github.com/libp2p/rust-libp2p/blob/f10f1a274abe6e5193de653aa4dd951ae56c775e/transports/tcp/src/lib.rs#L394
In case where the user calls listen_on with an unspecified address (e.g. 0.0.0.0) we depend on if-watch, more specifically poll_if_watch to provide us with the concrete addresses:
https://github.com/libp2p/rust-libp2p/blob/f10f1a274abe6e5193de653aa4dd951ae56c775e/transports/tcp/src/lib.rs#L666-L682
Either way, we end up tracking specific listen addresses with port-reuse enabled and can thus choose the right listener in local_dial_addr:
https://github.com/libp2p/rust-libp2p/blob/f10f1a274abe6e5193de653aa4dd951ae56c775e/transports/tcp/src/lib.rs#L122-L151
In libp2p-quic we only track the initial address that the user provided on the listen_on call.
https://github.com/libp2p/rust-libp2p/blob/f10f1a274abe6e5193de653aa4dd951ae56c775e/transports/quic/src/endpoint.rs#L170-L178
This might be a wildcard address. Later on, when dialing a 127.0.0.1 address, the check listen_addr.ip().is_loopback() == socket_addr.ip().is_loopback() discards the listener with the unspecified listen address, thus a new dial-only socket is used, even though one of the existing listener sockets is available.
This is e.g. problematic when trying to dial a node on 127.0.0.1 with the expectation that the dial uses one of the listen addresses as a source address.
How does go-libp2p solve this:
go-libp2p asks the OS for the correct interface given a destination address. It uses Google's routing package. See https://github.com/libp2p/go-libp2p/blob/260b9695cafdd8e35ec65b30ef153f0c15549c72/p2p/transport/quicreuse/reuse.go#L209 for concrete usage and implementation.
How to move forward?
I am not sure simply removing the listen_addr.ip().is_loopback() == socket_addr.ip().is_loopback() check is the way to go. This check is still helpful. E.g. say we have two listeners, one on localhost, one not on localhost. Ideally we would use the former when dialing.
We could mirror what libp2p-tcp does. That is, track the specified listen addresses reported by if-watch. Later, when dialing, choose the listener with the right address.
Ideally I would like to have the operating system choose. It has the most information about all available routes to each interface. In other words ideally we would do what go-libp2p does.
For now, I suggest we go with the libp2p-tcp approach. A second iteration would mirror what go-libp2p does.
Originally posted by @mxinden in https://github.com/libp2p/rust-libp2p/pull/3454#discussion_r1276338263
Thank you @kpp for surfacing this bug.
Thank you @marten-seemann for the quick help, pointing me to how go-libp2p solves this issue.
I would like to work on this but I have a question. If using the same approach as in libp2p-tcp, should we also have a configuration related to port_reuse: bool (false by default) as in libp2p::tcp::Config, or make it true by default so reuse an existing listener?
Great to hear that you are interested.
https://github.com/libp2p/rust-libp2p/issues/4226
port_reuse in libp2p-quic will be introduced with https://github.com/libp2p/rust-libp2p/issues/4226. Let's not do it as part of this issue. In other words, let's always use a listener if available for now.
Let me know if you have any other questions @arsenron.
@mxinden Could you please help me with a question?
Could we make it as simple as adding a separate field to a Listener something like is_loopback bool. So when we add a new listener we check either ip is loopback and set it to true, or later if IfWatcher returns us a new loopback address, we also set it to true. Then in case a remote address is also loopback, we just iterate listeners to find a listener which supports loopback, if any. Or am I missing something?
Thank you @arsenron for the fix.
I will keep this issue open given that there is still the more sophisticated solution that go-libp2p follows.
How does go-libp2p solve this:
go-libp2p asks the OS for the correct interface given a destination address. It uses Google's routing package. See https://github.com/libp2p/go-libp2p/blob/260b9695cafdd8e35ec65b30ef153f0c15549c72/p2p/transport/quicreuse/reuse.go#L209 for concrete usage and implementation.
@arsenron let me know in case you want to contribute more to the project. The above more sophisticated solution might be an option. Or any of the other help-wanted issues.
https://github.com/libp2p/rust-libp2p/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22
hey guys, would you mind if I try to take a look at this one now?
'Cause PortUse was added after this was made.
Hi @momoshell are you still interested in this? If so feel free to take it
@jxs yes, and thanks, will give it my best spin.
I now realise that this issue is also affected by the port-reuse policies. It might be a good opportunity to bring some attention to this one PR:6187.
I believe it would be beneficial to address both of these in parallel.
@jxs #6224 is up, but we have those failing checks now! :(
I would also like to mention that I completely agree with @mxinden observation that the reading of OS routing tables is the proper way to go. Sadly, today net-route crate only supports Linux.
We could introduce the next phase as a separate PR, where we optimize for Linux machines, and we fallback to the PortReuse registry, until net-route adds support for all major OSs.
If you guys agree, I would gladly implement that optimization