rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

Address translation doesn't work correctly with non-reused ephemeral ports

Open dmitry-markin opened this issue 11 months ago • 7 comments

Summary

In libp2p-0.54.1 when handle_established_outbound_connection() of Identify protocol is called it seems port_use argument is not set to PortUse::New at least in some cases when ephemeral port was actually used.

This is what comment in ConnectedPoint::Dialer says:

/// The port use is implemented on a best-effort basis. It is not guaranteed
/// that [`PortUse::Reuse`] actually reused a port. A good example is the case
/// where there is no listener available to reuse a port from.

This leads to Identify mistakenly thinking that we reused the port, not taking this branch in emit_new_external_addr_candidate_event() and emitting such ephemeral address without translation in ToSwarm::NewExternalAddrCandidate(observed.clone()).

This issue was originally reported in https://github.com/paritytech/polkadot-sdk/issues/7207. It leads to ephemeral addresses being discovered as external in polkadot. For the context, polkadot does not use AutoNAT protocol and eagerly confirms all external address candidates provided by Identify behavior.

To obtain the logs below additional logging was added to Identify: https://github.com/dmitry-markin/rust-libp2p/commit/420567526b350033fcfb300ca61856cfb279a07a

Expected behavior

Only listen addresses reported by Identify for connections with actually reused port are emitted without address translation.

Actual behavior

Ephemeral addresses with not reused ports are emitted without address translation.

Relevant log output

>>> skipping translation for port-reuse address "/ip4/79.142.83.14/tcp/47936"
🔍 Discovered new external address for our node: /ip4/79.142.83.14/tcp/47936/p2p/12D3KooW9tGYAQmDAVUxsHMK9Xur3bkyNNiy7Mmv8uUh7UCzRFw2
>>> skipping translation for port-reuse address "/ip4/79.142.83.14/tcp/40506"
🔍 Discovered new external address for our node: /ip4/79.142.83.14/tcp/40506/p2p/12D3KooW9tGYAQmDAVUxsHMK9Xur3bkyNNiy7Mmv8uUh7UCzRFw2
>>> skipping translation for port-reuse address "/ip4/79.142.83.14/tcp/45810"
🔍 Discovered new external address for our node: /ip4/79.142.83.14/tcp/45810/p2p/12D3KooW9tGYAQmDAVUxsHMK9Xur3bkyNNiy7Mmv8uUh7UCzRFw2
>>> skipping translation for port-reuse address "/ip4/79.142.83.14/tcp/36244"
🔍 Discovered new external address for our node: /ip4/79.142.83.14/tcp/36244/p2p/12D3KooW9tGYAQmDAVUxsHMK9Xur3bkyNNiy7Mmv8uUh7UCzRFw2
...
(and so on)

Possible Solution

  1. Update Transport::Dial to yield both Transport::Output and PortUse.
  2. Update all transports' dial() to return the actual PortUse of the outbound socket created.
  3. Update pool::task::PendingConnectionEvent::ConnectionEstablished to include PortUse; update pool::PoolEvent to yield correct PortUse in ConnectedPoint.
  4. Update all affected types. This includes transport composition types like AndThenFuture and the such.

Version

0.54.1

Would you like to work on fixing this bug?

Maybe

dmitry-markin avatar Jan 17 '25 18:01 dmitry-markin

Additional context for why the port is not reused — this happens due to a secondary connection being established to the same host:port:

TRACE tokio-runtime-worker libp2p_tcp: Binding dial socket to listen socket address address=/ip4/64.62.224.10/tcp/30333/p2p/12D3KooWEjk6QXrZJ26fLpaajisJGHiz6WiQsR8k7mkM9GmWKnRZ
DEBUG tokio-runtime-worker libp2p_tcp: Failed to connect using existing socket because we already have a connection, re-dialing with new port connect_addr=64.62.224.10:30333 bind_addr=0.0.0.0:12345

This happens due to concurrent dial and some nodes providing multiple addresses (e.g. DNS + IP) that resolve to the same IP:port.

dmitry-markin avatar Jan 20 '25 12:01 dmitry-markin

This issue is also triggered without a secondary connection (concurrent dial disabled) to the same IP:port when listen address is not specified for some transport.

For example, if the node listens on a TCP address(e.g. /ip4/79.142.83.14/tcp/30333), but not on a WS address (e.g. /ip4/79.142.83.14/tcp/30334/ws), the TCP port reasonably won't be reused for WS remote addresses and ephemeral ports will be used instead.

dmitry-markin avatar Jan 21 '25 12:01 dmitry-markin

Thank you for the detailed bug report and for elaborating a possible solution! Sorry for the late reply.

Possible Solution

  1. Update Transport::Dial to yield both Transport::Output and PortUse.
  2. Update all transports' dial() to return the actual PortUse of the outbound socket created.
  3. Update pool::task::PendingConnectionEvent::ConnectionEstablished to include PortUse; update pool::PoolEvent to yield correct PortUse in ConnectedPoint.
  4. Update all affected types. This includes transport composition types like AndThenFuture and the such.

Yes that would solve it I think. However, it would require us to touch libp2p-core, all transport protocols, and the swarm. As you already noted in https://github.com/paritytech/polkadot-sdk/issues/7207#issuecomment-2604804377, it's quite a bit of effort.

I was wondering if we could instead solve the issue in the identify protocol. From a practical perspective, we need to know if a used port was ephemeral or re-used because we need to know if one of our listening ports was reused (which then suggest that address + port of that listener is externally reachable), right? We already track all current listening addresses in identify. Can't we just check in case of PortUse::Reuse that the port on that connection actually matches one of our current listening ports?

I am not super familiar with the port-reuse logic in different transports, so apologies if I am missing something very obvious.

elenaf9 avatar Feb 05 '25 13:02 elenaf9

@elenaf9 Hello, could you please take a look on my PR regarding this issue? https://github.com/libp2p/rust-libp2p/pull/6096

ErakhtinB avatar Jul 17 '25 05:07 ErakhtinB

@dmitry-markin @elenaf9 I've left a draft PR that tries to tackle this. So far it's working, but I would love to add more unit test to confirm this new flow.

@dmitry-markin thank you so much for your insights, this issue was a huge deal for quite some time now

momoshell avatar Oct 23 '25 19:10 momoshell

@dmitry-markin @elenaf9 I've left a draft PR that tries to tackle this. So far it's working, but I would love to add more unit test to confirm this new flow.

@dmitry-markin thank you so much for your insights, this issue was a huge deal for quite some time now

Did you have a look at #6096 if that would also fix your issue? Again, it would be great if we could void breaking the API of Transport if the issue can also be solved within identify.

elenaf9 avatar Oct 24 '25 06:10 elenaf9

Hey, yes, I saw it, and @ErakhtinB, thank you for this fix.

@elenaf9 While #6096 is an elegant, low-risk patch that resolves the immediate Identify symptoms—and could be merged quickly as a stopgap—it's essentially a band-aid on a systemic detection flaw. And, its scope is only limited to Identify; it doesn't address why other behaviors experience issues.

It's dealing with the symptom; in this case, it's the inappropriate address translation, and not the root cause.

The underlying issue is that the Identify protocol, or any other for that matter, receives misleading information from the transport layer. When an outbound connection is attempted with port reuse intent [PortUse::Reuse] but fails and falls back to an ephemeral port, the system can't correctly report what happened with the port. This causes Identify to incorrectly publish the ephemeral address as an external address.

Transport layer API gives inaccurate information about PortUse policy, because it can only reflect intent, not the outcome.

The #6187 wasn't intended to be merged ASAP; rather, I was hoping to work with you guys on this one, where the goal would be to achieve proper information on what happened with established connection endpoints.

The #6096 can't help with creating structured, composed, or rather new custom project-based NetworkBehaviour implementations. And certainly can't help in situations like dealing with protocol handler specific: HandlerEvent

momoshell avatar Oct 26 '25 10:10 momoshell