`protocols/mdns` infinite loop?
Summary
It seems there is a busy loop which never terminates when the process is being shut down. I observe the problem in a polkadot node.
Expected behaviour
The loop should terminate when it encounters erroneous conditions.
Actual behaviour
With RUST_LOG=trace, I see logs flooding the terminal. Parts of the log are here. (However, the problem is not strictly related to this PR; it can also be reproduced by pressing Ctrl-C during startup.)
A single log iteration looks as follows:
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling Connection
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: reading incoming messages
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_read_messages called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling socket
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::codecs: NetlinkCodec: decoding next message
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::codecs: buffer is empty
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_sys::tokio: poll_recv_from called
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker netlink_proto::framed: failed to read from netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
2023-01-25 12:51:28.215 WARN tokio-runtime-worker netlink_proto::connection: netlink socket stream shut down
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwarding unsolicited messages to the connection handle
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages done
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwaring responses to previous requests to the connection handle
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses done
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: handling requests
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_requests called
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: sending messages
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages done
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_flush called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::framed: flushing frame; length=24
2023-01-25 12:51:28.215 WARN tokio-runtime-worker netlink_proto::connection: error flushing netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: done polling Connection
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker libp2p_mdns::behaviour: if watch returned an error: rtnetlink socket closed
I believe the problematic busy loop is here.
When Tokio is being shut down, rtnetlink provides a stream of errors, and the loop never terminates.
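For illustration, a hypothetical reduction of that pattern (not the actual `libp2p-mdns` code; `drain_if_events` is a made-up name): a poll loop that only logs errors and immediately polls again never makes progress and never really yields if the source fails permanently.

```rust
// Hypothetical reduction of the pattern (not the actual libp2p-mdns code):
// if the watcher stream keeps yielding `Err` and never returns `None`,
// this loop never breaks, because the stream is always "ready" with
// another error.
use futures::{Stream, StreamExt};

async fn drain_if_events<S, T, E>(mut watcher: S)
where
    S: Stream<Item = Result<T, E>> + Unpin,
    E: std::fmt::Display,
{
    loop {
        match watcher.next().await {
            Some(Ok(_event)) => { /* handle the interface change */ }
            Some(Err(e)) => {
                // Only logging here means we immediately poll the stream
                // again; a permanently failing source turns this into a
                // busy loop.
                log::error!("if watch returned an error: {}", e);
            }
            None => break, // never reached if the stream never ends
        }
    }
}
```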
Version
- libp2p version (version number, commit, or branch): 0.50.0
Would you like to work on fixing this bug?
Maybe.
This sounds like we need to patch if-watch.
From an interface perspective, a stream ends when it returns None. It is not wrong per se for a stream to return Result::Err and continue emitting events later, although abstractions like TryStream assume a stream is dead once it has returned its first error. See https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_collect for example.
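A small illustrative example of that `TryStream` behaviour (unrelated to `if-watch` itself): `try_collect` stops consuming the stream at the first `Err`, even though the underlying stream could keep producing items.

```rust
use futures::executor::block_on;
use futures::stream::{self, TryStreamExt};

fn main() {
    // A stream that yields an error in the middle but would keep going.
    let s = stream::iter(vec![Ok::<_, &str>(1), Err("boom"), Ok(2)]);

    // `try_collect` treats the first `Err` as the end of the stream.
    let collected = block_on(s.try_collect::<Vec<_>>());
    assert_eq!(collected, Err("boom")); // the trailing `Ok(2)` is never observed
}
```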
The implementation, however, never returns None; see https://github.com/mxinden/if-watch/blob/master/src/linux.rs for example. I think this is the bug: for some reason, we return the error here if the underlying connection is_ready but we don't have any messages.
@mxinden Do you know why that was implemented this way? It looks like there are a few subtle things going on there that aren't immediately clear.
Thanks @michalkucharczyk for reporting.
> @mxinden Do you know why that was implemented this way?
I don't unfortunately.
I am not sure what the best next step is.
- In case all `if-watch` errors are fatal, I would suggest returning `None` after the first `Error` in `if-watch` and terminating in `libp2p-mdns` on the first `if-watch` error (see the sketch below).
- In case not all `if-watch` errors are fatal, we could at least wake our own waker and yield (`Poll::Pending`) in `libp2p-mdns`. Though intuitively, that sounds like a hack.
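For concreteness, a rough sketch of the first option as a generic wrapper, not the actual `if-watch` code (`FuseOnError` is a made-up name): after forwarding the first `Err`, every subsequent poll reports the end of the stream.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Stream;

/// Wrapper that ends the stream after the first error has been yielded.
struct FuseOnError<S> {
    inner: S,
    errored: bool,
}

impl<S, T, E> Stream for FuseOnError<S>
where
    S: Stream<Item = Result<T, E>> + Unpin,
{
    type Item = Result<T, E>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.errored {
            // After the first error the stream is considered dead.
            return Poll::Ready(None);
        }
        match Pin::new(&mut self.inner).poll_next(cx) {
            Poll::Ready(Some(Err(e))) => {
                self.errored = true;
                Poll::Ready(Some(Err(e)))
            }
            other => other,
        }
    }
}
```

`libp2p-mdns` would then observe `None` and could stop driving the watcher instead of spinning on errors; alternatively, `if-watch` could implement the same fused behaviour internally.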
Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped/readded automatically on a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, on a loop, every couple of seconds - but only for local addresses.
> Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped/readded automatically on a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, on a loop, every couple of seconds - but only for local addresses.
Would you be up to patching if-watch the way @mxinden described above and deploying that to see whether it helps?
I haven't done any deep dive into the if-watch code to understand the issue fully and patch easily but I can certainly try. A couple pointers would be appreciated, but not essential.
> I haven't done any deep dive into the if-watch code to understand the issue fully and patch easily but I can certainly try. A couple pointers would be appreciated, but not essential.
I think compiling for Android will result in the fallback code being active. As you can see here, this Stream never ends:
https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L73-L86
The only error that can be emitted comes from here: https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L51 which is defined in https://github.com/messense/if-addrs/blob/947c6342681b047b48b5f53eb75049881d2dfa20/src/lib.rs#L143-L212.
We'd have to audit this function to see whether the errors are permanent or not. If they are, we should be returning None from our Stream implementation after the first error.
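Assuming those errors are indeed permanent, a minimal sketch of what that could look like for a periodically re-polling fallback (names such as `watch_interfaces` and `poll_interval` are made up, and this is not the real fallback implementation): stop re-polling after the first failure, which for a `Stream` corresponds to returning `None`.

```rust
use std::collections::HashSet;
use std::net::IpAddr;
use std::time::Duration;

// Sketch of a fallback watcher: periodically re-read the interface list,
// diff it against the previous snapshot, and treat the first error from
// `get_if_addrs` as fatal instead of surfacing it on every poll.
async fn watch_interfaces(poll_interval: Duration) {
    let mut known: HashSet<IpAddr> = HashSet::new();
    loop {
        match if_addrs::get_if_addrs() {
            Ok(interfaces) => {
                let current: HashSet<IpAddr> = interfaces.iter().map(|i| i.ip()).collect();
                for added in current.difference(&known) {
                    println!("interface address up: {added}");
                }
                for removed in known.difference(&current) {
                    println!("interface address down: {removed}");
                }
                known = current;
            }
            Err(e) => {
                // Treat the error as fatal: stop watching instead of
                // emitting the same error forever.
                eprintln!("get_if_addrs failed, stopping watcher: {e}");
                break;
            }
        }
        futures_timer::Delay::new(poll_interval).await;
    }
}
```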
@thomaseizinger turns out the issue I'm seeing on Android is unrelated (https://github.com/mxinden/if-watch/issues/30), but I would have taken a way longer time finding it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.
> @thomaseizinger turns out the issue I'm seeing on Android is unrelated (mxinden/if-watch#30), but I would have taken a way longer time finding it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.
No worries at all, glad we could identify another bug! :)
Is there any update on this?
I am not aware of any updates. I am fine with either of the two solutions suggested in https://github.com/libp2p/rust-libp2p/issues/3390#issuecomment-1443154305. @bkchr are you still seeing this in the wild?
Yes, it is very easy to reproduce. I needed to kill my process twice today with kill -9 (we wait 60 seconds for stuck futures before force-shutting down the node).
> - In case all `if-watch` errors are fatal, I would suggest returning `None` after the first `Error` in `if-watch` and terminating in `libp2p-mdns` on the first `if-watch` error.
@michalkucharczyk maybe something you could do?
> @michalkucharczyk maybe something you could do?
I'll put it on my todo-list.