`protocols/mdns` infinite loop?
Summary
It seems there is a busy loop which never terminates when the process is being shut down. I observe the problem in a polkadot node.
Expected behaviour
The loop should terminate when it encounters erroneous conditions.
Actual behaviour
With RUST_LOG=trace, I see logs flooding the terminal. Parts of the log are here. (However, the problem is not strictly related to this PR; it can also be reproduced by pressing Ctrl-C during startup.)
A single log iteration looks as follows:
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling Connection
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: reading incoming messages
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_read_messages called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling socket
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::codecs: NetlinkCodec: decoding next message
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::codecs: buffer is empty
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_sys::tokio: poll_recv_from called
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker netlink_proto::framed: failed to read from netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
2023-01-25 12:51:28.215 WARN tokio-runtime-worker netlink_proto::connection: netlink socket stream shut down
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwarding unsolicited messages to the connection handle
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages done
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwaring responses to previous requests to the connection handle
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses done
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: handling requests
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_requests called
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: sending messages
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages done
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_flush called
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::framed: flushing frame; length=24
2023-01-25 12:51:28.215 WARN tokio-runtime-worker netlink_proto::connection: error flushing netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: done polling Connection
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker libp2p_mdns::behaviour: if watch returned an error: rtnetlink socket closed
I believe the problematic busy loop is here.
When Tokio is being shut down, rtnetlink provides a stream of errors, and the loop never terminates.
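For illustration, a hypothetical reduction of that pattern (not the actual `libp2p-mdns` code; `drain_if_events` is a made-up name): a poll loop that only logs errors and immediately polls again never makes progress and never really yields if the source fails permanently.

```rust
// Hypothetical reduction of the pattern (not the actual libp2p-mdns code):
// if the watcher stream keeps yielding `Err` and never returns `None`,
// this loop never breaks, because the stream is always "ready" with
// another error.
use futures::{Stream, StreamExt};

async fn drain_if_events<S, T, E>(mut watcher: S)
where
    S: Stream<Item = Result<T, E>> + Unpin,
    E: std::fmt::Display,
{
    loop {
        match watcher.next().await {
            Some(Ok(_event)) => { /* handle the interface change */ }
            Some(Err(e)) => {
                // Only logging here means we immediately poll the stream
                // again; a permanently failing source turns this into a
                // busy loop.
                log::error!("if watch returned an error: {}", e);
            }
            None => break, // never reached if the stream never ends
        }
    }
}
```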
Version
- libp2p version (version number, commit, or branch): 0.50.0
Would you like to work on fixing this bug?
Maybe.
This sounds like we need to patch if-watch.
From an interface perspective, a stream ends when it returns None. It is not wrong per se for a stream to return Result::Err and continue emitting events later, although abstractions like TryStream assume a stream is dead once it has returned its first error. See https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_collect for example.
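A small illustrative example of that `TryStream` behaviour (unrelated to `if-watch` itself): `try_collect` stops consuming the stream at the first `Err`, even though the underlying stream could keep producing items.

```rust
use futures::executor::block_on;
use futures::stream::{self, TryStreamExt};

fn main() {
    // A stream that yields an error in the middle but would keep going.
    let s = stream::iter(vec![Ok::<_, &str>(1), Err("boom"), Ok(2)]);

    // `try_collect` treats the first `Err` as the end of the stream.
    let collected = block_on(s.try_collect::<Vec<_>>());
    assert_eq!(collected, Err("boom")); // the trailing `Ok(2)` is never observed
}
```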
The implementation, however, never returns None; see https://github.com/mxinden/if-watch/blob/master/src/linux.rs for example. I think this is the bug: for some reason, we return the error here if the underlying connection is_ready but we don't have any messages.
@mxinden Do you know why that was implemented this way? It looks like there are a few subtle things going on there that aren't immediately clear.
Thanks @michalkucharczyk for reporting.
> @mxinden Do you know why that was implemented this way?
I don't unfortunately.
I am not sure what the best next step is.
- In case all `if-watch` errors are fatal, I would suggest returning `None` after the first `Error` in `if-watch` and terminating in `libp2p-mdns` on the first `if-watch` error (see the sketch below).
- In case not all `if-watch` errors are fatal, we could at least wake our own waker and yield (`Poll::Pending`) in `libp2p-mdns`. Though intuitively, that sounds like a hack.
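For concreteness, a rough sketch of the first option as a generic wrapper, not the actual `if-watch` code (`FuseOnError` is a made-up name): after forwarding the first `Err`, every subsequent poll reports the end of the stream.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Stream;

/// Wrapper that ends the stream after the first error has been yielded.
struct FuseOnError<S> {
    inner: S,
    errored: bool,
}

impl<S, T, E> Stream for FuseOnError<S>
where
    S: Stream<Item = Result<T, E>> + Unpin,
{
    type Item = Result<T, E>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.errored {
            // After the first error the stream is considered dead.
            return Poll::Ready(None);
        }
        match Pin::new(&mut self.inner).poll_next(cx) {
            Poll::Ready(Some(Err(e))) => {
                self.errored = true;
                Poll::Ready(Some(Err(e)))
            }
            other => other,
        }
    }
}
```

`libp2p-mdns` would then observe `None` and could stop driving the watcher instead of spinning on errors; alternatively, `if-watch` could implement the same fused behaviour internally.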
Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped/readded automatically on a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, on a loop, every couple of seconds - but only for local addresses.
> Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped/readded automatically on a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, on a loop, every couple of seconds - but only for local addresses.
Would you be up to patching if-watch the way @mxinden described above and deploying that to see whether it helps?
I haven't done any deep dive into the if-watch code to understand the issue fully and patch easily but I can certainly try. A couple pointers would be appreciated, but not essential.
> I haven't done any deep dive into the if-watch code to understand the issue fully and patch easily but I can certainly try. A couple pointers would be appreciated, but not essential.
I think compiling for Android will result in the fallback code being active. As you can see here, this Stream never ends:
https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L73-L86
The only error that can be emitted comes from here: https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L51 which is defined in https://github.com/messense/if-addrs/blob/947c6342681b047b48b5f53eb75049881d2dfa20/src/lib.rs#L143-L212.
We'd have to audit this function to see whether the errors are permanent or not. If they are, we should be returning None from our Stream implementation after the first error.
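Assuming those errors are indeed permanent, a minimal sketch of what that could look like for a periodically re-polling fallback (names such as `watch_interfaces` and `poll_interval` are made up, and this is not the real fallback implementation): stop re-polling after the first failure, which for a `Stream` corresponds to returning `None`.

```rust
use std::collections::HashSet;
use std::net::IpAddr;
use std::time::Duration;

// Sketch of a fallback watcher: periodically re-read the interface list,
// diff it against the previous snapshot, and treat the first error from
// `get_if_addrs` as fatal instead of surfacing it on every poll.
async fn watch_interfaces(poll_interval: Duration) {
    let mut known: HashSet<IpAddr> = HashSet::new();
    loop {
        match if_addrs::get_if_addrs() {
            Ok(interfaces) => {
                let current: HashSet<IpAddr> = interfaces.iter().map(|i| i.ip()).collect();
                for added in current.difference(&known) {
                    println!("interface address up: {added}");
                }
                for removed in known.difference(&current) {
                    println!("interface address down: {removed}");
                }
                known = current;
            }
            Err(e) => {
                // Treat the error as fatal: stop watching instead of
                // emitting the same error forever.
                eprintln!("get_if_addrs failed, stopping watcher: {e}");
                break;
            }
        }
        futures_timer::Delay::new(poll_interval).await;
    }
}
```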
@thomaseizinger turns out the issue I'm seeing on Android is unrelated (https://github.com/mxinden/if-watch/issues/30), but I would have taken a way longer time finding it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.
> @thomaseizinger turns out the issue I'm seeing on Android is unrelated (mxinden/if-watch#30), but I would have taken a way longer time finding it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.
No worries at all, glad we could identify another bug! :)
Is there any update on this?
I am not aware of any updates. I am fine with either of the two solutions suggested in https://github.com/libp2p/rust-libp2p/issues/3390#issuecomment-1443154305. @bkchr are you still seeing this in the wild?
Yes, it is very easy to reproduce. I needed to kill my process twice today with kill -9 (we wait 60 seconds for stuck futures before force-shutting down the node).
> - In case all `if-watch` errors are fatal, I would suggest returning `None` after the first `Error` in `if-watch` and terminating in `libp2p-mdns` on the first `if-watch` error.
@michalkucharczyk maybe something you could do?
> @michalkucharczyk maybe something you could do?
I'll put it on my todo-list.