Panic in `libp2p-websocket` at 'SinkImpl::poll_ready called after error.'
CC #2598 CC https://github.com/nimiq/core-rs-albatross/issues/732
2024-06-10T15:15:27.806153220Z ERROR panic | thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.': /home/nimiq/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-websocket-0.43.0/src/quicksink.rs:159
0: <backtrace::capture::Backtrace as core::default::Default>::default
1: log_panics::Config::install_panic_hook::{{closure}}
2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/alloc/src/boxed.rs:2021:9
std::panicking::rust_panic_with_hook
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:735:13
3: std::panicking::begin_panic_handler::{{closure}}
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:601:13
4: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
5: rust_begin_unwind
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
6: core::panicking::panic_fmt
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
7: <libp2p_websocket::quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
8: <rw_stream_sink::RwStreamSink<S> as futures_io::if_std::AsyncWrite>::poll_write
9: <asynchronous_codec::framed::Framed<T,U> as futures_sink::Sink<<U as asynchronous_codec::encoder::Encoder>::Item>>::poll_ready
10: <libp2p_noise::io::Output<T> as futures_io::if_std::AsyncWrite>::poll_write
11: <yamux::frame::io::Io<T> as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready
12: yamux::connection::Connection<T>::poll_next_inbound
13: libp2p_yamux::Muxer<C>::poll_inner
14: <libp2p_yamux::Muxer<C> as libp2p_core::muxing::StreamMuxer>::poll
15: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
16: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
17: libp2p_swarm::connection::Connection<THandler>::poll
18: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
19: tokio::runtime::task::core::Core<T,S>::poll
20: tokio::runtime::task::raw::poll
21: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
22: tokio::runtime::scheduler::multi_thread::worker::Context::run
23: tokio::runtime::context::scoped::Scoped<T>::set
24: tokio::runtime::context::runtime::enter_runtime
25: tokio::runtime::scheduler::multi_thread::worker::run
26: tokio::runtime::task::core::Core<T,S>::poll
27: tokio::runtime::task::harness::Harness<T,S>::poll
28: tokio::runtime::blocking::pool::Inner::run
29: std::sys_common::backtrace::__rust_begin_short_backtrace
30: core::ops::function::FnOnce::call_once{{vtable.shim}}
31: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/alloc/src/boxed.rs:2007:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/alloc/src/boxed.rs:2007:9
std::sys::unix::thread::Thread::new::thread_start
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys/unix/thread.rs:108:17
32: <unknown>
33: <unknown>
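For context, the panic comes from the quicksink code that ships inside libp2p-websocket (`src/quicksink.rs`): judging by the message, the sink remembers that it already hit an error, and a later `poll_ready` is treated as a logic error and panics instead of returning the error again. A minimal sketch of that invariant (this is not the actual `SinkImpl` source, just an illustration of the behaviour the message describes):

```rust
use std::{
    io,
    pin::Pin,
    task::{Context, Poll},
};

use futures::sink::Sink;

/// Illustration only: a sink that, like the quicksink wrapper, refuses to be
/// polled for readiness again once it has reported an error.
enum State {
    Open,
    Failed,
}

struct IllustrativeSink {
    state: State,
}

impl Sink<Vec<u8>> for IllustrativeSink {
    type Error = io::Error;

    fn poll_ready(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        match self.state {
            State::Open => Poll::Ready(Ok(())),
            // The real quicksink panics here with
            // "SinkImpl::poll_ready called after error."
            State::Failed => panic!("poll_ready called after error"),
        }
    }

    fn start_send(mut self: Pin<&mut Self>, _item: Vec<u8>) -> Result<(), Self::Error> {
        // Simulate the underlying websocket write failing: report the error
        // once and remember that the sink is now unusable.
        self.state = State::Failed;
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "write failed"))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }

    fn poll_close(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }
}

fn main() {
    use futures::{executor::block_on, SinkExt};

    let mut sink = IllustrativeSink { state: State::Open };
    // First send: poll_ready succeeds, start_send reports the I/O error.
    assert!(block_on(sink.send(b"frame 1".to_vec())).is_err());
    // Second send: poll_ready is called again on the already-failed sink and
    // panics, mirroring the backtrace above.
    let _ = block_on(sink.send(b"frame 2".to_vec()));
}
```

So somewhere in the chain shown in the backtrace (swarm connection → yamux → noise → asynchronous-codec → rw-stream-sink → websocket) the sink keeps being polled for readiness after an earlier error.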
Again, I don't have a reproducer, but it seems likely this can be fixed without one.
libp2p 0.53.2
libp2p-websocket 0.43.0
libp2p-yamux 0.45.1
yamux 0.12.1
yamux 0.13.2
Could you provide your configuration for how you're using yamux, or are you using the default configuration? Are there any logs you could provide leading up to the panic?
@dariusc93
For Yamux, we use the default configuration (yamux::Config::default())
And the transport for the swarm is constructed as follows:
transport
.upgrade(core::upgrade::Version::V1)
.authenticate(noise::Config::new(keypair).unwrap())
.multiplex(yamux)
.timeout(std::time::Duration::from_secs(20))
.boxed()
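For reference, the surrounding transport construction looks roughly like the sketch below (the TCP/WebSocket transport setup, variable names and feature flags are my approximation, not a copy of our code):

```rust
use std::time::Duration;

use libp2p::{
    core::{muxing::StreamMuxerBox, transport::Boxed, upgrade},
    identity, noise, tcp, websocket, yamux, PeerId, Transport,
};

// Rough sketch of the transport stack (assumes the "tcp", "tokio",
// "websocket", "noise" and "yamux" features of the `libp2p` crate):
// plain TCP plus WebSocket-over-TCP, authenticated with Noise and
// multiplexed with the default yamux config, as in the snippet above.
fn build_transport(keypair: &identity::Keypair) -> Boxed<(PeerId, StreamMuxerBox)> {
    let tcp = tcp::tokio::Transport::new(tcp::Config::default());
    let ws = websocket::WsConfig::new(tcp::tokio::Transport::new(tcp::Config::default()));

    ws.or_transport(tcp)
        .upgrade(upgrade::Version::V1)
        .authenticate(noise::Config::new(keypair).expect("noise config"))
        .multiplex(yamux::Config::default())
        .timeout(Duration::from_secs(20))
        .boxed()
}
```

The relevant point for this issue is that the WebSocket transport sits underneath Noise and yamux, which matches the poll_ready → poll_write chain in the backtrace.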
Unfortunately we don't have more relevant logs at the moment.
@dariusc93 I checked the logs of our nodes around the multiple times we noticed the panics, and looked at what our other peers, which didn't panic, were doing around those times. What I noticed is that the ones that didn't panic closed the connection due to an error (see below).
The direction I'm currently looking at is that, the moment the publish fails, the Connection<THandler> gets polled again and starts processing muxing.poll_unpin for a ConnectionEvent::AddressChange, which shouldn't happen because the connection has already errored. But this isn't backed by anything concrete, because I'm not seeing failed publishes anywhere.
Non-panicked peer:
Jun 10 15:15:26: 2024-06-10T15:15:26.676079329Z DEBUG block_queue | Broadcasting on BlockHeaderTopic block="#6668451:MI:e88944e85d"
Jun 10 15:15:27: 2024-06-10T15:15:27.243774270Z INFO network | Connection closed with peer connection_id=2165980 peer_id=12D3KooWHEyf9LRTEwDXkXWyZGz9T1pv9ciHC1V2wXiGbjNbjhJe endpoint=Listener { local_addr: "/ip4/46.101.171.193/tcp/8443/wss", send_back_addr: "/ip4/168.119.103.198/tcp/33580/wss" } connections=0
Jun 10 15:15:27: 2024-06-10T15:15:27.243845529Z INFO network | Connection closed because cause=Connection error: I/O error: connection is closed
The panicked peers were also broadcasting on the BlockHeaderTopic gossipsub topic just before.
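To make the direction I'm looking at more concrete: the panic implies that something keeps driving the WebSocket sink for readiness after it has already reported an error. A hypothetical wrapper that "fuses" the first error (the name FuseOnError and the code below are mine, not libp2p's) would turn the second poll into an error instead of a panic; the fact that we hit the panic seems to suggest nothing on this path currently does that:

```rust
use std::{
    io,
    pin::Pin,
    task::{Context, Poll},
};

use futures::sink::Sink;

/// Hypothetical guard, not libp2p code: once the inner sink has returned an
/// error, remember it and surface an error on later calls instead of polling
/// the (now poisoned) inner sink again.
struct FuseOnError<S> {
    inner: S,
    errored: bool,
}

impl<S, I> Sink<I> for FuseOnError<S>
where
    S: Sink<I, Error = io::Error> + Unpin,
{
    type Error = io::Error;

    fn poll_ready(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        if self.errored {
            // Never drive the inner sink again after it has failed once.
            return Poll::Ready(Err(io::ErrorKind::BrokenPipe.into()));
        }
        match Pin::new(&mut self.inner).poll_ready(cx) {
            Poll::Ready(Err(e)) => {
                self.errored = true;
                Poll::Ready(Err(e))
            }
            other => other,
        }
    }

    fn start_send(mut self: Pin<&mut Self>, item: I) -> Result<(), Self::Error> {
        if self.errored {
            return Err(io::ErrorKind::BrokenPipe.into());
        }
        let result = Pin::new(&mut self.inner).start_send(item);
        if result.is_err() {
            self.errored = true;
        }
        result
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        // The same "errored" guard would apply here and in poll_close;
        // omitted for brevity.
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Pin::new(&mut self.inner).poll_close(cx)
    }
}
```

Whether such a guard belongs in quicksink itself, in rw-stream-sink, or in the yamux/connection driver is exactly what I can't tell without a reproducer.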
@jxs can you help here? We are experiencing the same issue: https://github.com/paritytech/polkadot-sdk/issues/4934