Panic in `libp2p-websocket` at 'SinkImpl::poll_ready called after error.'
CC #2598 CC https://github.com/nimiq/core-rs-albatross/issues/732
2024-06-10T15:15:27.806153220Z ERROR panic | thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.': /home/nimiq/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-websocket-0.43.0/src/quicksink.rs:159
0: <backtrace::capture::Backtrace as core::default::Default>::default
1: log_panics::Config::install_panic_hook::{{closure}}
2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/alloc/src/boxed.rs:2021:9
std::panicking::rust_panic_with_hook
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:735:13
3: std::panicking::begin_panic_handler::{{closure}}
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:601:13
4: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
5: rust_begin_unwind
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
6: core::panicking::panic_fmt
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
7: <libp2p_websocket::quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
8: <rw_stream_sink::RwStreamSink<S> as futures_io::if_std::AsyncWrite>::poll_write
9: <asynchronous_codec::framed::Framed<T,U> as futures_sink::Sink<<U as asynchronous_codec::encoder::Encoder>::Item>>::poll_ready
10: <libp2p_noise::io::Output<T> as futures_io::if_std::AsyncWrite>::poll_write
11: <yamux::frame::io::Io<T> as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready
12: yamux::connection::Connection<T>::poll_next_inbound
13: libp2p_yamux::Muxer<C>::poll_inner
14: <libp2p_yamux::Muxer<C> as libp2p_core::muxing::StreamMuxer>::poll
15: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
16: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
17: libp2p_swarm::connection::Connection<THandler>::poll
18: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
19: tokio::runtime::task::core::Core<T,S>::poll
20: tokio::runtime::task::raw::poll
21: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
22: tokio::runtime::scheduler::multi_thread::worker::Context::run
23: tokio::runtime::context::scoped::Scoped<T>::set
24: tokio::runtime::context::runtime::enter_runtime
25: tokio::runtime::scheduler::multi_thread::worker::run
26: tokio::runtime::task::core::Core<T,S>::poll
27: tokio::runtime::task::harness::Harness<T,S>::poll
28: tokio::runtime::blocking::pool::Inner::run
29: std::sys_common::backtrace::__rust_begin_short_backtrace
30: core::ops::function::FnOnce::call_once{{vtable.shim}}
31: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/alloc/src/boxed.rs:2007:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/alloc/src/boxed.rs:2007:9
std::sys::unix::thread::Thread::new::thread_start
at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys/unix/thread.rs:108:17
32: <unknown>
33: <unknown>
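For context, the panic comes from the quicksink code that ships inside libp2p-websocket (`src/quicksink.rs`): judging by the message, the sink remembers that it already hit an error, and a later `poll_ready` is treated as a logic error and panics instead of returning the error again. A minimal sketch of that invariant (this is not the actual `SinkImpl` source, just an illustration of the behaviour the message describes):

```rust
use std::{
    io,
    pin::Pin,
    task::{Context, Poll},
};

use futures::sink::Sink;

/// Illustration only: a sink that, like the quicksink wrapper, refuses to be
/// polled for readiness again once it has reported an error.
enum State {
    Open,
    Failed,
}

struct IllustrativeSink {
    state: State,
}

impl Sink<Vec<u8>> for IllustrativeSink {
    type Error = io::Error;

    fn poll_ready(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        match self.state {
            State::Open => Poll::Ready(Ok(())),
            // The real quicksink panics here with
            // "SinkImpl::poll_ready called after error."
            State::Failed => panic!("poll_ready called after error"),
        }
    }

    fn start_send(mut self: Pin<&mut Self>, _item: Vec<u8>) -> Result<(), Self::Error> {
        // Simulate the underlying websocket write failing: report the error
        // once and remember that the sink is now unusable.
        self.state = State::Failed;
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "write failed"))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }

    fn poll_close(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(()))
    }
}

fn main() {
    use futures::{executor::block_on, SinkExt};

    let mut sink = IllustrativeSink { state: State::Open };
    // First send: poll_ready succeeds, start_send reports the I/O error.
    assert!(block_on(sink.send(b"frame 1".to_vec())).is_err());
    // Second send: poll_ready is called again on the already-failed sink and
    // panics, mirroring the backtrace above.
    let _ = block_on(sink.send(b"frame 2".to_vec()));
}
```

So somewhere in the chain shown in the backtrace (swarm connection → yamux → noise → asynchronous-codec → rw-stream-sink → websocket) the sink keeps being polled for readiness after an earlier error.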
Again, I don't have a reproducer, but it seems likely this can be fixed without one.
libp2p 0.53.2
libp2p-websocket 0.43.0
libp2p-yamux 0.45.1
yamux 0.12.1
yamux 0.13.2
Could you provide your configuration for how you're using yamux, or are you using the default configuration? Are there any logs you could provide leading up to the panic?
@dariusc93
For Yamux, we use the default configuration (yamux::Config::default())
And the transport for the swarm is constructed as follows:
transport
.upgrade(core::upgrade::Version::V1)
.authenticate(noise::Config::new(keypair).unwrap())
.multiplex(yamux)
.timeout(std::time::Duration::from_secs(20))
.boxed()
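For reference, the surrounding transport construction looks roughly like the sketch below (the TCP/WebSocket transport setup, variable names and feature flags are my approximation, not a copy of our code):

```rust
use std::time::Duration;

use libp2p::{
    core::{muxing::StreamMuxerBox, transport::Boxed, upgrade},
    identity, noise, tcp, websocket, yamux, PeerId, Transport,
};

// Rough sketch of the transport stack (assumes the "tcp", "tokio",
// "websocket", "noise" and "yamux" features of the `libp2p` crate):
// plain TCP plus WebSocket-over-TCP, authenticated with Noise and
// multiplexed with the default yamux config, as in the snippet above.
fn build_transport(keypair: &identity::Keypair) -> Boxed<(PeerId, StreamMuxerBox)> {
    let tcp = tcp::tokio::Transport::new(tcp::Config::default());
    let ws = websocket::WsConfig::new(tcp::tokio::Transport::new(tcp::Config::default()));

    ws.or_transport(tcp)
        .upgrade(upgrade::Version::V1)
        .authenticate(noise::Config::new(keypair).expect("noise config"))
        .multiplex(yamux::Config::default())
        .timeout(Duration::from_secs(20))
        .boxed()
}
```

The relevant point for this issue is that the WebSocket transport sits underneath Noise and yamux, which matches the poll_ready → poll_write chain in the backtrace.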
Unfortunately we don't have more relevant logs at the moment.
@dariusc93 I checked the logs of our nodes around the multiple times we noticed the panics, and looked at what our other peers, which didn't panic, were doing around those times. What I noticed is that the ones that didn't panic closed the connection due to an error (see below).
The direction I'm currently looking at is that, the moment the publish fails, the Connection<THandler> gets polled again and starts processing muxing.poll_unpin for a ConnectionEvent::AddressChange, which shouldn't happen because the connection has already errored. But this isn't backed by anything concrete, because I'm not seeing failed publishes anywhere.
Non-panicked peer:
Jun 10 15:15:26: 2024-06-10T15:15:26.676079329Z DEBUG block_queue | Broadcasting on BlockHeaderTopic block="#6668451:MI:e88944e85d"
Jun 10 15:15:27: 2024-06-10T15:15:27.243774270Z INFO network | Connection closed with peer connection_id=2165980 peer_id=12D3KooWHEyf9LRTEwDXkXWyZGz9T1pv9ciHC1V2wXiGbjNbjhJe endpoint=Listener { local_addr: "/ip4/46.101.171.193/tcp/8443/wss", send_back_addr: "/ip4/168.119.103.198/tcp/33580/wss" } connections=0
Jun 10 15:15:27: 2024-06-10T15:15:27.243845529Z INFO network | Connection closed because cause=Connection error: I/O error: connection is closed
The panicked peers were also broadcasting on the BlockHeaderTopic gossipsub topic just before.
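To make the direction I'm looking at more concrete: the panic implies that something keeps driving the WebSocket sink for readiness after it has already reported an error. A hypothetical wrapper that "fuses" the first error (the name FuseOnError and the code below are mine, not libp2p's) would turn the second poll into an error instead of a panic; the fact that we hit the panic seems to suggest nothing on this path currently does that:

```rust
use std::{
    io,
    pin::Pin,
    task::{Context, Poll},
};

use futures::sink::Sink;

/// Hypothetical guard, not libp2p code: once the inner sink has returned an
/// error, remember it and surface an error on later calls instead of polling
/// the (now poisoned) inner sink again.
struct FuseOnError<S> {
    inner: S,
    errored: bool,
}

impl<S, I> Sink<I> for FuseOnError<S>
where
    S: Sink<I, Error = io::Error> + Unpin,
{
    type Error = io::Error;

    fn poll_ready(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        if self.errored {
            // Never drive the inner sink again after it has failed once.
            return Poll::Ready(Err(io::ErrorKind::BrokenPipe.into()));
        }
        match Pin::new(&mut self.inner).poll_ready(cx) {
            Poll::Ready(Err(e)) => {
                self.errored = true;
                Poll::Ready(Err(e))
            }
            other => other,
        }
    }

    fn start_send(mut self: Pin<&mut Self>, item: I) -> Result<(), Self::Error> {
        if self.errored {
            return Err(io::ErrorKind::BrokenPipe.into());
        }
        let result = Pin::new(&mut self.inner).start_send(item);
        if result.is_err() {
            self.errored = true;
        }
        result
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        // The same "errored" guard would apply here and in poll_close;
        // omitted for brevity.
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Pin::new(&mut self.inner).poll_close(cx)
    }
}
```

Whether such a guard belongs in quicksink itself, in rw-stream-sink, or in the yamux/connection driver is exactly what I can't tell without a reproducer.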
@jxs can you help here? We are experiencing the same issue: https://github.com/paritytech/polkadot-sdk/issues/4934