pingora icon indicating copy to clipboard operation
pingora copied to clipboard

Is the log level correct for ERROR 'failed to acquire reusable stream'

Open xushichangdesmond opened this issue 1 year ago • 4 comments
trafficstars

Hi

See the error 'failed to acquire reusable stream' in pingora-core/src/connectors/mod.rs

    pub async fn reused_stream<P: Peer + Send + Sync>(&self, peer: &P) -> Option<Stream> {
        match self.connection_pool.get(&peer.reuse_hash()) {
            Some(s) => {
                debug!("find reusable stream, trying to acquire it");
                {
                    let _ = s.lock().await;
                } // wait for the idle poll to release it
                match Arc::try_unwrap(s) {
                    Ok(l) => {
                        let mut stream = l.into_inner();
                        // test_reusable_stream: we assume server would never actively send data
                        // first on an idle stream.
                        if peer.matches_fd(stream.id()) && test_reusable_stream(&mut stream) {
                            Some(stream)
                        } else {
                            None
                        }
                    }
                    Err(_) => {
                        error!("failed to acquire reusable stream");
                        None
                    }
                }
            }
            None => {
                debug!("No reusable connection found for {peer}");
                None
            }
        }
    }

If I understand correctly, this should be maybe logged on DEBUG level more appropriately instead of error, since it isn't really an error if the other end drops the connection every now and then if theres no inflight requests.

xushichangdesmond avatar Apr 12 '24 10:04 xushichangdesmond

Makes sense, feel free to create a PR.

andrewhavck avatar Apr 12 '24 22:04 andrewhavck

Sure, please check https://github.com/cloudflare/pingora/pull/205

xushichangdesmond avatar Apr 13 '24 02:04 xushichangdesmond

Sorry for the confusion, I misread above and that log is actually due to the Arc::try_unwrap as there is more than one strong reference to the Arc which shouldn't happen. This is technically a bug and that's why we log an error, it's worth investigating further though.

andrewhavck avatar Apr 19 '24 21:04 andrewhavck

ConnectionPool::idle_poll(...) owns the lock via a OwnedMutexGuard, which holds the same arc reference of topic here(whose try_unwrap is failing) via OwnedMutexGuard::lock

after idle_poll exits, that guard is dropped and the lock released, but it is not clear to me that OwnedMutexLock::lock is released before or in atomic fashion with the release of that lock as in reused_stream(...)'s let _ = s.lock().await;

If this sounds possible, maybe we should try to reproduce this in a small test case just using the Arc<Mutex<()>> pattern together with an async use of the mutex's try_lock_owned() or otherwise

xushichangdesmond avatar Apr 22 '24 02:04 xushichangdesmond

Have you been able to reproduce this issue?

andrewhavck avatar May 03 '24 16:05 andrewhavck

Closing this issue as the log level seems appropriate and we're only seeing this log on machines that are in a bad state. If you're able to consistently repro please open a new issue, thanks.

andrewhavck avatar May 10 '24 23:05 andrewhavck

What are possible cause of this? I'm getting this consistently with these peer options

let mut result = PeerOptions::new();
result.tcp_keepalive = Some(TcpKeepalive {
    count: 3,
    idle: Duration::from_secs(60),
    interval: Duration::from_secs(20),
});
result.total_connection_timeout = Some(Duration::from_secs(10));
result.idle_timeout = Some(Duration::from_secs(60));
result.read_timeout = Some(Duration::from_secs(15));
result.write_timeout = Some(Duration::from_secs(15));
result.alpn = ALPN::H2H1;
result.max_h2_streams = 1;
result.h2_ping_interval = None;
result

Object905 avatar Jun 07 '24 12:06 Object905