Drop of `LinkUnicastTcp` can stall Runtime
Describe the bug
I observed the Acceptor runtime thread hanging in the drop of `LinkUnicastTcp`, specifically in the drop of the `TcpStream` inside it.
With the linger parameter set to 10s by default and the `impl Drop for LinkUnicastTcp` commented out since the Tokio port, there is no longer an async shutdown call that lets other work continue while the drop happens.
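To illustrate the mechanism, here is a minimal, std-only sketch and not zenoh code. `SlowLink`, `drop_inline`, and `drop_offloaded` are hypothetical names, and the sleep only simulates a close that blocks while the socket lingers. It shows that a blocking destructor stalls whichever thread runs it, and that moving the value to another thread keeps the caller responsive.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a link whose close can block, such as a TCP
/// socket with a linger timeout set. This is an assumption for illustration.
struct SlowLink {
    linger: Duration,
}

impl Drop for SlowLink {
    fn drop(&mut self) {
        // Simulates a lingering close by sleeping for the linger duration.
        thread::sleep(self.linger);
    }
}

/// Drops the value on the calling thread and returns how long the caller
/// was blocked by the destructor.
fn drop_inline<T>(value: T) -> Duration {
    let start = Instant::now();
    drop(value);
    start.elapsed()
}

/// Moves the value to a helper thread so its destructor runs there.
/// Returns the caller-side delay and the helper thread's join handle.
fn drop_offloaded<T: Send + 'static>(value: T) -> (Duration, thread::JoinHandle<()>) {
    let start = Instant::now();
    let handle = thread::spawn(move || drop(value));
    (start.elapsed(), handle)
}
```

Inside a Tokio runtime, the comparable approach would presumably be to hand the value to `tokio::task::spawn_blocking`, or to call an async shutdown before the drop, so that the Acceptor task is not the one paying for the linger timeout.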
To reproduce
Unsure exactly. My best guess is that when lots of clients attempt to connect, some die before getting out of the Acceptor queue, so the drop happens within the Acceptor queue after a bit of data has been sent.
We have approximately 80 peers connecting to a single router.
System info
Router on AWS EC2 instance
Thanks for the report. We may have stumbled upon something similar while investigating https://github.com/eclipse-zenoh/zenoh/issues/1052 and https://github.com/eclipse-zenoh/zenoh/issues/1053. We'll keep investigating it.
@chachi could you share the topology and configuration of every node in your scenario? It would help the debugging. Thanks!
Sure, happy to. We have one router in the cloud with 80ish peers connecting to it over TCP (over a VPN) and each peer has one client connected over Unix sockets. There are two storages on each peer and the router.