zenoh
ZRuntime can hang due to blocking `flume::recv` and `sync::Mutex` locks inside `TransmissionPipelineProducer::push_network_message`
Describe the bug
The runtime can stall when multiple workers wait on a `std::sync::Mutex` while one of them is stuck in a blocking `flume::recv`. Eventually every worker is stalled on either the mutex or a `recv`.
The trouble starts with a `flume::recv` in pipeline.rs that, for reasons not yet identified, never returns and blocks indefinitely. This is made worse by the `sync::Mutex` that is locked in order to call `StageIn::push_network_message`: other workers that attempt to lock the same mutex are left waiting, and eventually the entire tokio runtime is held up.
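To illustrate the pattern described above, here is a minimal sketch of it, not the actual zenoh code. It assumes `std::sync::mpsc` as a stand-in for the flume channel and a plain OS thread as a stand-in for a tokio worker; `stage_in`, `refill_rx` and the function name are hypothetical. One worker takes the mutex and then parks in a blocking `recv`; a second party that tries the same mutex finds it unavailable until the message arrives.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Returns true if a second party found the mutex unavailable while the
/// first worker was parked in a blocking `recv` with the lock held.
fn lock_held_across_blocking_recv() -> bool {
    // Hypothetical stand-in for the mutex-protected StageIn.
    let stage_in = Arc::new(Mutex::new(0u32));
    // Stand-in for the flume channel the pipeline blocks on (assumption).
    let (refill_tx, refill_rx) = mpsc::channel::<()>();
    // Only used to tell the caller that the worker now owns the lock.
    let (locked_tx, locked_rx) = mpsc::channel::<()>();

    let worker = {
        let stage_in = Arc::clone(&stage_in);
        thread::spawn(move || {
            // Lock first, as is done in order to push a network message ...
            let mut guard = stage_in.lock().unwrap();
            locked_tx.send(()).unwrap();
            // ... then block on recv while still holding the guard.
            refill_rx.recv().unwrap();
            *guard += 1;
        })
    };

    // Wait until the worker holds the lock, then act as a second worker.
    locked_rx.recv().unwrap();
    // try_lock is used so the sketch terminates; a plain lock() would wait here.
    let contended = stage_in.try_lock().is_err();

    // Unblock the worker. In the reported hang this message never arrives.
    refill_tx.send(()).unwrap();
    worker.join().unwrap();
    contended
}
```

Calling `lock_held_across_blocking_recv()` from a `main` or a test reports the contention. If the final `send` never happened and the second party used `lock()` instead of `try_lock()`, both threads would wait forever, which is the shape of the stall reported here.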
To reproduce
Heavy load on a router with an S3-backed storage seems to trigger it quite quickly.
System info
- Zenoh 0.11.0-rc.3
- Router running on an EC2 instance with S3 backend