ZRuntime can hang due to blocking `flume::recv` and `sync::Mutex` locks inside `TransmissionPipelineProducer::push_network_message`

Open chachi opened this issue 1 year ago • 0 comments

Describe the bug

The runtime can stall when multiple workers wait on a `std::sync::Mutex` while one of them performs a blocking `flume::recv`. Eventually every worker is stalled on either the mutex or a `recv`.

The trouble starts with a `flume::recv` in pipeline.rs that, for a reason not yet identified, never returns and blocks indefinitely. This is made worse because a `sync::Mutex` is locked in order to call `StageIn::push_network_message`. Other workers that attempt to lock the same Mutex are left waiting, and eventually the entire tokio runtime is held up.
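The pattern described above can be sketched in isolation. This is not Zenoh code: it is a minimal illustration that assumes `std::sync::mpsc` as a stand-in for `flume` and plain OS threads as a stand-in for tokio workers, and the names `Stage` and `count_stalled_workers` are hypothetical. One thread takes the lock and then sits in a blocking `recv`; the other threads probe the lock with `try_lock` so the example can report the contention instead of hanging.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Mutex, TryLockError};
use std::thread;

/// Hypothetical stand-in for the pipeline stage guarded by the mutex.
struct Stage {
    pushed: usize,
}

/// Spawns one holder thread that blocks in `recv` while holding the lock,
/// then `workers` probe threads that try to take the same lock.
/// Returns (number of probes that found the lock taken, final `pushed` count).
pub fn count_stalled_workers(workers: usize) -> (usize, usize) {
    let stage = Arc::new(Mutex::new(Stage { pushed: 0 }));
    let (locked_tx, locked_rx) = mpsc::channel::<()>();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    let holder = {
        let stage = Arc::clone(&stage);
        thread::spawn(move || {
            let mut guard = stage.lock().unwrap();
            // Tell the main thread that the lock is now held.
            locked_tx.send(()).unwrap();
            // Blocking receive while the guard is still alive: nobody else
            // can lock `stage` until a message arrives on this channel.
            release_rx.recv().unwrap();
            guard.pushed += 1;
        })
    };

    // Wait until the holder really owns the lock before probing.
    locked_rx.recv().unwrap();

    let stalled = Arc::new(AtomicUsize::new(0));
    let probes: Vec<_> = (0..workers)
        .map(|_| {
            let stage = Arc::clone(&stage);
            let stalled = Arc::clone(&stalled);
            thread::spawn(move || {
                // A real worker would call `lock()` here and park forever;
                // `try_lock` only reports that it would have blocked.
                let would_block =
                    matches!(stage.try_lock(), Err(TryLockError::WouldBlock));
                if would_block {
                    stalled.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for probe in probes {
        probe.join().unwrap();
    }

    // Unblock the holder. In the reported bug this message never arrives.
    release_tx.send(()).unwrap();
    holder.join().unwrap();

    let pushed = stage.lock().unwrap().pushed;
    (stalled.load(Ordering::SeqCst), pushed)
}
```

Calling `count_stalled_workers(4)` returns `(4, 1)`: all four probes find the lock taken for as long as the holder is parked in `recv`. If the probes used `lock()` and the release message were never sent, every thread would be stuck, which is the situation the issue describes for the runtime's workers.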

To reproduce

Heavy load on a router with S3-backed storage seems to trigger it quite quickly.

System info

  • Zenoh 0.11.0-rc.3
  • Router running on an EC2 instance with S3 backend

chachi, May 26 '24 16:05