
ZRuntime can hang under load due to blocking `flume::send` in queryable callback

Open chachi opened this issue 1 year ago • 4 comments

Describe the bug

Under heavy load, if a queryable using the `DefaultHandler` gets too far behind, the bounded flume queue will eventually fill up. When it does, the next `sender.send(t)` call (in `handlers.rs`) blocks, which blocks the runtime thread it runs on. If more messages arrive, they can block all runtime threads simultaneously until the program is simply hung.

To reproduce

Any queryable that takes some time to process requests should be able to trigger it. A reliable setup is a router with an S3-backed storage containing 5000+ objects, replicating to another router's storage.

It's not deterministic but I was able to reproduce it quite regularly.

System info

  • S3-connected storage was on an EC2 instance running Ubuntu 22.04
  • Zenoh 0.11.0-rc.3

chachi avatar May 26 '24 16:05 chachi

We are going to investigate this. Did you experience the same behaviour in case of a subscriber in addition to a queryable?

Mallets avatar May 28 '24 14:05 Mallets

Thanks! I didn't test a subscription, but looking at the root cause and the fix I applied to address it, I'd expect the same potential issue.

chachi avatar May 28 '24 15:05 chachi

Hello @chachi, we are trying to replicate it and we have managed to get something similar. However, we are not quite sure that it is actually related to the `flume::send` in our experiments. Do you have any additional evidence/trace that points to that `flume::send`?

In addition to that, we have observed different behaviours in case of client and peer modes. Could you provide additional information on your configuration?

Mallets avatar May 31 '24 14:05 Mallets

@Mallets My evidence was that our router would completely deadlock, and a gdb `thread apply all backtrace` would show at least one thread blocked on `flume::send`, with others blocking on a mutex that the send path was holding, IIRC.

I found a few of these "deadlock"/runtime starvation items so you may have also found another one.

chachi avatar Jun 03 '24 15:06 chachi