
ZRuntime can hang under load due to blocking `flume::send` in queryable callback

Open chachi opened this issue 1 year ago • 4 comments

Describe the bug

Under heavy load, if a queryable using the `DefaultHandler` gets too far behind, the bounded flume queue will eventually fill up. When it does, the next `sender.send(t)` call (in `handlers.rs`) blocks, which blocks the runtime thread it runs on. If more messages arrive, they can block all runtime threads simultaneously until the program is simply hung.

To reproduce

Any queryable that takes some time to process requests should be able to trigger it. A reliable setup is a router with an S3-backed storage containing 5000+ objects, replicating to another router's storage.

It's not deterministic but I was able to reproduce it quite regularly.

System info

  • S3-connected storage was on an EC2 instance running Ubuntu 22.04
  • Zenoh 0.11.0-rc.3

chachi avatar May 26 '24 16:05 chachi

We are going to investigate this. Did you experience the same behaviour in case of a subscriber in addition to a queryable?

Mallets avatar May 28 '24 14:05 Mallets

Thanks! I didn't test a subscription, but looking at the root cause and the fix I applied to address it, I'd expect the same potential issue.

chachi avatar May 28 '24 15:05 chachi

Hello @chachi, we are trying to replicate it and we have managed to get something similar. However, we are not quite sure that it is actually related to the `flume::send` in our experiments. Do you have any additional evidence/trace that points to that `flume::send`?

In addition to that, we have observed different behaviours in case of client and peer modes. Could you provide additional information on your configuration?

Mallets avatar May 31 '24 14:05 Mallets

@Mallets My evidence was that our router would completely deadlock, and a gdb `thread apply all backtrace` would show at least one thread blocked on `flume::send`, with others blocking on a mutex that the send path was holding, IIRC.

I found a few of these "deadlock"/runtime starvation items so you may have also found another one.

chachi avatar Jun 03 '24 15:06 chachi