ponyc icon indicating copy to clipboard operation
ponyc copied to clipboard

Example of deadlock with Pony runtime backpressure

Open slfritchie opened this issue 5 years ago • 1 comments

There's a bug in Wallaroo that I've known about for a while https://github.com/WallarooLabs/wallaroo/issues/3015 and have now figured out how & why it happened. After a chat with @SeanTAllen, we've agreed that this is a good demonstration of the current limits of Pony's runtime backpressure system.

Wallaroo's use of Pony runtime backpressure is limited. The use as described by bug 3015 is here and here. These are methods identical in purpose to a Pony TCPConnectionNotify's methods for managing a TCP connection.

The Wallaroo implementation of the TCP connection managing class, ConnectorSink in here, is not the same as Pony's standard library TCPConnection implementation. (Reasons for the difference aren't relevant enough to describe here.) One difference is the addition of a Timers actor at this location. This Timers actor is used to manage the delay periods after a ConnectorSink's TCP connection has been broken and needs to be re-established.

Here's a sketch of the events that lead to deadlock:

  1. A ConnectorSink actor initiates and successfully establishes a TCP connection to its remote peer.
  2. The same ConnectorSink's TCP connection is closed, e.g., the remote peer has crashed.
  3. In the ConnectorSinkNotify.throttled() method, we call Backpressure.apply(_auth) at here.
  4. We want to re-connect to the remote peer, so we create a Timer and give it to the Timers actor at here
  5. When step 4's timer fires, the Timers actor will execute the apply() function at here
    • Note that this function is executed by the Timers actor, not ConnectorSink!
    • Note also that _tcp_sink.reconnect() sends a message to ConnectorSink that will trigger a new TCP connection attempt.
    • Most importantly, note that ConnectorSink is under pressure!
      • This message send operation will mute the Timers actor.
      • But the muting doesn't happen until after the message send operation is done.
      • Now muted, the Timers actor will not be scheduled to run until it is unmuted.
  6. The reconnect() message arrives at ConnectorSink and triggers a new TCP connection attempt ... which fails, because the remote peer is still down.
  7. ConnectorSink creates a new timer and sends it to the Timers actor, just like step 4.

Now the deadlock starts to manifest itself.

  • The Timers actor is muted, so the 2nd timer request will sit in its mailbox indefinitely.
    • Reminder: this timer contains the code that sends a reconnect() message.
  • The ConnectorSink actor must receive a reconnect() message from the timer before it can re-connect to the remote peer and call ConnectorSinkNotify.unthrottle(), which is the only way to call Backpressure.release() to stop the runtime's backpressure and unmute the Timers actor.

Many other actors inside of Wallaroo send messages to ConnectorSink, which will mute them. A cascade of muted actors grows very quickly, as the backpressure system is designed to do. However, the Timers actor will never process the message that the ConnectorSink actor relies on to release backpressure.

Sean and I have come up with a couple of work-arounds in this particular case that do not apply to the general problem of avoiding this kind of deadlock:

  1. Sean suggested having the Timers actor put itself under pressure to avoid being muted.
    • For example, https://github.com/WallarooLabs/wallaroo/pull/3042/commits/db8f4b1c1b52be2f495cacea788b1a6f2fa45d77#diff-d4ed13b77de2c4736674cce3460a0cb0R1377-R1390
  2. My work-around is that we know that ConnectorSink's Timers actor only manages 0 or 1 timers at a time. If we instantiate a new Timers actor for each timer (instead of reusing a single Timers actor), then the current runtime implementation of allowing a message send operation before being muted ... is sufficient to avoid this deadlock case.

slfritchie avatar Nov 07 '19 23:11 slfritchie