Example of deadlock with Pony runtime backpressure
There's a bug in Wallaroo that I've known about for a while (https://github.com/WallarooLabs/wallaroo/issues/3015), and I have now figured out how and why it happens. After a chat with @SeanTAllen, we've agreed that this is a good demonstration of the current limits of Pony's runtime backpressure system.
Wallaroo's use of Pony runtime backpressure is limited. The use as described by bug 3015 is here and here. These are methods identical in purpose to a Pony TCPConnectionNotify's methods for managing a TCP connection.
The Wallaroo implementation of the TCP connection managing class, ConnectorSink in here, is not the same as Pony's standard library TCPConnection implementation. (Reasons for the difference aren't relevant enough to describe here.) One difference is the addition of a Timers actor at this location. This Timers actor is used to manage the delay periods after a ConnectorSink's TCP connection has been broken and needs to be re-established.
Here's a sketch of the events that lead to deadlock:
1. A ConnectorSink actor initiates and successfully establishes a TCP connection to its remote peer.
2. The same ConnectorSink's TCP connection is closed, e.g., the remote peer has crashed.
3. In the ConnectorSinkNotify.throttled() method, we call Backpressure.apply(_auth) here.
4. We want to re-connect to the remote peer, so we create a Timer and give it to the Timers actor here.
5. When step 4's timer fires, the Timers actor will execute the apply() function here.
   - Note that this function is executed by the Timers actor, not ConnectorSink!
   - Note also that _tcp_sink.reconnect() sends a message to ConnectorSink that will trigger a new TCP connection attempt.
   - Most importantly, note that ConnectorSink is under pressure!
     - This message send operation will mute the Timers actor.
     - But the muting doesn't happen until after the message send operation is done.
     - Now muted, the Timers actor will not be scheduled to run until it is unmuted.
6. The reconnect() message arrives at ConnectorSink and triggers a new TCP connection attempt ... which fails, because the remote peer is still down.
7. ConnectorSink creates a new timer and sends it to the Timers actor, just like step 4.
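The ordering in step 5 (the send completes first, and only then is the sender muted) is the detail everything else hinges on. Here is a toy model of just that rule, written in Python rather than Pony. The Actor class and the send() function are hypothetical names made up for this sketch; they are not the runtime's data structures, and the model ignores every other reason an actor might be muted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Actor:
    name: str
    under_pressure: bool = False   # set while Backpressure.apply is in effect
    muted: bool = False            # a muted actor is not scheduled to run
    mailbox: deque = field(default_factory=deque)


def send(sender: Actor, receiver: Actor, msg: str) -> None:
    # The message is enqueued first ...
    receiver.mailbox.append(msg)
    # ... and only then is the sender muted, if the receiver is under pressure.
    if receiver.under_pressure:
        sender.muted = True


sink = Actor("ConnectorSink", under_pressure=True)
timers = Actor("Timers")
send(timers, sink, "reconnect")

# The reconnect message was delivered, and the Timers actor is now muted.
print(list(sink.mailbox), timers.muted)
```

So the first reconnect() always gets through; the trouble only starts with the next timer request.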
Now the deadlock starts to manifest itself.
- The Timers actor is muted, so the 2nd timer request will sit in its mailbox indefinitely.
  - Reminder: this timer contains the code that sends a reconnect() message.
- The ConnectorSink actor must receive a reconnect() message from the timer before it can re-connect to the remote peer and call ConnectorSinkNotify.unthrottle(), which is the only way to call Backpressure.release() to stop the runtime's backpressure and unmute the Timers actor.
Many other actors inside of Wallaroo send messages to ConnectorSink, which will mute them. A cascade of muted actors grows very quickly, as the backpressure system is designed to do. However, the Timers actor will never process the message that the ConnectorSink actor relies on to release backpressure.
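The whole sequence can be replayed in a small simulation. This is a sketch in Python, not Pony, and it is a deliberately crude model: World, on_sink, on_timers, and the message names are all invented for illustration, time is collapsed so a timer "fires" as soon as the Timers actor processes it, and the only muting rule modeled is the one described in step 5.

```python
from collections import deque


class Actor:
    def __init__(self, name):
        self.name = name
        self.mailbox = deque()
        self.muted = False           # muted actors are never scheduled
        self.under_pressure = False  # toggled by Backpressure.apply/release


def send(sender, receiver, msg):
    receiver.mailbox.append(msg)     # the send itself always completes
    if receiver.under_pressure and sender is not receiver:
        sender.muted = True          # muting takes effect after the send


class World:
    def __init__(self):
        self.sink = Actor("ConnectorSink")
        self.timers = Actor("Timers")
        self.peer_up = False

    def on_sink(self, msg):
        if msg == "closed":                        # steps 2 and 3
            self.sink.under_pressure = True        # Backpressure.apply
        if msg == "reconnect" and self.peer_up:
            self.sink.under_pressure = False       # Backpressure.release
            self.timers.muted = False              # senders get unmuted
            return
        send(self.sink, self.timers, "set_timer")  # steps 4 and 7

    def on_timers(self, msg):
        # Time is collapsed: the timer fires as soon as it is processed.
        send(self.timers, self.sink, "reconnect")  # step 5

    def run(self):
        """Schedule unmuted actors with mail until none can make progress."""
        while True:
            if self.sink.mailbox and not self.sink.muted:
                self.on_sink(self.sink.mailbox.popleft())
            elif self.timers.mailbox and not self.timers.muted:
                self.on_timers(self.timers.mailbox.popleft())
            else:
                return


world = World()
world.sink.mailbox.append("closed")
world.run()

world.peer_up = True   # the peer recovers, but nothing can tell the sink
world.run()

print(world.sink.under_pressure, world.timers.muted, list(world.timers.mailbox))
```

After both runs the sink is still under pressure, the Timers actor is still muted, and the second timer request is still sitting unprocessed in its mailbox, even though the peer has come back.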
Sean and I have come up with a couple of work-arounds in this particular case that do not apply to the general problem of avoiding this kind of deadlock:
- Sean suggested having the Timers actor put itself under pressure to avoid being muted.
  - For example, https://github.com/WallarooLabs/wallaroo/pull/3042/commits/db8f4b1c1b52be2f495cacea788b1a6f2fa45d77#diff-d4ed13b77de2c4736674cce3460a0cb0R1377-R1390
- My work-around relies on knowing that ConnectorSink's Timers actor only manages 0 or 1 timers at a time. If we instantiate a new Timers actor for each timer (instead of reusing a single Timers actor), then the runtime's current behavior of allowing a message send operation to complete before the sender is muted is sufficient to avoid this deadlock case.
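Both work-arounds can be compared against the broken behavior in one toy model. As before, this is Python rather than Pony, and simulate(), its workaround argument, and the message names are hypothetical. The model assumes that a sender which is itself under pressure is exempt from muting, which is what Sean's suggestion depends on, and that the peer comes back after a fixed number of failed attempts.

```python
from collections import deque


class Actor:
    def __init__(self, name, under_pressure=False):
        self.name = name
        self.mailbox = deque()
        self.muted = False
        self.under_pressure = under_pressure


def send(sender, receiver, msg):
    receiver.mailbox.append(msg)   # the send completes before any muting
    # Assumed exemption: a sender that is itself under pressure is not muted.
    if receiver.under_pressure and not sender.under_pressure:
        sender.muted = True


def simulate(workaround=None, failures_before_peer_returns=2):
    """Return True if the sink reconnects and releases backpressure."""
    sink = Actor("ConnectorSink", under_pressure=True)  # connection lost
    timers = Actor("Timers", under_pressure=(workaround == "pressure"))
    send(sink, timers, "set_timer")
    failures = 0
    while True:
        if timers.mailbox and not timers.muted:
            timers.mailbox.popleft()
            send(timers, sink, "reconnect")      # the timer fires
        elif sink.mailbox:
            sink.mailbox.popleft()
            if failures == failures_before_peer_returns:
                sink.under_pressure = False      # reconnected: release
                return True
            failures += 1
            if workaround == "fresh":
                timers = Actor("Timers")         # new Timers actor per timer
            send(sink, timers, "set_timer")
        else:
            return False                         # nothing can run: deadlock


for name in (None, "pressure", "fresh"):
    print(name, simulate(name))
```

With no work-around the loop stalls after the first failed reconnect. With "pressure" the Timers actor is never muted. With "fresh" each Timers actor is muted only after it has already sent its one reconnect() message, so the muting no longer matters.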