Registration of multiple PinnedDeployments crashes the server
When running Restate in debug mode, we observed a crash when there were failovers and retries of invocations, because the server was violating the invariant that there can be at most a single PinnedDeployment registered for a given invocation. There seems to be a case where this does not hold true. We need to understand what exact case provokes this situation in order to assess its severity, and then fix it.
Until we create the next release, I will enable the debug assertions for release builds hoping that we find another instance of this problem.
The fundamental issue here is that the PinnedDeployment doesn't take part in the retry shield logic of the invoker, so it needs to be sent atomically together with the first entry. Yet this bug might be innocuous unless the schema registry has two deployments with two different protocol versions, and we're in the distributed scenario.
Aren't we only sending an EffectKind::PinnedDeployment message to the partition processor if there is a JournalEntry to be sent to the partition processor (https://github.com/restatedev/restate/blob/359bb0fa9b306cf65ea4be38a966245a28896c45/crates/invoker-impl/src/lib.rs#L757, https://github.com/restatedev/restate/blob/359bb0fa9b306cf65ea4be38a966245a28896c45/crates/invoker-impl/src/lib.rs#L711, https://github.com/restatedev/restate/blob/359bb0fa9b306cf65ea4be38a966245a28896c45/crates/invoker-impl/src/lib.rs#L808)? Since a JournalEntry takes part in the retry shield logic, and we should maintain the order of messages throughout sending them to the PP, writing them to Bifrost, and then applying them, wouldn't this effectively mean that PinnedDeployment messages take part in the retry shield logic and are atomic with the first JournalEntry from the perspective of the invoker? The retry can only happen after the JournalEntry has been acked, and this entails that the PinnedDeployment message has been applied (assuming order of messages).
> Aren't we only sending an EffectKind::PinnedDeployment message to the partition processor if there is a JournalEntry to be sent to the partition processor
Yes, but those are sent as two separate messages to the channel, so it's not atomic: some interleaving can close that channel such that only the pinned deployment makes it through and not the message afterward. By atomic I mean it should really be just a single Bifrost message.
Why is it a problem if only the PinnedDeployment message gets applied? I assume it would be a problem if the invoker didn't see the effect of it before retrying, but I haven't understood how exactly this would happen.
The one scenario I can think of where we might only write the PinnedDeployment message is if the leader loses leadership. In this case, however, when re-obtaining leadership the AnnounceLeadership message should make sure that a previous PinnedDeployment is applied before re-invoking the invocations, I believe.
What's the exact interleaving you have in mind which can be problematic?
> I assume it would be a problem if the invoker wouldn't see the effect of it before retrying but I haven't understood how this will exactly happen.
Exactly this. The pinned deployment is applied, but while this happens the invoker retries with a different deployment because it hasn't seen the update yet.
> Exactly this. pinned deployment is applied, but while this happens the invoker tries with a different deployment because it hasn't seen yet the update.
How can this happen in your opinion (steps of the interleaving)? It's not clear to me how the invoker would retry without seeing the effect of the PinnedDeployment.
- The pinned deployment is sent to the channel, then immediately sent to Bifrost
- The second send from the invoker now blocks, e.g. because the channel is full, the task is pre-empted, or something like that
- Leadership is lost
- The second send doesn't go through, but nothing happens because the result is ignored
- Stuff happens; a retry within the invoker happens before the invoker reads the Abort-all from the PP
> - pinned deployment sent to the channel, then sent to bifrost immediately
> - now the second send from the invoker blocks because IDK channel full, task pre-empted or smth like that
> - Leadership lost
> - Second send doesn't go through, but nothing happens because result is ignored
> - stuff happens, retry within the invoker happens before invoker reads Abort all from PP.
I don't think that a retry would happen if we miss writing something to the channel, because it would have been registered in the JournalTracker before. In this case, the retry would be stuck forever because the PP would never send a storage ack back.
For posterity: the way we ran into this problem was by load-testing a 3-node Restate cluster (dev mode) using the mock-service endpoint. The cluster was in a bad shape, which caused many leadership changes, which caused retrying of invocations.