restate icon indicating copy to clipboard operation
restate copied to clipboard

Support enabling PP state machine features in a multi-node setup

Open tillrohrmann opened this issue 10 months ago • 6 comments

With the introduction of distributed Restate, we now have the situation that enabling features in the PP state machine that change its behavior (e.g. writing different messages, following different code paths, etc.) can lead to a divergence of state if not all state machines start changing their behavior at the same log message. An example of a new feature that would be problematic to activate without a coordinated approach is to skip the outbox and shuffle for outgoing messages that are targeted to oneself.

To avoid that state machines go out of sync, I propose to have a new log command StateMachineVersion that tells partition processors which version of the state machine to run. The state machine version would encode which features to use and it would be persisted within the partition store so that one knows about it when restarting. If a partition processor encounters a state machine version that it does not support, then it would fail the partition processor (to make the PP run, one would have to update the node's Restate version).

With such a mechanism, an administrator could instruct nodes to run a certain version (e.g. after it has upgraded all nodes and can guarantee that any newly started nodes will support at least version x). The instruction would be received by the current leader of a partition and would result into writing a new StateMachineVersion to the log.

Ideally, every log would contain a StateMachineVersion and if not, then a PP would write it's currently supported version. However when resuming from older Restate data, these commands weren't written by Restate. Therefore, we either find another way to reliably say that a log is freshly provisioned or require an explicit operator command to use the current state machine version.

tillrohrmann avatar Feb 21 '25 10:02 tillrohrmann

I like it! To my mind, this is a simple version of distributed write schema migrations. We need to make the version switches transactionally, at a given LSN, and we need a sufficient majority of PPs to agree that it's safe to do this.

However when resuming from older Restate data, these commands weren't written by Restate. Therefore, we either find another way to reliably say that a log is freshly provisioned or require an explicit operator command to use the current state machine version.

One possibility to automate is to broadcast a version bump proposal to the cluster, and require that a certain number of followers respond. This could be configurable or derived from the replication property - if we could get an f-majority of the known partition replicas to agree on the version bump, we can be sure that even if some are offline at the time, we can increment the version reasonably safely.

If we have such a mechanism, we can assume that no-stored-version implies some default baseline V0, and attempt to upgrade automatically on startup.

pcholakov avatar Feb 25 '25 14:02 pcholakov

We need to make the version switches transactionally, at a given LSN, and we need a sufficient majority of PPs to agree that it's safe to do this.

Not sure this is even needed, perhaps it can be as simple as adding to the AnnounceLeader message the StateMachine version, and if me as leader/follower I can't support this version, I crash.

slinkydeveloper avatar Feb 28 '25 13:02 slinkydeveloper

That would be "correct" but is it nice behavior? Consider the operator experience if you're doing a rolling software upgrade in a large cluster - you deploy a new version to the first node, it takes leadership for some partitions, and followers still on the older version crash?

We probably want some level of safety there, even if we still recommend a two-phased roll out software version / enable new feature requiring latest state machine version as separate steps. I think this is a little bit against the grain of the "batteries included, it does the sane thing" philosophy generally.

I think that tracking some notion of FSM / schema version is a requisite first step regardless of the upgrade process. I'm pretty keen on the idea of writing a seed message to the partition log as the very first transaction which sets the stage - initial StateMachineVersion, the unique cluster id (cf #2783) that this partition belongs to.

pcholakov avatar Mar 01 '25 10:03 pcholakov

The problem of crashing nodes during a rolling upgrade could be solved by only activating the new state machine version in the next version (similar to how we do it for storage format changes). Additionally, we could have an explicit message that an operator can instruct the nodes to write to the log if one wants to activate the new state machine version in the current version (e.g. after the rolling upgrade has completed).

What's not fully clear to me from the conversation is how a possible seed message will be written (when, who). We also need to make it work with an existing log where this message does not exist (when upgrading from an older version).

tillrohrmann avatar Mar 03 '25 10:03 tillrohrmann

To sum up the current discussion: Right now, we don't have a mechanism to change the behavior of the business logic (PP state machine) which is not based on input data. For example, if we wanted to change how self messages are shuffled (e.g. by directly putting them in the corresponding inbox instead of the outbox), we either need to mark the incoming invocations or have a mechanism where the PP state machine replicas can activate features in a coordinated fashion. Otherwise we risk that different state machine versions take different decisions and end up in a diverged state.

For the latter approach, the following idea might work: Every state machine supports a range of business logic versions. There is a safe version which is guaranteed to be understood by the previous Restate version and a max version. Whenever a PP becomes leader, it will set the max of it's safe version and the maximum seen state machine version so far in the AnnounceLeadership message. Once a PP reads a new AnnounceLeadership message, they know which state machine version to use. The current state machine version will be written to the PartitionStore so that it will be included in snapshots. Operators have also the opportunity to manually bump the state machine version by writing an explicit command to the log (at the cost of sacrificing rolling back if the chosen version is not supported by the previous Restate version).

In order to evolve the state machine logic, one would do the usual dance of backwards compatibility: First introducing a new state machine version V + 1 in Restate 1.x while keeping V as the safe version. In Restate 1.x+1, it's guaranteed that all Restate servers understand V + 1, so that once can set it as the safe version.

tillrohrmann avatar Mar 13 '25 09:03 tillrohrmann

fyi @AhmedSoliman

tillrohrmann avatar Mar 13 '25 09:03 tillrohrmann