restate Action::ScheduleInvocationStatusCleanup can get lost on recoveries

I think that the system is susceptible to losing Action::ScheduleInvocationStatusCleanup messages that are used to communicate to replicas that a completed invocation result should be cleaned up at some point in the future. We do generate the Action::ScheduleInvocationStatusCleanup when we complete an invocation. After writing the completed invocation status to RocksDB, we try to self-propose this message. If the PP loses leadership just before it can self-propose, this message will never be seen by the replicas, because the new leader does not check whether all completed invocations have a cleanup timer registered.

A possible solution is to let the new leader check whether there are completed invocations that don't have a cleanup timer registered (e.g. by storing a flag in the completed invocation that signals this). Alternatively, we don't rely on individual timers for cleaning up completed invocations.

Jul 12 '24 15:07 tillrohrmann

cc @slinkydeveloper

Jul 12 '24 15:07 tillrohrmann

Alternatively, we don't rely on individual timers for cleaning up completed invocations.

I wonder whether this is all in all a better solution, we register just one periodic timer when we get leadership that say every hour scans the invocation list and performs the cleanup of the expired ones

Jul 12 '24 15:07 slinkydeveloper

I wonder whether this is all in all a better solution, we register just one periodic timer when we get leadership that say every hour scans the invocation list and performs the cleanup of the expired ones

If we propagate the firing timer through the log so that also the replicas clean up all completed invocations based on the announced time, I can see this working. Given that this operation does not happen very often, it might be ok to start with a scan operation to find the completed results that can be discarded.

Jul 13 '24 16:07 tillrohrmann

My proposal for this is the following: The leader, once elected, starts a tokio task that periodically reads from rocksdb those scheduled invocations, and self proposes deletions of the old ones. This timer don't need to be durable, it can just be a regular tokio timer.

Aug 13 '24 10:08 slinkydeveloper

A slightly different variant could be that the current leader only proposes the current partition time by periodically writing the time to Bifrost. When applying the partition time message all replicas would move their partition time to the new value and can trigger some time-based operations (e.g. purging of completed invocations).

Using the partition time would be rather coarse-grained and only usable for operations that can be delayed for a bit (the expected delay would be partition time update interval / 2). For cleaning up completed invocations that usually have a retention time of hours, it could work.

The downside of this approach is that every replica would need to find the invocations that can be cleaned up. This operation can be made more efficient by maintaining some form of index over the set of completed invocations if needed.

Aug 14 '24 08:08 tillrohrmann

I'm ok with both approaches, although for the time being I would vow on proposing all the invocation ids to purge, as this won't need a new command, because Command::PurgeInvocation would be proposed, which already exists and it's correctly implemented. This makes a bit easier to solve this specific task now (also without backward/frontward compatibility related problems) and we don't design a new command which, for now, we're not sure how it will work/what will do other than cleaning up invocations.

If this becomes a problem we can always switch to the other approach later, adding the new command to propose time.

Aug 14 '24 10:08 slinkydeveloper