
Add support for bulk cancellations/kills/pause/unpause

Open tillrohrmann opened this issue 2 months ago • 7 comments

In certain scenarios where one has accumulated many invocations that cannot complete, it would be very helpful to have the functionality to bulk cancel/kill those invocations. The criterion could be the creation timestamp of the invocation: Cancel/kill all invocations older than 30 minutes, for example. Ideally, this operation does not need to scan through all available invocations because if the cluster is already stressed, these queries would add even more stress.

tillrohrmann avatar Oct 02 '25 16:10 tillrohrmann

Other criteria we heard from users:

  • by service
  • by service handler
  • by object key
  • by deployment id -> this is asked for in order to decommission an endpoint (in the context of #3144)

Obviously, some of those require a broadcast to the whole cluster.

slinkydeveloper avatar Oct 02 '25 16:10 slinkydeveloper

To retain a good amount of flexibility, also for when we integrate this with the UI (something to figure out while we design this feature), it would be awesome to design the "bulk" operations in a more or less generic way:

  • Given a filter (maybe even in SQL!)
  • Given the operation I want to execute
  • Execute it

Beware that not all operations are WAL commands: some require running an RPC handler first (see Restart as new, and pause and resume), and others are handled entirely by an RPC handler (there's no example in the code yet, but Pause will be like that).
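As a rough illustration of the "filter + operation + execute" shape, here is a minimal sketch in plain Rust. Everything in it is hypothetical: `Invocation`, `BulkOperation`, and `run_bulk` are names made up for this example, not existing Restate types, and the filter is a closure over an in-memory slice rather than SQL. It also scans every invocation, which is exactly what a real implementation would want to avoid on a stressed cluster.

```rust
use std::time::Duration;

// Hypothetical, simplified view of an invocation (not a Restate type).
#[derive(Debug, Clone, PartialEq)]
pub struct Invocation {
    pub id: String,
    pub service: String,
    pub age: Duration,
}

// Hypothetical set of bulk operations, taken from the issue title.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BulkOperation {
    Cancel,
    Kill,
    Pause,
    Resume,
}

// Applies `operation` to every invocation matching `filter` and returns
// the ids that were targeted, in input order.
pub fn run_bulk<P, E>(
    invocations: &[Invocation],
    filter: P,
    operation: BulkOperation,
    mut execute: E,
) -> Vec<String>
where
    P: Fn(&Invocation) -> bool,
    E: FnMut(BulkOperation, &Invocation),
{
    let mut targeted = Vec::new();
    for inv in invocations {
        if filter(inv) {
            execute(operation, inv);
            targeted.push(inv.id.clone());
        }
    }
    targeted
}
```

With this shape, "kill all invocations older than 30 minutes" becomes `run_bulk(&invocations, |inv| inv.age > Duration::from_secs(30 * 60), BulkOperation::Kill, execute)`, and filtering by service or handler only changes the closure.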

slinkydeveloper avatar Oct 02 '25 16:10 slinkydeveloper

Bulk unpausing could also be a really helpful operation.

tillrohrmann avatar Oct 10 '25 11:10 tillrohrmann

I would like to start this from the UI, and then infer the implementation details for the Admin API from there. I opened an issue there.

slinkydeveloper avatar Oct 13 '25 14:10 slinkydeveloper

We can certainly start with the UI, but it’s important to note that running an action against a query is quite different from running an action against multiple selected invocations (e.g., from tables). In each case, the target of the action should be absolutely clear.

We can begin with the multiple-invocations version, but that doesn’t necessarily mean the same UI or implementation can be reused for query-based actions.

nikrooz avatar Oct 13 '25 16:10 nikrooz

Answered on the other thread.

slinkydeveloper avatar Oct 13 '25 16:10 slinkydeveloper

https://github.com/restatedev/restate/pull/3943 makes the CLI commands to kill/cancel/pause/etc. in batch much more usable. I can kill 2 million invocations easily: it takes roughly a minute, but it's a long operation and it gives you back the feedback "was killed or not", so I would say it's an OK tradeoff for now. There is an open issue for the UI side of things, to take a similar approach to the CLI one.

I would say let's finish those efforts and put them out. Right now I have the feeling those might be "good enough", without the need for something more sophisticated on the Admin API with filters and stuff...

slinkydeveloper avatar Nov 04 '25 18:11 slinkydeveloper

@tillrohrmann this is roughly the spec for the ad-hoc UI endpoint:

pub struct BatchOperationRequest {
    pub invocations: Vec<InvocationId>
}

pub struct BatchOperationResult {
    pub succeeded: Vec<InvocationId>,
    pub failed: Vec<InvocationId>
}

// Mounted on /internal/invocations_batch_operations/kill
pub async fn batch_kill_invocations(
    #[request_body(required = true)] Json(payload): Json<BatchOperationRequest>,
) -> Result<BatchOperationResult, BatchInvocationOperationError> {
    // Behavior:
    // * Try to execute kill on all invocations in parallel, using the regular InvocationClient as done above.
    // * Don't fail fast on failures; just collect the failed and succeeded ones to display. Don't propagate the failures themselves for now (the response might be big!).
    //      If the user cares about the errors, they can re-run the operation.
    // * If the Admin API request disconnects, no problem: the batch operation can be canceled.
    // * I guess this operation can fail if the Admin API cannot reach the rest of the cluster, or none of the given invocations succeed, or something like that 🤷
}

// TODO same format for all the other ops (cancel, kill, pause, resume, restart-as-new, purge, purge-journal)
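To make the "run everything in parallel, don't fail fast, collect both lists" behavior concrete, here is a minimal standard-library sketch. It is not the actual endpoint: `run_batch` is a hypothetical helper, `InvocationId` is aliased to `String` for the example, the per-invocation operation is passed in as a closure instead of going through the InvocationClient, and it spawns one OS thread per invocation where the real handler would presumably use async tasks.

```rust
use std::thread;

// Assumption for this sketch: the real InvocationId is a dedicated type.
type InvocationId = String;

#[derive(Debug, Default, PartialEq)]
pub struct BatchOperationResult {
    pub succeeded: Vec<InvocationId>,
    pub failed: Vec<InvocationId>,
}

// Runs `op` on every invocation in parallel and never fails fast:
// each outcome lands in either `succeeded` or `failed`, and the error
// details are dropped on purpose to keep the response small.
pub fn run_batch<F>(invocations: Vec<InvocationId>, op: F) -> BatchOperationResult
where
    F: Fn(&InvocationId) -> Result<(), String> + Sync,
{
    let outcomes: Vec<(InvocationId, bool)> = thread::scope(|s| {
        // Spawn all workers first, then join them in input order.
        let handles: Vec<_> = invocations
            .into_iter()
            .map(|id| {
                let op = &op;
                s.spawn(move || {
                    let ok = op(&id).is_ok();
                    (id, ok)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    let mut result = BatchOperationResult::default();
    for (id, ok) in outcomes {
        if ok {
            result.succeeded.push(id);
        } else {
            result.failed.push(id);
        }
    }
    result
}
```

Because the result only carries ids, a caller that wants to retry can simply resubmit `failed` as the next request.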

As discussed, let's build this only for the UI for now, and let's not use this in CLI, so we can freely modify it later on.

While sorting this one out, it would be nice to understand the maximum size the batch can have. I remember that when playing with the CLI I discovered a magic number, 1011, but didn't figure out where it came from... https://github.com/restatedev/restate/pull/3943/files#diff-40d5e045fe7c0daee65b84a01c8495ddc8a4eecec172e71edbc0c12f9df06383R28

It's ok to hardcode a limit here, and have the batch operation endpoint return 400 if the number of given invocations is too big. The UI will do its own batching anyway in order to show the progress bar.
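A small sketch of both halves of that idea, under stated assumptions: `MAX_BATCH_SIZE = 1000` is a placeholder value (the real limit is still to be determined, see the 1011 question above), and `BadRequest`, `validate_batch_size`, and `split_into_batches` are hypothetical names. The first function is the server-side check that maps to a 400; the second is the kind of client-side chunking the UI could do to drive a progress bar.

```rust
// Placeholder limit for illustration only; the real value is not decided.
pub const MAX_BATCH_SIZE: usize = 1000;

// Hypothetical error shape standing in for an HTTP 400 response.
#[derive(Debug, PartialEq)]
pub struct BadRequest {
    pub status: u16,
    pub message: String,
}

// Server side: reject requests that carry more invocations than allowed.
pub fn validate_batch_size(len: usize) -> Result<(), BadRequest> {
    if len > MAX_BATCH_SIZE {
        return Err(BadRequest {
            status: 400,
            message: format!(
                "batch contains {len} invocations, the maximum is {MAX_BATCH_SIZE}"
            ),
        });
    }
    Ok(())
}

// Client side: split a large selection into requests that fit the limit,
// so progress can be reported after each completed chunk.
pub fn split_into_batches(ids: &[String]) -> Vec<&[String]> {
    ids.chunks(MAX_BATCH_SIZE).collect()
}
```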

slinkydeveloper avatar Dec 11 '25 09:12 slinkydeveloper