Add support for bulk cancellations/kills/pause/unpause
In certain scenarios where one has accumulated many invocations that cannot complete, it would be very helpful to be able to bulk cancel/kill those invocations. The criterion could be the creation timestamp of the invocation, for example: cancel/kill all invocations older than 30 minutes. Ideally, this operation would not need to scan through all available invocations, because if the cluster is already stressed, these queries would add even more stress.
Other criteria we have heard from users:
- by service
- by service handler
- by object key
- by deployment id -> this is asked for in order to decommission an endpoint (in the context of #3144)
Some of those obviously require a broadcast to the whole cluster.
To retain a good amount of flexibility, including when we integrate this with the UI (something to figure out while we design this feature), it would be great to design the "bulk" operations in a more or less generic way:
- Given a filter (maybe even in SQL!)
- Given the operation I want to execute
- Execute it
Beware that not all operations are WAL commands: some require running an RPC handler first (see restart-as-new, pause, and resume), and others are handled entirely by an RPC handler (there is no example in the code yet, but pause will be like that).
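To make the "filter + operation + execute" idea concrete, here is a rough sketch in Rust. Everything in it is an assumption for illustration: `InvocationMeta`, `InvocationFilter`, `BulkOperation`, and `select_targets` are hypothetical names, not existing Restate types, and the in-memory scan deliberately ignores the question of how to avoid scanning all invocations on a stressed cluster.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical invocation metadata; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct InvocationMeta {
    pub id: String,
    pub service: String,
    pub handler: String,
    pub object_key: Option<String>,
    pub deployment_id: Option<String>,
    pub created_at: SystemTime,
}

/// Hypothetical filter covering the criteria listed in this issue.
#[derive(Debug, Clone)]
pub enum InvocationFilter {
    OlderThan(Duration),
    Service(String),
    ServiceHandler { service: String, handler: String },
    ObjectKey { service: String, key: String },
    DeploymentId(String),
}

impl InvocationFilter {
    /// Returns true if the invocation satisfies this filter at time `now`.
    pub fn matches(&self, inv: &InvocationMeta, now: SystemTime) -> bool {
        match self {
            InvocationFilter::OlderThan(age) => now
                .duration_since(inv.created_at)
                .map(|elapsed| elapsed > *age)
                .unwrap_or(false),
            InvocationFilter::Service(service) => inv.service == *service,
            InvocationFilter::ServiceHandler { service, handler } => {
                inv.service == *service && inv.handler == *handler
            }
            InvocationFilter::ObjectKey { service, key } => {
                inv.service == *service && inv.object_key.as_deref() == Some(key.as_str())
            }
            InvocationFilter::DeploymentId(id) => {
                inv.deployment_id.as_deref() == Some(id.as_str())
            }
        }
    }
}

/// The operation to apply to every matching invocation (subset, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BulkOperation {
    Cancel,
    Kill,
    Pause,
    Resume,
}

/// Resolves a filter to the list of invocation ids the operation would target.
pub fn select_targets(
    invocations: &[InvocationMeta],
    filter: &InvocationFilter,
    now: SystemTime,
) -> Vec<String> {
    invocations
        .iter()
        .filter(|inv| filter.matches(inv, now))
        .map(|inv| inv.id.clone())
        .collect()
}
```

Keeping the filter separate from `BulkOperation` means the same target selection could feed a WAL command or an RPC handler, depending on the operation.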
Bulk unpausing could also be a really helpful operation.
I would like to start this from the UI, and then infer from there the implementation details in the admin api. I opened an issue there.
We can certainly start with the UI, but it’s important to note that running an action against a query is quite different from running an action against multiple selected invocations (e.g., from tables). In each case, the target of the action should be absolutely clear.
We can begin with the multiple-invocations version, but that doesn’t necessarily mean the same UI or implementation can be reused for query-based actions.
Answered on the other thread.
https://github.com/restatedev/restate/pull/3943 makes the CLI command to kill/cancel/pause/etc. in batch much more usable. I can kill 2 million invocations easily: it takes roughly a minute, but it is a long operation and it gives you feedback on whether each invocation was killed or not, so I would say it's an OK tradeoff for now. There is an open issue for the UI side of things, to take a similar approach to the CLI one.
I would say let's finish those efforts and ship them. Right now I have the feeling they might be "good enough", without the need for something more sophisticated on the Admin API with filters and so on.
@tillrohrmann this is roughly the spec for the ad-hoc UI endpoint
pub struct BatchOperationRequest {
    pub invocations: Vec<InvocationId>,
}

pub struct BatchOperationResult {
    pub succeeded: Vec<InvocationId>,
    pub failed: Vec<InvocationId>,
}

// Mounted on /internal/invocations_batch_operations/kill
pub async fn batch_kill_invocations(
    #[request_body(required = true)] Json(payload): Json<BatchOperationRequest>,
) -> Result<BatchOperationResult, BatchInvocationOperationError> {
    // Behavior:
    // * Try to execute kill on all invocations, using the regular InvocationClient as done above, in parallel
    // * Don't fail fast on failures, just collect the failed and succeeded ones to display. Don't propagate failures for now (might be big!).
    //   If the user cares about eventual errors, they can re-run the operation
    // * If the admin api request disconnects, no problem, the batch operation can be canceled
    // * I guess this operation can fail if the admin api cannot reach the rest of the cluster, or none of the given invocations succeed, or smth like that 🤷
}

// TODO same format for all the other ops (cancel, kill, pause, resume, restart-as-new, purge, purge-journal)
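The "run in parallel, don't fail fast, collect succeeded and failed" behavior could look roughly like the sketch below. It is a minimal illustration under several assumptions: `InvocationId` is a plain `String` stand-in, `run_batch` is a hypothetical helper, the per-invocation operation is passed in as a closure instead of going through the real `InvocationClient`, and it uses one OS thread per invocation where the actual handler would presumably be async with bounded concurrency.

```rust
use std::thread;

/// Stand-in for the real invocation id type (assumption for this sketch).
pub type InvocationId = String;

#[derive(Debug, Default, PartialEq, Eq)]
pub struct BatchOperationResult {
    pub succeeded: Vec<InvocationId>,
    pub failed: Vec<InvocationId>,
}

/// Runs `op` on every invocation in parallel and collects the outcome of each one.
/// A failure (or panic) of a single invocation never aborts the rest of the batch.
pub fn run_batch<F>(invocations: Vec<InvocationId>, op: F) -> BatchOperationResult
where
    F: Fn(&InvocationId) -> Result<(), String> + Sync,
{
    // Scoped threads let us borrow `invocations` and `op` without cloning.
    let outcomes: Vec<bool> = thread::scope(|scope| {
        let handles: Vec<_> = invocations
            .iter()
            .map(|id| {
                let op = &op;
                scope.spawn(move || op(id).is_ok())
            })
            .collect();
        // Joining in spawn order keeps outcomes aligned with the input order.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap_or(false))
            .collect()
    });

    let mut result = BatchOperationResult::default();
    for (id, ok) in invocations.into_iter().zip(outcomes) {
        if ok {
            result.succeeded.push(id);
        } else {
            result.failed.push(id);
        }
    }
    result
}
```

Error details are intentionally dropped here, matching the "don't propagate failures for now" note: the caller only learns which ids failed and can re-run the operation for those.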
As discussed, let's build this only for the UI for now, and let's not use it in the CLI, so we can freely modify it later on.
While sorting this one out, it would be nice to understand the maximum size a batch can have. I remember that when playing with the CLI I came across a magic number, 1011, but didn't figure out where it came from... https://github.com/restatedev/restate/pull/3943/files#diff-40d5e045fe7c0daee65b84a01c8495ddc8a4eecec172e71edbc0c12f9df06383R28
It's ok to hardcode a limit here, and have the batch operation endpoint return 400 if the number of given invocations is too big. The UI will do its own batching anyway in order to show the progress bar.
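A hardcoded limit plus client-side batching might be sketched as follows. The value of `MAX_BATCH_SIZE` is a placeholder (the right number is still the open question above), and `BatchTooLarge`, `validate_batch_size`, and `split_into_batches` are hypothetical names. The client side is written in Rust here only to keep the example in one language; the UI would do the equivalent in its own code.

```rust
/// Placeholder limit; the actual maximum batch size is still to be determined.
pub const MAX_BATCH_SIZE: usize = 1000;

/// Error the endpoint could map to an HTTP 400 response.
#[derive(Debug, PartialEq, Eq)]
pub struct BatchTooLarge {
    pub status: u16,
    pub given: usize,
    pub max: usize,
}

/// Server side: reject requests that carry more invocations than the limit.
pub fn validate_batch_size(given: usize) -> Result<(), BatchTooLarge> {
    if given > MAX_BATCH_SIZE {
        Err(BatchTooLarge {
            status: 400,
            given,
            max: MAX_BATCH_SIZE,
        })
    } else {
        Ok(())
    }
}

/// Client side: split the selection into batches that each fit within the limit,
/// so progress can be reported after every completed batch.
pub fn split_into_batches<T: Clone>(ids: &[T]) -> Vec<Vec<T>> {
    ids.chunks(MAX_BATCH_SIZE)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

With this split, the progress bar can advance by one step per returned batch result, and a failed batch only needs that batch to be retried.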