
Cancellation of flow runs that correspond to deleted deployments

Open j-tr opened this issue 11 months ago • 2 comments

First check

  • [X] I added a descriptive title to this issue.
  • [X] I used the GitHub search to find a similar issue and didn't find it.
  • [X] I searched the Prefect documentation for this issue.
  • [X] I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

When a deployment is deleted while one of its flow runs is still running, that flow run can no longer be canceled: the cancellation attempt ends up in an endless loop on the worker. The flow run keeps running, and it also stays in "Cancelling" indefinitely, eating up resources on the worker and unnecessarily loading the Prefect API.

If a flow run is canceled after its deployment was deleted, the lookup of the flow run's configuration in cancel_run fails with ObjectNotFound and a warning is emitted. https://github.com/PrefectHQ/prefect/blob/747a51503e3ad65c500e0ac66ff52daf8dc0c0cf/src/prefect/workers/base.py#L635

kill_infrastructure is still called with the configuration parameter because the call sits in a separate try block. This raises an UnboundLocalError, as configuration was never assigned. https://github.com/PrefectHQ/prefect/blob/747a51503e3ad65c500e0ac66ff52daf8dc0c0cf/src/prefect/workers/base.py#L668

The UnboundLocalError is eventually caught, the flow run is removed from cancelling_flow_run_ids, and another warning is emitted, but the state of the flow run remains unchanged. https://github.com/PrefectHQ/prefect/blob/747a51503e3ad65c500e0ac66ff52daf8dc0c0cf/src/prefect/workers/base.py#L686

This causes an endless loop where the same flow run will be picked up for cancellation on the next iteration of the service loop, only to fail in exactly the same manner.
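
To make the failure mode concrete, here is a minimal standalone sketch of the pattern described above. It does not use Prefect itself: read_configuration, service_loop, the dict-based flow run, and the local ObjectNotFound class are hypothetical stand-ins, and only the two-try-block shape is taken from the linked code.

```python
class ObjectNotFound(Exception):
    """Stand-in for the error raised when the deployment no longer exists."""


def read_configuration(flow_run):
    # Hypothetical lookup: always fails, mimicking a deleted deployment.
    raise ObjectNotFound(f"no deployment for flow run {flow_run['id']}")


def kill_infrastructure(infrastructure_pid, configuration):
    # Never reached in this scenario; the call fails while evaluating its arguments.
    return None


def cancel_run(flow_run, cancelling_flow_run_ids, warnings):
    try:
        configuration = read_configuration(flow_run)
    except ObjectNotFound:
        warnings.append("configuration not found")

    # Separate try block: `configuration` was never assigned above.
    try:
        kill_infrastructure(flow_run["infrastructure_pid"], configuration)
    except Exception as exc:
        warnings.append(type(exc).__name__)
        # The run is released for another attempt, but its state is not changed.
        cancelling_flow_run_ids.discard(flow_run["id"])
    else:
        flow_run["state"] = "Cancelled"
        cancelling_flow_run_ids.discard(flow_run["id"])


def service_loop(flow_runs, iterations):
    cancelling_flow_run_ids = set()
    warnings = []
    for _ in range(iterations):
        for flow_run in flow_runs:
            if flow_run["state"] != "Cancelling":
                continue
            if flow_run["id"] in cancelling_flow_run_ids:
                continue
            cancelling_flow_run_ids.add(flow_run["id"])
            cancel_run(flow_run, cancelling_flow_run_ids, warnings)
    return warnings


flow_run = {"id": "run-1", "state": "Cancelling", "infrastructure_pid": "host:1234"}
warnings = service_loop([flow_run], iterations=3)
```

In this sketch every iteration emits the same two warnings and the flow run is still in "Cancelling" afterwards, which matches the behaviour seen in the worker logs.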

The same applies to flow runs that are in "Awaiting Retry" and are linked to a deleted deployment.

Apart from the flow run not being canceled, these flow runs in "Cancelling" pile up over time, even after the flow run itself has finished: due to the state transition rules, a finished flow run cannot move from "Cancelling" to "Failed" or "Completed". Because cancellation is retried for all of them on each iteration, this causes unnecessary load and traffic on the worker and makes the worker unstable.

As the state of a flow run cannot be changed in the UI if it is in the state "Cancelling", the only way to break out of this is by setting the state of the flow runs directly via the Prefect API or deleting the flow runs entirely.

Possible solutions:

  1. Set the flow run to a "Cancellation failed" state to avoid retrying the cancellation indefinitely. This would solve the infinite loop problem but still would not cancel the flow run properly.
  2. configuration in kill_infrastructure is only required for certain worker types. Passing configuration only if it is available would enable cancellation even if the deployment no longer exists, at least for the worker types that do not need it.
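
A rough sketch of how the two ideas could fit together, again as a standalone simulation rather than a patch against Prefect. The needs_configuration flag, the dict-based flow run, and the helper signatures are assumptions made for illustration, and "Cancellation failed" is the state name proposed above, not an existing Prefect state.

```python
from typing import Optional


class ObjectNotFound(Exception):
    """Stand-in for the error raised when the deployment has been deleted."""


def read_configuration(flow_run: dict) -> dict:
    # Hypothetical lookup: always fails, mimicking a deleted deployment.
    raise ObjectNotFound(flow_run["id"])


def kill_infrastructure(
    infrastructure_pid: str,
    configuration: Optional[dict],
    needs_configuration: bool,
) -> None:
    # Assumption: some worker types can kill by pid alone, others cannot.
    if needs_configuration and configuration is None:
        raise RuntimeError("configuration required to kill this infrastructure")


def cancel_run(flow_run: dict, needs_configuration: bool) -> str:
    configuration: Optional[dict] = None  # always bound, even if the lookup fails
    try:
        configuration = read_configuration(flow_run)
    except ObjectNotFound:
        pass  # solution 2: continue without configuration

    try:
        kill_infrastructure(
            flow_run["infrastructure_pid"], configuration, needs_configuration
        )
    except Exception:
        # Solution 1: leave "Cancelling" so the run is not retried forever.
        flow_run["state"] = "Cancellation failed"
    else:
        flow_run["state"] = "Cancelled"
    return flow_run["state"]


pid_only = cancel_run(
    {"id": "run-1", "state": "Cancelling", "infrastructure_pid": "host:1234"},
    needs_configuration=False,
)
config_required = cancel_run(
    {"id": "run-2", "state": "Cancelling", "infrastructure_pid": "job-42"},
    needs_configuration=True,
)
```

Either way the flow run leaves "Cancelling", so the service loop would not pick it up again on the next iteration.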

Reproduction

- Deploy flow
- Start flow
- Delete deployment
- Cancel flow
- See worker logs for warnings popping up on every service loop iteration

Error

No response

Versions

Version:             2.16.5
API version:         0.8.4
Python version:      3.10.8
Git commit:          6d0ad745
Built:               Thu, Mar 21, 2024 3:41 PM
OS/Arch:             linux/x86_64
Server type:         cloud

Additional context

No response

j-tr, Mar 22 '24 13:03