prefect
Cancellation of flow runs that correspond to deleted deployments
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar issue and didn't find it.
- [X] I searched the Prefect documentation for this issue.
- [X] I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
When deleting a deployment while a corresponding flow run is running, the flow run cannot be canceled anymore as the cancellation will result in an endless loop on the worker. This is not only problematic because the flow run will keep running but also because the flow run will stay in "Cancelling" indefinitely, eating up resources on the worker and unnecessarily loading the Prefect API.
If a flow run is canceled after the corresponding deployment was deleted, getting the configuration for a flow run that corresponds to the deleted deployment in `cancel_run` will fail with `ObjectNotFound` and a warning will be emitted.
https://github.com/PrefectHQ/prefect/blob/747a51503e3ad65c500e0ac66ff52daf8dc0c0cf/src/prefect/workers/base.py#L635
`kill_infrastructure` will still be called with the `configuration` parameter, as it is in a different `try` block. This causes an `UnboundLocalError` because `configuration` is not defined.
https://github.com/PrefectHQ/prefect/blob/747a51503e3ad65c500e0ac66ff52daf8dc0c0cf/src/prefect/workers/base.py#L668
The `UnboundLocalError` is eventually caught, the flow run is removed from `cancelling_flow_run_ids`, and another warning message is emitted, but the state of the flow run remains unchanged.
https://github.com/PrefectHQ/prefect/blob/747a51503e3ad65c500e0ac66ff52daf8dc0c0cf/src/prefect/workers/base.py#L686
This causes an endless loop where the same flow run will be picked up for cancellation on the next iteration of the service loop, only to fail in exactly the same manner.
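The control flow described above can be reproduced without Prefect. The following is a minimal sketch, not the actual worker code: `get_configuration`, `service_loop`, and the local `ObjectNotFound` class are hypothetical stand-ins, and only the shape of the two separate `try` blocks is taken from the description.

```python
import logging

logger = logging.getLogger("worker-sketch")


class ObjectNotFound(Exception):
    """Stand-in for the API error raised when the deployment is gone."""


def get_configuration(flow_run_id: str) -> dict:
    # Hypothetical helper: the deployment was deleted, so the lookup fails.
    raise ObjectNotFound(f"no deployment for flow run {flow_run_id}")


def kill_infrastructure(infrastructure_pid: str, configuration: dict) -> None:
    # Never reached in this sketch; evaluating the argument fails first.
    pass


def cancel_run(flow_run_id: str, cancelling_flow_run_ids: set) -> str:
    # First try block: the failure is only logged, nothing is assigned.
    try:
        configuration = get_configuration(flow_run_id)
    except ObjectNotFound:
        logger.warning("Could not load configuration for %s", flow_run_id)

    # Second try block: `configuration` was never bound, so reading it
    # raises UnboundLocalError, which the broad handler swallows.
    try:
        kill_infrastructure("some-pid", configuration=configuration)
    except Exception as exc:
        logger.warning("Cancellation of %s failed: %r", flow_run_id, exc)
        cancelling_flow_run_ids.remove(flow_run_id)
        return type(exc).__name__
    return "cancelled"


def service_loop(iterations: int) -> list:
    # The state is never updated, so the run is picked up on every pass.
    flow_run_states = {"run-1": "Cancelling"}
    cancelling_flow_run_ids = set()
    outcomes = []
    for _ in range(iterations):
        for run_id, state in flow_run_states.items():
            if state == "Cancelling" and run_id not in cancelling_flow_run_ids:
                cancelling_flow_run_ids.add(run_id)
                outcomes.append(cancel_run(run_id, cancelling_flow_run_ids))
    return outcomes
```

Calling `service_loop(3)` attempts the cancellation three times and fails identically each time, because nothing in the failure path changes the state of the run.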
The same applies to flow runs that are in "Awaiting Retry" and are linked to a deleted deployment.
Apart from the flow run not being canceled, these flow runs in "Cancelling" pile up over time, even after the flow run itself has finished, because state transition rules prevent a run in "Cancelling" from moving to "Failed" or "Completed" when done. Since cancellation is retried for all of them on each iteration, this causes unnecessary load and traffic on the worker and makes the worker unstable.
As the state of a flow run cannot be changed in the UI if it is in the state "Cancelling", the only way to break out of this is by setting the state of the flow runs directly via the Prefect API or deleting the flow runs entirely.
Possible solutions:
- Set the flow run to a "Cancellation failed" state to avoid retrying the cancellation indefinitely. This would solve the infinite loop problem but would still not cancel the flow run properly.
- `configuration` in `kill_infrastructure` is only required for certain worker types. Passing `configuration` only if available would enable cancellation even if no deployment is available anymore, at least for certain worker types.
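Both ideas could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed patch: `load_configuration`, the `needs_configuration` flag, and the string return values are hypothetical, and "Cancellation failed" is the state name suggested above, not an existing Prefect state.

```python
from typing import Optional


class ObjectNotFound(Exception):
    """Stand-in for the API error raised when the deployment is gone."""


def load_configuration(deployment_exists: bool) -> dict:
    # Hypothetical lookup that fails once the deployment was deleted.
    if not deployment_exists:
        raise ObjectNotFound("deployment deleted")
    return {"namespace": "default"}


def kill_infrastructure(infrastructure_pid: str,
                        configuration: Optional[dict]) -> None:
    # Placeholder for the worker-specific teardown.
    pass


def cancel_run(infrastructure_pid: str,
               deployment_exists: bool,
               needs_configuration: bool) -> str:
    # Bind the name up front so it is always defined.
    configuration: Optional[dict] = None
    try:
        configuration = load_configuration(deployment_exists)
    except ObjectNotFound:
        pass  # a real worker would log a warning here

    if configuration is None and needs_configuration:
        # Assumed terminal marker: stops the retry loop, but the
        # infrastructure is not actually torn down.
        return "Cancellation failed"

    # Worker types that do not need the configuration can still cancel.
    kill_infrastructure(infrastructure_pid, configuration)
    return "Cancelled"
```

With this shape, a worker type that does not need the configuration returns `"Cancelled"` even when the deployment is gone, while one that does need it ends in the assumed terminal state instead of retrying forever.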
Reproduction
- Deploy flow
- Start flow
- Delete deployment
- Cancel flow
- See worker logs for warnings popping up on every service loop iteration
Error
No response
Versions
Version: 2.16.5
API version: 0.8.4
Python version: 3.10.8
Git commit: 6d0ad745
Built: Thu, Mar 21, 2024 3:41 PM
OS/Arch: linux/x86_64
Server type: cloud
Additional context
No response