
Death of st2actionrunner process causes action to remain running forever

Open Rudd-O opened this issue 5 years ago • 14 comments

SUMMARY

Using StackStorm 3.0.1, if something kills an st2actionrunner process supervising a python-script action runner, and that action execution is part of a workflow execution, the action execution remains in the running state forever, regardless of the parameters: timeout setting in the workflow.

What I'd like to see is the action being rescheduled to another st2actionrunner, or at the very least timed out, so that a retry in the workflow can deal with the problem.

(It is also not clear how StackStorm deals with the death of an st2actionrunner supervising an orquesta action runner.)

This is not an HA setup, but nothing in the code or documentation leads me to believe that the expected behavior is to just hang a workflow execution when the underlying action runner supervisor process is gone. Imagine a machine in an HA setup crashing while ongoing workflows are executing actions on it: all workflows whose actions were running there just hang, never even timing out.

We expect to be able to run StackStorm for weeks on end, with long-running workflows that survive the death or reboot of a machine that is part of the StackStorm cluster.

OS / ENVIRONMENT / INSTALL METHOD

Standard non-HA recommended setup on Ubuntu 16.04.

STEPS TO REPRODUCE

  1. Create a workflow with one Python action that runs sleep 60 via subprocess (a minimal sketch of such an action is shown below).
  2. Start the workflow with st2 run.
  3. Kill the st2actionrunner process supervising the Python action.
  4. Wait forever.
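
For reference, a minimal sketch of the kind of python-script action used here (file, class, and parameter names are illustrative, not the exact ones from our pack):

    # sleep_forever.py -- python-script runner action that blocks in a subprocess
    import subprocess

    from st2common.runners.base_action import Action


    class SleepAction(Action):
        def run(self, seconds=60):
            # Block the supervising st2actionrunner worker for the duration.
            subprocess.check_call(["sleep", str(seconds)])
            return {"slept": seconds}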

Rudd-O avatar Jun 21 '19 00:06 Rudd-O

If the action runner process dies unexpectedly while an action execution is still executing, the action execution will be stuck in a running state because the action runner process didn't get a chance to update the database. We've recently added service discovery capability to the action runner. We will be adding garbage collection shortly to clean up these orphaned action executions and set them to something like an abandoned status, which will trigger the workflow execution to fail. When implemented, the service discovery feature will require the user to configure a coordination backend, such as a Redis server, to work alongside StackStorm.
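
Configuring the coordination backend is a matter of pointing st2.conf at it, along these lines (hostname and port are placeholders; check the HA docs for the exact URL format for your backend):

    [coordination]
    url = redis://<redis-host>:6379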

m4dcoder avatar Jun 21 '19 07:06 m4dcoder

That sounds like a plan. Thanks.

Meanwhile, how can I get abandoned actions to restart? That is crucial because most of our workflows run for a month or more, so if a box gets slammed, we see our workflows either fail or become stuck. It would be okay in our case to restart the specific failed action from the top, because our actions are all idempotent.

Rudd-O avatar Jun 24 '19 13:06 Rudd-O

For the orquesta workflow engine, rerunning/restarting a workflow from a failed task is not supported yet. It is currently WIP and planned for a future release.

m4dcoder avatar Jun 24 '19 19:06 m4dcoder

Triggering the workflow to fail after one of the child tasks is abandoned seems like a sane default, but in many cases I'd like to be given the option to "retry" the abandoned task since many of our actionrunner failures are due to transient issues. Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this.

trstruth avatar Jun 24 '19 19:06 trstruth

Rerunning/rehydrating a workflow from a given state would be essentially equivalent to this.

This would be adequate for our use cases. Otherwise Orquesta basically makes it impossible to put the machines running the workflow engine in maintenance mode.

We would prefer the workflow's complete state (including published variables and current threads) be captured in a persistent manner within the database, such that the workflow can restart if the workflow engine is moved to a different box. This would be essentially what Jenkins does w.r.t. pipelines when the master restarts -- it persists the state of the pipelines, then when it reconnects to slaves, it catches up with what the slaves were doing.

Rudd-O avatar Jun 26 '19 04:06 Rudd-O

I think there are different things being communicated here. As I understand it:

  1. There is the case where the action runner dies while executing an action for a task, which leads to the task and workflow being stuck in the running state. This is the original issue here.
  2. You want to be able to rerun the task when the task and workflow execution failed as a result of 1.
  3. You want to be able to pause the workflow execution, bring up another workflow engine, and resume the execution on the new server.

Item 3 already works today. You can pause the workflow execution; the state of the execution is saved to MongoDB. Then you can bring up a new workflow engine using the same st2.conf and shut down the old workflow engine. Resume the workflow execution and the new workflow engine will pick up where it left off. If you are running different versions of st2, be careful that there are no breaking changes between versions.
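
In CLI terms the handover looks roughly like this (the execution ID is a placeholder):

    st2 execution pause <execution-id>
    # bring up the new workflow engine with the same st2.conf, then stop the old one
    st2 execution resume <execution-id>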

For item 1, per the solution described above, we plan to implement garbage collection that will abandon action executions where the action runner hosting them has died and the execution is stuck in running.

For item 2, we have a WIP feature to rerun a workflow execution from one or more failed tasks. We will make sure this supports item 1 where the action execution is abandoned.

m4dcoder avatar Jun 27 '19 04:06 m4dcoder

Awesome. Now if the Mistral language supported retries it'd be so awesome. Ultimately the big concern is that sometimes stopping st2 components causes actions to get stuck in the running state and never end / become unpauseable or uncancellable. I hope this gets ironed out.


Rudd-O avatar Jul 25 '19 01:07 Rudd-O

@m4dcoder can you explain the recently added GC process when a workflow task is stuck? i.e. if an actionrunner machine reboots.

I'm reading the code and it appears that if a task gets stuck, the GC will kill the whole workflow execution. If this is the case, I don't think that is desired behavior. I think the stuck task should fail, but the workflow should be able to handle the failure, with a retry or other workflow path.

johnarnold avatar Aug 09 '19 01:08 johnarnold

Also, I think the action runner and workflow engine need to support a "warm shutdown" on a TERM signal to the process. The idea is that they should finish their work before they exit, minimizing orphaned actions and lost workflow state.

For the workflow engine, this may mean transitioning to pausing/paused before shutting down the process.

For the action runner, this may mean that it stops accepting any new work and completes its current running work before exiting (subject to a hard timeout value).

We use this type of behavior for Celery workers today. See: http://docs.celeryproject.org/en/master/userguide/workers.html#stopping-the-worker

johnarnold avatar Aug 09 '19 02:08 johnarnold

+1. In a Kubernetes environment, all services die, respawn, and get rescheduled to other nodes on a regular basis.

It's a real-world requirement that StackStorm handle these cases as a normal situation, especially when thinking about it in an HA context.

arm4b avatar Aug 09 '19 13:08 arm4b

Currently, GC will cancel a workflow execution if it has been idle past the max idle time without any activity (active here meaning a task execution is still executing). The current GC does not cover the use case where the action runner dies/reboots in the middle of executing an action. When this happens, the action execution will be stuck in a running state and so will the corresponding task execution record, and this will not trigger the current GC to clean up the workflow execution. Note that this GC functionality is disabled by default in v3.1.
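
If I recall correctly, the knob for this idle-based GC is gc_max_idle_sec under the workflow_engine section of st2.conf (treat the exact option name as an assumption and check st2.conf.sample for your version), along these lines:

    [workflow_engine]
    # Seconds a workflow execution may sit idle before the garbage collector
    # cancels it as orphaned; 0 (the default) leaves the feature disabled.
    gc_max_idle_sec = 3600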

per the solution described above, we plan to implement garbage collection that will abandon action executions where the action runner hosting them has died and the execution is stuck in running.

Per the solution here for this issue, which we haven't implemented yet, when GC abandons the action execution, it has the same effect of failing the action execution and the task execution, which will trigger whatever cleanup is defined in the workflow definition.

m4dcoder avatar Aug 09 '19 19:08 m4dcoder

@m4dcoder ok, is anyone working on GC for the action execution / actionrunner restart scenario?

johnarnold avatar Aug 09 '19 19:08 johnarnold

This is not currently prioritized for the next v3.2 release, and we have already started on v3.2. If this is something the community needs, st2 is open source and we welcome contributions. We will dedicate time to help with and review code contributions.

m4dcoder avatar Aug 09 '19 20:08 m4dcoder

Hi there - checking whether this is still on the road map for any upcoming release? This requirement has real significance in the stackstorm-ha world, especially since nodes/pods get killed/restarted far more often in the k8s world than in the traditional deployment model.

anrajme avatar Jun 20 '22 12:06 anrajme

Yeah. Kubernetes rollouts of new packs (using the st2packs sidecar containers built for the purpose) restart action runners, which means the restarted action runners usually leave actions behind, "running" as ghosts, which obviously torpedoes our long-running workflows. The garbage collector does not collect tasks in the "running" state by default either. And tasks whose executors have gone AWOL simply cannot be canceled from the UI (it directs the user to look at the developer console; see the screenshot below).

To add to that complication, a default retry behavior for when a task is abandoned is still not implemented in Orquesta for transient failures of this type, so we end up having to code retries on every task in each workflow, which is relatively easy for us workflow developers to get wrong.
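
For context, this is the kind of per-task retry block we have to sprinkle on every task today (task and action names are illustrative):

    tasks:
      provision_node:
        action: my_pack.provision_node
        retry:
          when: <% failed() %>
          count: 3
          delay: 30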

These deficiencies make StackStorm usage in a modern production environment a very difficult pitch. Truly great in theory -- in practice very painful to deploy and maintain.

(screenshot: the UI directing the user to look at the developer console when the stuck execution cannot be canceled)

DFINITYManu avatar Jun 05 '23 13:06 DFINITYManu

Hi, are there any updates on this ticket? My team is trying to deploy StackStorm HA but we are running into this issue, which isn't acceptable for our use case. :(

bell-manz avatar Jul 19 '23 20:07 bell-manz

If a kill signal is sent to an actionrunner, it should wait until the action finishes, provided you have graceful shutdown on. Do you have graceful_shutdown enabled in the config? There is also an exit timeout and a sleep delay setting (see the code).

guzzijones avatar Jul 19 '23 22:07 guzzijones

Looks like you also have to increase the terminationGracePeriodSeconds in your chart. The default is 30 seconds.

guzzijones avatar Jul 19 '23 22:07 guzzijones

Looks like there are similar settings for the workflowengine. Again, you will also have to set the terminationGracePeriodSeconds in your chart to a sane value.

guzzijones avatar Jul 19 '23 23:07 guzzijones

Make sure that your timeouts are all set correctly relative to your action timeouts.

Most of our actions time out after 10 minutes, so we set the following to at least allow the action timeouts to trigger before the graceful shutdowns:

  1. Action timeouts: 600 seconds for most of our actions. We have a couple set to 900, but we want to cover most.
  2. Actionrunner graceful-shutdown settings, in the config section of values.yaml. Set the exit check to a bit longer than the action timeout:

     [actionrunner]
     graceful_shutdown = True
     exit_still_active_check = 610
     still_active_check_interval = 10

  3. terminationGracePeriodSeconds in values.yaml for the action runner:

     st2actionrunner:
       terminationGracePeriodSeconds: 630

With these settings we are hoping that, in the worst case, actions actually get abandoned, because the actionrunner shutdown method will have enough time to abandon them before the k8s pod termination timeout hits 20 seconds later.

Also we set the st2workflowengine timeouts:

  st2workflowengine:
    terminationGracePeriodSeconds: 630

Also, when you run helm upgrade be sure to extend the timeout using --timeout 20m
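
For example (the release name here is a placeholder, and the chart reference assumes the usual stackstorm-ha chart repo):

    helm upgrade <release-name> stackstorm/stackstorm-ha --timeout 20m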

guzzijones avatar Oct 31 '23 15:10 guzzijones

2 more notes:

  1. Do NOT put inline comments in your config file.
  2. You must enable service_registry = True under [coordination] in the config for graceful shutdown to wait for actions to finish (see the snippet below).
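
A minimal sketch of that config fragment, assuming the same Redis coordination backend mentioned earlier in the thread (hostname is a placeholder):

    [coordination]
    service_registry = True
    url = redis://<redis-host>:6379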

guzzijones avatar Dec 06 '23 19:12 guzzijones

Why the sensible options being discussed here are not defaults is a mystery to me.

Rudd-O avatar Dec 20 '23 10:12 Rudd-O