
Issues while scaling down the nodes

Open anrajme opened this issue 3 years ago • 2 comments

Hi there -

We had a few issues lately when the underlying K8s nodes scaled down. During this event, pods are evicted (killed and recreated on another node), which is expected. However, stackstorm-ha reported a few problems. Initially it was with the stateful set, where RabbitMQ node failures caused executions to be stuck in a "scheduled" status forever. I'm trying to get rid of this trouble by shifting the RabbitMQ service to a managed cloud provider.
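For reference, this is how I was finding and clearing the stuck executions from the st2 CLI (a sketch; the execution id below is just an example, substitute one from your own listing):

```shell
# List executions that never left the "scheduled" state.
st2 execution list --status scheduled

# Cancel a stuck execution by id so it is no longer pending forever.
# (example id, replace with one from the listing above)
st2 execution cancel 62b019ba420e073fb8f432c3
```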

Now, the recent problem is with st2actionrunner, where the pod gets evicted while executing a workflow. The execution was marked as "abandoned" and the workflow failed.

# st2 execution get 62b019ba420e073fb8f432c3
id: 62b019ba420e073fb8f432c3
action.ref: jira.update_field_value
context.user: xxxxx
parameters:
  field: customfield_14297
  issue_key: xx-96233
  value: Closing Jira 
status: abandoned
start_timestamp: Mon, 20 Jun 2022 06:54:50 UTC
end_timestamp:
log:
  - status: requested
    timestamp: '2022-06-20T06:54:50.171000Z'
  - status: scheduled
    timestamp: '2022-06-20T06:54:50.348000Z'
  - status: running
    timestamp: '2022-06-20T06:54:50.408000Z'
  - status: abandoned
    timestamp: '2022-06-20T06:54:50.535000Z'
result: None

In this case, we still had another four healthy actionrunners running when the one executing the workflow failed.
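As a workaround until the healthy runners can pick such work up, the abandoned execution can be re-submitted manually with the standard st2 CLI (the id below is the one from the output above; note this starts a fresh execution rather than resuming the old one mid-workflow):

```shell
# Re-run the abandoned execution with the same action and parameters.
st2 execution re-run 62b019ba420e073fb8f432c3
```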

Wondering whether this is expected behaviour and acceptable for the stackstorm-ha architecture?

cheers!

anrajme avatar Jun 20 '22 07:06 anrajme

Somewhat similar: https://github.com/StackStorm/st2/issues/4716. It's an issue with the StackStorm engine itself handling the sudden stop of actionrunners that were running tasks in the workflow.

arm4b avatar Jun 20 '22 11:06 arm4b

Thanks @armab. I have commented on the original issue https://github.com/StackStorm/st2/issues/4716. Looks like this is going to be a game-changing requirement, especially in a k8s HA environment where node/pod kills and restarts are comparatively more frequent than in the traditional deployment model.
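One partial mitigation on the Kubernetes side is a PodDisruptionBudget, so the autoscaler's node drain can't evict all actionrunners at once (a sketch; the `app: st2actionrunner` label selector is an assumption and should be matched to the labels your chart release actually sets):

```shell
# Sketch: limit voluntary evictions of actionrunner pods during node drains.
# The label selector below is an assumption; verify with:
#   kubectl get pods --show-labels
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: st2actionrunner-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: st2actionrunner
EOF
```

Note this only limits how many runners are drained simultaneously; it does not stop the in-flight execution on an evicted pod from being abandoned, so the upstream fix is still needed.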

anrajme avatar Jun 20 '22 13:06 anrajme

Closing as a duplicate of StackStorm/st2#4716

cognifloyd avatar Jan 28 '23 04:01 cognifloyd