
Orquesta: Potential Race Condition in Nested Inquiries

Open trstruth opened this issue 5 years ago • 5 comments

SUMMARY

There appears to be a race condition in the way orquesta handles unpausing of a parent workflow when a child workflow has an inquiry which receives a response. After responding to the inquiry in the child workflow, the parent workflow can occasionally get "stuck" in a paused or running state.

STACKSTORM VERSION

st2 3.1.0, on Python 2.7.6

OS, environment, install method

clean st2-docker installation

Steps to reproduce the problem

For a minimal reproduction we need two workflows, since there must be a parent-child relationship. The parent workflow has a single task, whose action is the child workflow. The child workflow also has a single task, which is an inquiry. The task transition logic in the child workflow is such that if a false response is sent to the inquiry, the child workflow fails, causing the parent workflow to succeed. This is the expected behavior. This table shows the states of the parent and child workflows when the expected behavior is exhibited:

| Parent    | Child     | Child Inquiry         |
|-----------|-----------|-----------------------|
| requested |           |                       |
| scheduled |           |                       |
| running   |           |                       |
|           | requested |                       |
|           | scheduled |                       |
|           | running   |                       |
|           |           | requested             |
|           |           | scheduled             |
|           |           | running               |
|           |           | pending               |
|           | pausing   |                       |
|           | paused    |                       |
| pausing   |           |                       |
| paused    |           |                       |
|           |           | INQUIRY RESPONSE HERE |
|           |           | succeeded             |
|           | resuming  |                       |
|           | running   |                       |
|           | succeeded |                       |
| resuming  |           |                       |
| running   |           |                       |
| succeeded |           |                       |

However, if the inquiry response happens before the parent workflow can enter a paused state, it can cause unexpected behavior. Consider the following:

| Parent    | Child     | Child Inquiry         |
|-----------|-----------|-----------------------|
| requested |           |                       |
| scheduled |           |                       |
| running   |           |                       |
|           | requested |                       |
|           | scheduled |                       |
|           | running   |                       |
|           |           | requested             |
|           |           | scheduled             |
|           |           | running               |
|           |           | pending               |
|           |           | INQUIRY RESPONSE HERE |
|           |           | succeeded             |
|           | pausing   |                       |
|           | paused    |                       |
| pausing   |           |                       |
| paused    |           |                       |
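For concreteness, the two workflows described above might look roughly like this. This is a hypothetical sketch: the action refs `bug.pause-parent` and `bug.pause-child` come from the transcripts below, but the transition conditions and schema are my guesses; the actual definitions live in the linked repro pack.

```yaml
---
# pause-parent (sketch): single task that runs the child workflow.
version: 1.0
tasks:
  start:
    action: bug.pause-child
    next:
      # The child failing (on a "false" inquiry response) is the
      # expected path, so the parent remediates it and succeeds.
      - when: <% failed() %>
        do: noop
---
# pause-child (sketch): single inquiry task; a "false" answer fails it.
version: 1.0
tasks:
  wait:
    action: core.ask
    input:
      schema:
        type: object
        properties:
          custom_inner_pause:
            type: boolean
            required: true
    next:
      - when: <% not result().response.custom_inner_pause %>
        do: fail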

Here is a detailed set of steps you can follow to reproduce the issue:

  1. Follow the instructions in the st2-docker repository to create a clean deployment of StackStorm.
  2. Install my repro pack into the deployment (probably easiest to clone the repo into the packs.dev directory).
  3. Register the newly installed pack.
  4. First, run the workflow to see the expected behavior (the behavior shown in the first table above). On the StackStorm command line:
```
root@e2c8427e5731:~# st2 run bug.pause-parent
.
id: 5e4493ba0cdfc002b6689669
action.ref: bug.pause-parent
parameters: None
status: paused
start_timestamp: Thu, 13 Feb 2020 00:09:30 UTC
end_timestamp:
result:
  output: null
+-----------------------------+---------+-------+-----------------+-------------------------------+
| id                          | status  | task  | action          | start_timestamp               |
+-----------------------------+---------+-------+-----------------+-------------------------------+
| + 5e4493ba0cdfc0003a644e0d  | paused  | start | bug.pause-child | Thu, 13 Feb 2020 00:09:30 UTC |
|    5e4493bb0cdfc0003a644e10 | pending | wait  | core.ask        | Thu, 13 Feb 2020 00:09:31 UTC |
+-----------------------------+---------+-------+-----------------+-------------------------------+
root@e2c8427e5731:~# st2 inquiry respond 5e4493bb0cdfc0003a644e10
custom_inner_pause (boolean): False

Response accepted for inquiry 5e4493bb0cdfc0003a644e10.
root@e2c8427e5731:~# st2 execution get 5e4493ba0cdfc002b6689669
id: 5e4493ba0cdfc002b6689669
action.ref: bug.pause-parent
parameters: None
status: succeeded (14s elapsed)
start_timestamp: Thu, 13 Feb 2020 00:09:30 UTC
end_timestamp: Thu, 13 Feb 2020 00:09:44 UTC
result:
  output: null
+-----------------------------+-------------------------+-------+-----------------+-------------------------------+
| id                          | status                  | task  | action          | start_timestamp               |
+-----------------------------+-------------------------+-------+-----------------+-------------------------------+
| + 5e4493ba0cdfc0003a644e0d  | succeeded (14s elapsed) | start | bug.pause-child | Thu, 13 Feb 2020 00:09:30 UTC |
|    5e4493bb0cdfc0003a644e10 | succeeded (13s elapsed) | wait  | core.ask        | Thu, 13 Feb 2020 00:09:31 UTC |
+-----------------------------+-------------------------+-------+-----------------+-------------------------------+
```

As you can see, the workflow succeeded. This is the desired behavior.

  5. Now run a script that spam-replies to inquiries, to exhibit the race condition described above (the behavior shown in the second table above). The script is in the bin directory of the bug pack. Run it in the background. (Note that it will attempt to respond to all inquiries, so do not run it in a StackStorm environment where you have inquiries you want to remain paused.)

```
root@e2c8427e5731:/opt/stackstorm/packs.dev/bug/bin# ./run.sh
./run.sh: line 3: ./autoresponder.py: No such file or directory
/opt/stackstorm/st2/local/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible.
  utils.PersistentlyDeprecated2018,
```

This will hang, waiting for inquiries it can respond to.

  6. In a new window, run the repro workflow again. The script should instantly respond to the new inquiry, and the repro workflow will be left in a paused state instead of succeeded:

```
root@e2c8427e5731:/# st2 run bug.pause-parent
..
id: 5e4496b00cdfc002b668966c
action.ref: bug.pause-parent
parameters: None
status: paused
start_timestamp: Thu, 13 Feb 2020 00:22:08 UTC
end_timestamp:
result:
  output: null
+-----------------------------+------------------------+-------+-----------------+-------------------------------+
| id                          | status                 | task  | action          | start_timestamp               |
+-----------------------------+------------------------+-------+-----------------+-------------------------------+
| + 5e4496b00cdfc0003a644e13  | paused                 | start | bug.pause-child | Thu, 13 Feb 2020 00:22:08 UTC |
|    5e4496b10cdfc0003a644e16 | succeeded (0s elapsed) | wait  | core.ask        | Thu, 13 Feb 2020 00:22:09 UTC |
+-----------------------------+------------------------+-------+-----------------+-------------------------------+
```
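For reference, the spam-responder used in step 5 could be sketched along these lines. This is a hypothetical reconstruction (the pack's actual autoresponder.py and run.sh may differ); it shells out to the st2 CLI, and the "false" response payload matches the `custom_inner_pause` field shown above:

```python
"""Hypothetical sketch of an inquiry auto-responder (the real
autoresponder.py in the repro pack may differ)."""
import json
import subprocess


def extract_inquiry_ids(json_text):
    """Pull inquiry IDs out of `st2 inquiry list -j` JSON output."""
    return [item["id"] for item in json.loads(json_text)]


def respond_all(answer='{"custom_inner_pause": false}'):
    """Answer every currently-pending inquiry with the given response.

    WARNING: this responds to *every* open inquiry in the deployment.
    """
    out = subprocess.check_output(["st2", "inquiry", "list", "-j"])
    for inquiry_id in extract_inquiry_ids(out):
        subprocess.call(["st2", "inquiry", "respond", "-r", answer, inquiry_id])


# Example: run.sh would presumably loop something like
#   while True: respond_all(); time.sleep(0.1)
```

Spinning `respond_all()` in a tight loop is what makes the response land in the narrow window between the inquiry going `pending` and the parent reaching `paused`.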

Expected Results

The workflow should always succeed, no matter the order in which the inquiry receives a response.

Actual Results

The workflow gets "stuck" in a paused state. I recognize that this appears to be a contrived example, but as the complexity of the workflows/nesting grows, this issue becomes more and more prevalent. We see behavior like this nearly daily in some of our production workflows. We have built hacky workarounds to deal with this behavior, but I believe it needs to be root-caused. This is a minimal repro, but I hope it's illustrative.

Please let me know if there are any issues with the workflows/scripts, or if any further clarification is needed.

trstruth avatar Feb 13 '20 00:02 trstruth

By following the steps outlined above and watching the workflowengine logs, I've extracted snippets which I think illustrate the race condition described above. I've removed the timestamps for readability; links to the full logs of both runs are at the bottom. Comments denoted by # are my own.

The workflow behaving as expected (no instant inquiry response script):

```
INFO workflows [-] [5e4b1b0da72c7d00306e5709] No tasks identified to execute next.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "scheduled" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "running" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "pending" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Handling pending of action execution "5e4b1b0ea72c7d00306e570c" for task "wait", route "0".
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Publish task "wait", route "0", with status "pending" to conductor.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Action execution "5e4b1b0da72c7d00306e5709" for task "start" is updated and in "paused" state.
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Handling pause of action execution "5e4b1b0da72c7d00306e5709" for task "start", route "0".
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Publish task "start", route "0", with status "paused" to conductor.
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Updating workflow execution from status "running" to "paused".    # start pauses before inquiry gets a response
INFO workflows [-] [5e4b1b0ca72c7d014d981804] Action execution "5e4b1b0da72c7d00306e5709" for task "start" is updated and in "resuming" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Action execution "5e4b1b0ea72c7d00306e570c" for task "wait" is updated and in "succeeded" state.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Processing request for workflow execution.
INFO workflows [-] [5e4b1b0da72c7d00306e5709] Identifying next set (0) of tasks for workflow execution in status "resuming".
```

The bugged behavior (inquiry response landing before parent workflow can unpause):

```
INFO workflows [-] [5e4b1befa72c7d00306e570f] No tasks identified to execute next.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "scheduled" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "running" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "pending" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Handling pending of action execution "5e4b1bf0a72c7d00306e5712" for task "wait", route "0".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Publish task "wait", route "0", with status "pending" to conductor.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Action execution "5e4b1bf0a72c7d00306e5712" for task "wait" is updated and in "succeeded" state.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Handling completion of action execution "5e4b1bf0a72c7d00306e5712" in status "succeeded" for task "wait", route "0".    # wait succeeds before task start goes into paused state
INFO workflows [-] [5e4b1beea72c7d014d981807] Action execution "5e4b1befa72c7d00306e570f" for task "start" is updated and in "paused" state.
INFO workflows [-] [5e4b1beea72c7d014d981807] Handling pause of action execution "5e4b1befa72c7d00306e570f" for task "start", route "0".
INFO workflows [-] [5e4b1beea72c7d014d981807] Publish task "start", route "0", with status "paused" to conductor.
INFO workflows [-] [5e4b1beea72c7d014d981807] Updating workflow execution from status "running" to "paused".
INFO workflows [-] [5e4b1befa72c7d00306e570f] Publish task "wait", route "0", with status "succeeded" to conductor.
INFO workflows [-] [5e4b1befa72c7d00306e570f] Identifying next set (0) of tasks after completion of task "wait", route "0".
INFO workflows [-] [5e4b1befa72c7d00306e570f] No tasks identified to execute next.
```

@m4dcoder these logs seem consistent with my theory of the existence of a race. Could a potential solution be to "lock" a workflow and its children until the pausing/unpausing states propagate completely?
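To make the locking idea concrete, here is a toy model (illustrative Python, not orquesta code; all names are my own) of what serializing status updates would buy: if both events are applied under a lock and completion is treated as terminal, a late "child paused" event can no longer strand the parent.

```python
import threading

# Toy status model; these statuses are treated as terminal.
TERMINAL = {"succeeded", "failed", "canceled"}


class ParentState:
    """Illustrative stand-in for the parent workflow execution's status."""

    def __init__(self):
        self.status = "running"
        self._lock = threading.Lock()

    def on_child_paused(self):
        # Pause propagation: only applies if nothing terminal happened first.
        with self._lock:
            if self.status not in TERMINAL:
                self.status = "paused"

    def on_child_completed(self):
        # Completion wins, regardless of an in-flight pause.
        with self._lock:
            self.status = "succeeded"
```

With the updates serialized this way, either interleaving (pause-then-complete, or the racy complete-then-pause) leaves the parent "succeeded" rather than stuck in "paused".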

Complete logs:

- correct: https://gist.github.com/trstruth/25647306f41ad5bb7893c6a851cf8325
- incorrect: https://gist.github.com/trstruth/aacce99d7314658b12fa2e7c5ae5a8c1

trstruth avatar Feb 17 '20 23:02 trstruth

Hi, @trstruth
I have a problem similar to yours. Can you describe your solution?

i-cooltea avatar Aug 05 '20 02:08 i-cooltea

@i-cooltea I don't know of an existing solution to this problem, maybe @m4dcoder or @armab could weigh in?

trstruth avatar Aug 05 '20 07:08 trstruth

@trstruth

I mean what you said: "We have built hacky workarounds to deal with this behavior".

i-cooltea avatar Aug 05 '20 07:08 i-cooltea

We found a workaround.

We resolved the problem by checking the parent workflow of the inquiry, waiting until it reaches a stable state, and only then responding to the inquiry.

We get the parent ID of the inquiry by fetching the inquiry's execution with the https://api.stackstorm.com/api/v1/executions/#/action_executions_controller.get_one API call and taking the "parent" attribute. We then poll the parent execution's "status" attribute via the same API call, waiting until it reaches a stable state before proceeding to respond to the inquiry.
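That polling workaround could be sketched like this. The base URL and API key are hypothetical placeholders, and the set of "stable" statuses is my assumption (for this bug, the state being waited for is `paused`); the `parent` and `status` attributes come from the executions API described above.

```python
"""Sketch of the workaround: before responding to an inquiry, resolve its
parent execution and block until the parent reaches a stable state."""
import json
import time
from urllib.request import Request, urlopen

API_BASE = "https://stackstorm.example.com/api/v1"  # hypothetical base URL
HEADERS = {"St2-Api-Key": "<api key>"}              # elided credential

# Statuses treated as "stable" before responding (an assumption).
STABLE_STATUSES = {"paused", "succeeded", "failed", "canceled", "timeout"}


def is_stable(status):
    """True once an execution has settled into a state safe to act on."""
    return status in STABLE_STATUSES


def get_execution(execution_id):
    """GET /api/v1/executions/<id> and return the decoded JSON body."""
    req = Request("%s/executions/%s" % (API_BASE, execution_id), headers=HEADERS)
    with urlopen(req) as resp:
        return json.load(resp)


def wait_for_stable_parent(inquiry_id, poll_interval=1.0):
    """Resolve the inquiry's parent execution and block until it is stable."""
    parent_id = get_execution(inquiry_id)["parent"]
    while not is_stable(get_execution(parent_id)["status"]):
        time.sleep(poll_interval)
    return parent_id
```

Only after `wait_for_stable_parent(...)` returns would the inquiry response be sent; building the same check into the inquiries endpoint itself is what's proposed below.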

@armab @m4dcoder

This should be fixed in the platform itself by adding this check before the inquiry response is processed in this endpoint: https://api.stackstorm.com/api/v1/inquiries/#/inquiries_controller.put.

DorAbu avatar Dec 13 '21 09:12 DorAbu