Remove some deepcopy calls to speed up the workflow conductor
- remove deepcopy where we can
- where a copy is still needed, replace deepcopy with a custom copy that uses a shallow copy instead, to increase speed
- add some data classes (per TSC meeting comment)
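For illustration, a minimal sketch of the kind of change involved (hypothetical variable names, not the actual Orquesta diff):

```python
import copy

# Hypothetical context; stands in for a large workflow context object.
workflow_ctx = {"json": {"big": "payload"}, "bar": [1, 2, 3]}

# Before: a full deep copy of the (potentially large) context on every render.
deep_ctx = copy.deepcopy(workflow_ctx)

# After: a shallow copy is enough when the caller only rebinds top-level keys
# and never mutates nested values in place.
shallow_ctx = dict(workflow_ctx)
shallow_ctx["item"] = 1  # only the top-level binding changes; nested data is shared
```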
It appears the use of 'latest' is causing my tox unit tests to fail.
orquesta.exceptions.PluginFactoryError: Unable to load plugin orquesta.composers.native. No 'orquesta.composers' driver found, looking for 'native'
I also had to disable Python 3.6 for now.
@m4dcoder can you please take a look at this Python 3.8 test failure? It looks like stevedore is not able to find the native and mock plugins to run the tests on Python 3.8.
I did some testing with st2 and this definitely breaks with-items: failures are no longer reported to the parent workflow.
(virtualenv) [vagrant@stackstorm]~/st2/tools$ st2 execution get 64adc15887cff69baae67294
id: 64adc15887cff69baae67294
action.ref: test_pack.test_fail_workflow
parameters:
  number: 5
status: running (635s elapsed)
start_timestamp: Tue, 11 Jul 2023 20:53:44 UTC
end_timestamp:
log:
  - status: requested
    timestamp: '2023-07-11T20:53:44.168000Z'
  - status: scheduled
    timestamp: '2023-07-11T20:53:44.245000Z'
  - status: running
    timestamp: '2023-07-11T20:53:44.317000Z'
result:
  output: null
+--------------------------+------------------------+-----------+---------------------+-------------------------------+
| id | status | task | action | start_timestamp |
+--------------------------+------------------------+-----------+---------------------+-------------------------------+
| 64adc1581982b0a045f91e70 | succeeded (0s elapsed) | do_action | test_pack.test_fail | Tue, 11 Jul 2023 20:53:44 UTC |
| 64adc1591982b0a045f91e85 | succeeded (0s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e88 | succeeded (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e8b | failed (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e8f | succeeded (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
| 64adc1591982b0a045f91e92 | succeeded (1s elapsed) | with_test | test_pack.test_fail | Tue, 11 Jul 2023 20:53:45 UTC |
+--------------------------+------------------------+-----------+---------------------+-------------------------------+
Staged tasks need to be copied, or with-items tasks will remain stuck if one sub-action fails.
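A rough sketch of the shared-reference pitfall this guards against (hypothetical names; this is not the conductor's actual staging structure):

```python
# Hypothetical staged-task structure; the real one lives in the conductor state.
staged = {"task_id": "with_test", "items": [{"status": "running"}] * 3}

# Buggy sharing: every element is the same dict object, so updating one item's
# status after a failed sub-action clobbers the others and the loop stalls.
assert staged["items"][0] is staged["items"][1]

# Copying the staged entries gives each with-items iteration its own state.
staged["items"] = [dict(item) for item in staged["items"]]
assert staged["items"][0] is not staged["items"][1]
```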
I was actually able to get json_8mb.json through this workflow with this change, combined with the zstandard compression changes in the st2 PR queue.
st2 run test_pack.test_workflow @json=/home/vagrant/json_8mb.json
action
from st2common.runners.base_action import Action


class Process(Action):
    def run(self, *args, **kwargs):
        return kwargs["json"]
workflow
version: 1.0

tasks:
  do_action:
    action: test_pack.test_action
    input:
      json: <% ctx().json %>
    next:
      - when: <% succeeded() %>
        publish:
          - bar: [1,2,3]
        do:
          - with_test
  with_test:
    with:
      items: <% ctx().bar %>
      concurrency: 1
    action: test_pack.test_action
    input:
      json: <% ctx().json %>
    next:
      - when: <% succeeded() %>
        publish:
          - with_output: <% result().result %>

input:
  - json

output:
  - with_output: <% ctx().with_output %>
output
+--------------------------+-------------------------+-----------+-----------------------+--------------------+
| id | status | task | action | start_timestamp |
+--------------------------+-------------------------+-----------+-----------------------+--------------------+
| 64af43382c572b699fca6ea5 | succeeded (68s elapsed) | do_action | test_pack.test_action | Thu, 13 Jul 2023 |
| | | | | 00:20:06 UTC |
| 64af43d42c572b699fca6eb6 | succeeded (74s elapsed) | with_test | test_pack.test_action | Thu, 13 Jul 2023 |
| | | | | 00:22:43 UTC |
| 64af44722c572b699fca6ec7 | succeeded (79s elapsed) | with_test | test_pack.test_action | Thu, 13 Jul 2023 |
| | | | | 00:25:20 UTC |
| 64af45122c572b699fca6ed8 | succeeded (68s elapsed) | with_test | test_pack.test_action | Thu, 13 Jul 2023 |
| | | | | 00:28:01 UTC |
+--------------------------+-------------------------+-----------+-----------------------+--------------------+
I was also able to get the st2 unit tests to pass. I had to leave the deep copy in place for the errors. I am testing this locally on our dev box.
Do we have any numbers on how much things are sped up by this change? Ideally, some micro + end-to-end benchmarks would be great (similar to what I've done for the DB layer changes).
When I was originally working on the database layer performance improvements and implemented fast_deepcopy(), I also had a quick look at whether we could utilize it in Orquesta, but I found some blockers, and Orquesta wasn't my priority back then, so I didn't spend more time digging in.
IIRC, it was something related to the complex types used by the state handling, since fast_deepcopy() only supports simple types, i.e. no object instances.
If we can get those sorted out without any issues and the numbers indeed indicate performance improvements, then it would be a great change :)
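Roughly, the idea behind fast_deepcopy() is a serialize/deserialize round trip; a minimal sketch of the concept (not st2's actual implementation, which uses a faster JSON library):

```python
import copy
import json


def fast_deepcopy(value):
    """Copy plain dict/list structures via a JSON round trip; typically much
    faster than copy.deepcopy, but it only handles simple types."""
    try:
        return json.loads(json.dumps(value))
    except TypeError:
        # Complex values (e.g. object instances in the conductor state) are not
        # JSON-serializable, so fall back to the regular deep copy.
        return copy.deepcopy(value)
```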
On a related note to my comment above^ - I think it would be good to add a feature flag / config option for this functionality.
This will serve multiple purposes:
- We will be able to use the feature flag to benchmark both scenarios - regular deepcopy vs fast_deepcopy
- In case a user encounters issues with fast_deepcopy() which we didn't catch here, they can easily opt out of this functionality and continue using StackStorm without needing to downgrade or similar.
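For example, on the StackStorm side such an option could look roughly like this (a sketch using oslo.config; the option name and group are made up):

```python
from oslo_config import cfg

# Hypothetical option name and group; the real ones would be decided on the st2 side.
workflow_opts = [
    cfg.BoolOpt(
        "fast_deepcopy",
        default=False,
        help="Use the faster copy path in the workflow conductor instead of "
             "copy.deepcopy. Disable to fall back to the previous behavior.",
    ),
]
cfg.CONF.register_opts(workflow_opts, group="workflow_engine")
```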
We won't be able to add a feature flag without a change in ST2. Orquesta does not have the StackStorm config object, afaik. I will put some benchmarks in the PR and also make sure we double-check that errors is also copied.
> I will put some benchmarks in the PR and also make sure we double-check that errors is also copied.
Thanks.
> We won't be able to add a feature flag without a change in ST2. Orquesta does not have the StackStorm config object, afaik.
Yeah, I assume it would probably also require a change on the StackStorm side. If you have cycles, I still think that would be a good idea.
Still much easier to flip the feature flag / config option in case something goes wrong versus needing to downgrade to a previous release.
Especially if it's combined with a backward-incompatible change in StackStorm core (the live action FK change - that PR only has a forward migration, not a backward one).
I've been looking into some slowness for workflows with a large number of items inside a with-items step. I saw this PR and I'm curious which changes we're actually looking to gate behind a feature flag here?
The task render speed gain from removing the deep copy of large parameters.
I added a micro-benchmark to show how much faster it is to NOT do a deep copy. I could add one for serialization and deserialization, but the results would be the same. Copying large objects is costly.
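For reference, a minimal sketch of that kind of micro-benchmark (not the one added in the PR):

```python
import copy
import timeit

# Hypothetical payload standing in for a large with-items context.
ctx = {"json": {str(i): list(range(50)) for i in range(1000)}}

deep = timeit.timeit(lambda: copy.deepcopy(ctx), number=100)
shallow = timeit.timeit(lambda: dict(ctx), number=100)

print(f"deepcopy: {deep:.3f}s  shallow copy: {shallow:.3f}s")
```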
All the 3.9 tests passed with these updates (minus the lint checks). The RPM is here if @rebrowning wants to test it out.
I am running this additional speed up on our internal fork of St2.
I will report back if I see any issues.
It looks like CI/CD failed because black changed with its latest release.
@Kami @guzzijones are we requiring the feature flag to switch between deepcopy and fast_deepcopy for this to move forward?
Note, this will help with "with items" speed issues.
@nzlosh @cognifloyd please take a peek at this. It adds significant speed improvements by eliminating some deep copies. This helps with larger context objects.
Anyone willing to give some feedback here?
Technically I'm OK with this being merged. As discussed in mid 2024, it should be merged as part of st2 3.10. I think we should leave a release between the pymongo updates and the deepcopy change to avoid conflating any issues that may come up in the 3.9 and 3.10 releases.