aptos-core [Executor] Merge sequential & parallel execution flow, refactor, test

De-spagettify the sequential executor flow, merge with parallel executor crate, re-using as much code as possible, but keep the same algorithm for redundancy / testing & fallback (i.e. Block-STM could also run sequential with one thread with probably not too much overhead). The main benefit is that with generic parameters, we can now use the powerful testing framework we have for parallel executor for sequential execution also. We also will re-use the executor rayon threadpool for block signature checks.

Additional improvements to tests: - Test Over / Underflows based on storage values (close to 0 or close to u128::MAX) - test different number of cores - Refactors: ModulePath enum, isolate BaselineState for generating baseline with different configurations (i.e. aggregator values materialized or not), check delta sequence, etc. - Make sure StorageError precedence over DeltaApplicationFailure in speculative executions when aggregator is deleted but deltas on top also fail. Not relevant for current use-case (no under/over flows and deletes), also also the proper algorithm for handling general case may not have the issue anyway.

@perryjrandall now would be a great time to start running on different platforms too.

This change is

Sep 30 '22 18:09 gelash

just a quick comment as I am reviewing this, and it will be slow for me as it touches a lot of core stuff I have not seen before. Though a wonderful exercise, thanks! I am not sure when we cut the branch (if we did not already) and all of the main-net implications, but this is touching a bunch of core stuff, as far as I can tell and we want it in definitely after we are done with main-net and all of that, right? Can you share your thoughts?

Also please review errors which look legit?

Oct 01 '22 14:10 dariorussi

@dariorussi - yes, if we like it, we should roll this in after main-net. Should be nothing breaking, more tests, common flow and should facilitate lots of future improvements (e.g. even just the virtue of having a single executor object for callbacks or what not).

Will def fix all the linter and related errors. And (TODO to self): will also experiment with Dashset/Dashmap configuration settings (num shards mainly), we use defaults and maybe there is some perf to squeeze.

Oct 04 '22 04:10 gelash

I think you are splitting this up in multiple PRs which is a very good idea, so love it and maybe you want to mark it somehow. Just commenting given I have a small comment I think you may enjoy.

Will do ASAP. To document here, the plan is to do proptest changes in a separate diff.

I will make sure to stack the proptest PR on top though and not land the first PR without ensuring the second one (with new and stronger tests) passes.

Nov 18 '22 08:11 gelash

Split the PR, this is the first one that refactors the aptos-vm execution flow, removing unused status and merging sequential & parallel execution. Renamed parallel_executor -> block_executor as per @dariorussi 's suggestion, but since that's a lot of lines, made it a separate commit for the ease of reviewing.

@runtian-zhou can you have a look now?

Nov 19 '22 22:11 gelash

@zekun000 @sasha8 ping ping

Nov 19 '22 22:11 gelash

Forge is running suite `land_blocking` on `82968ffdf6e772b7fa26eae01770662473773193`

Nov 21 '22 01:11 github-actions[bot]

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `82968ffdf6e772b7fa26eae01770662473773193`

Nov 21 '22 01:11 github-actions[bot]

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `82968ffdf6e772b7fa26eae01770662473773193`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 82968ffdf6e772b7fa26eae01770662473773193 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7502 TPS, 5526 ms latency, 7800 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 82968ffdf6e772b7fa26eae01770662473773193
compatibility::simple-validator-upgrade::single-validator-upgrade : 4745 TPS, 8629 ms latency, 12300 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 82968ffdf6e772b7fa26eae01770662473773193
compatibility::simple-validator-upgrade::half-validator-upgrade : 4683 TPS, 8744 ms latency, 11500 ms p99 latency,no expired txns
4. upgrading second batch to new version: 82968ffdf6e772b7fa26eae01770662473773193
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6382 TPS, 6197 ms latency, 10100 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 82968ffdf6e772b7fa26eae01770662473773193 passed
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 21 '22 01:11 github-actions[bot]

:white_check_mark: Forge suite `land_blocking` success on `82968ffdf6e772b7fa26eae01770662473773193`

performance benchmark with full nodes : 6607 TPS, 6016 ms latency, 20000 ms p99 latency,no expired txns
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 21 '22 01:11 github-actions[bot]

Forge is running suite `land_blocking` on `d22cc15ba0fa0f3940af008a3d47781803ee266d`

Nov 23 '22 05:11 github-actions[bot]

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `d22cc15ba0fa0f3940af008a3d47781803ee266d`

Nov 23 '22 05:11 github-actions[bot]

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `d22cc15ba0fa0f3940af008a3d47781803ee266d`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> d22cc15ba0fa0f3940af008a3d47781803ee266d (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7446 TPS, 5210 ms latency, 7500 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: d22cc15ba0fa0f3940af008a3d47781803ee266d
compatibility::simple-validator-upgrade::single-validator-upgrade : 4293 TPS, 9658 ms latency, 13000 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: d22cc15ba0fa0f3940af008a3d47781803ee266d
compatibility::simple-validator-upgrade::half-validator-upgrade : 4326 TPS, 9683 ms latency, 13200 ms p99 latency,no expired txns
4. upgrading second batch to new version: d22cc15ba0fa0f3940af008a3d47781803ee266d
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6646 TPS, 5851 ms latency, 11400 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> d22cc15ba0fa0f3940af008a3d47781803ee266d passed
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 23 '22 06:11 github-actions[bot]

:white_check_mark: Forge suite `land_blocking` success on `d22cc15ba0fa0f3940af008a3d47781803ee266d`

performance benchmark with full nodes : 6515 TPS, 6120 ms latency, 19900 ms p99 latency,(!) expired 320 out of 2782260 txns
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 23 '22 06:11 github-actions[bot]

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `73bf1812118d3a3960e9441f22b6999a272af7eb`

Nov 26 '22 05:11 github-actions[bot]

Forge is running suite `land_blocking` on `73bf1812118d3a3960e9441f22b6999a272af7eb`

Nov 26 '22 05:11 github-actions[bot]

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `73bf1812118d3a3960e9441f22b6999a272af7eb`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 73bf1812118d3a3960e9441f22b6999a272af7eb (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7480 TPS, 5187 ms latency, 7500 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 73bf1812118d3a3960e9441f22b6999a272af7eb
compatibility::simple-validator-upgrade::single-validator-upgrade : 4973 TPS, 8341 ms latency, 10600 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 73bf1812118d3a3960e9441f22b6999a272af7eb
compatibility::simple-validator-upgrade::half-validator-upgrade : 4438 TPS, 9066 ms latency, 12400 ms p99 latency,no expired txns
4. upgrading second batch to new version: 73bf1812118d3a3960e9441f22b6999a272af7eb
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6808 TPS, 5629 ms latency, 9100 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 73bf1812118d3a3960e9441f22b6999a272af7eb passed
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 26 '22 05:11 github-actions[bot]

:white_check_mark: Forge suite `land_blocking` success on `73bf1812118d3a3960e9441f22b6999a272af7eb`

performance benchmark with full nodes : 6657 TPS, 5956 ms latency, 10200 ms p99 latency,(!) expired 200 out of 2842940 txns
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 26 '22 05:11 github-actions[bot]

Lets give some more time for review from Move team members. (I know you filed this a while ago so sorry for no earlier attention.)

Nov 26 '22 08:11 wrwg

Lets give some more time for review from Move team members. (I know you filed this a while ago so sorry for no earlier attention.)

I was never going to land this without @runtian-zhou 's approval, but appreciate more eyes.

For context: I already have 3 follow-ups implemented (2 drafts linked 2 in the comments, or can see here: https://github.com/aptos-labs/aptos-core/compare/main...seqparoutput, third for adding proptests which I separated from here for clarity) which intend to improve and simplify the whole executor <-> aptos-vm integration (+ sequential now), delta resolution, etc.

But it all starts here, and once we have all executor flow nicely in the block_executor, opens up all the extension and gas hooks that we need for different outstanding tasks (can explain offline).

Nov 26 '22 14:11 gelash

This generally seems to go into the right direction (that is unifying sequential/parallel block execution). We need a larger refactoring of the adapter (see comment below), but this PR can be an incremental step for this. Leaving detail review to folks more acquainted with PE.

Totally agreed, that's precisely the intention. There are some specific fixes, but regarding the overall refactoring aspect, my rationale at the moment is to try and make some incremental local simplifications / improvements while hopefully going in the good global direction (allow having a single "executor" that can be cleanly connected to other parts of aptos-vm, re-used across blocks, etc). Hopefully with some iteration (and there are indeed 2/3 PRs coming right after), we will also start seeing the big picture better. One heuristic for me for now is to push some pieces of logic to the block_executor side (e.g. in the next PR, the StateView wrapper business currently done in different ways in aptos-vm), since that code is newer and should have more fixed structure / flow.

However, then we should absolutely look into precisely the question of how the block executor is incorporated into aptos_vm. For some context, I believe the current state of affairs is due to two things - @runtian-zhou can confirm or deny as the author and the ultimate expert on the wrapper flow: (a) make parallel executor standalone crate with generic parameters - be able to test without Move-VM. (b) a clear separator of abstractions to facilitate the development at the time. The testing (and prop-testing) framework is probably the best thing that we got out of building it this way. But now all these layers in aptos-vm / aptos_vm_impl / adapter / wrapper need to eventually also be simplified, especially since we have a lot of use-cases where we'd need hooks to & from other parts of aptos-vm.

Nov 28 '22 06:11 gelash

Forge is running suite `land_blocking` on `e197a64f990b349b888eec4624c1adf945d0ef67`

Nov 28 '22 07:11 github-actions[bot]

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `e197a64f990b349b888eec4624c1adf945d0ef67`

Nov 28 '22 07:11 github-actions[bot]

:white_check_mark: Forge suite `land_blocking` success on `e197a64f990b349b888eec4624c1adf945d0ef67`

performance benchmark with full nodes : 6654 TPS, 5977 ms latency, 9600 ms p99 latency,(!) expired 80 out of 2841680 txns
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 28 '22 08:11 github-actions[bot]

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `e197a64f990b349b888eec4624c1adf945d0ef67`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e197a64f990b349b888eec4624c1adf945d0ef67 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7062 TPS, 5501 ms latency, 7200 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: e197a64f990b349b888eec4624c1adf945d0ef67
compatibility::simple-validator-upgrade::single-validator-upgrade : 4655 TPS, 8629 ms latency, 11700 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: e197a64f990b349b888eec4624c1adf945d0ef67
compatibility::simple-validator-upgrade::half-validator-upgrade : 4600 TPS, 8860 ms latency, 12800 ms p99 latency,no expired txns
4. upgrading second batch to new version: e197a64f990b349b888eec4624c1adf945d0ef67
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6676 TPS, 5958 ms latency, 10300 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e197a64f990b349b888eec4624c1adf945d0ef67 passed
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 28 '22 08:11 github-actions[bot]

The way how parallel executor is structured like this is exactly what @gelash suggests. The whole intention of this parallel_executor crate abstraction is to:

Help testing core scheduling logic without worrying about the Aptos VM implementation.
Being able to run benchmarks that are independent of the AptosVM.

The testing aspect is probably more important in this context cuz it's generally quite hard to test parallel code like this.

Nov 28 '22 23:11 runtian-zhou

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `c419c2153ba3336fac166f9b1184ec28869179d5`

Nov 29 '22 01:11 github-actions[bot]

Forge is running suite `land_blocking` on `c419c2153ba3336fac166f9b1184ec28869179d5`

Nov 29 '22 01:11 github-actions[bot]

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `c419c2153ba3336fac166f9b1184ec28869179d5`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7405 TPS, 5244 ms latency, 7000 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: c419c2153ba3336fac166f9b1184ec28869179d5
compatibility::simple-validator-upgrade::single-validator-upgrade : 4806 TPS, 8401 ms latency, 12200 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: c419c2153ba3336fac166f9b1184ec28869179d5
compatibility::simple-validator-upgrade::half-validator-upgrade : 4763 TPS, 8435 ms latency, 11000 ms p99 latency,no expired txns
4. upgrading second batch to new version: c419c2153ba3336fac166f9b1184ec28869179d5
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6904 TPS, 5806 ms latency, 11100 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5 passed
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 29 '22 01:11 github-actions[bot]

:white_check_mark: Forge suite `land_blocking` success on `c419c2153ba3336fac166f9b1184ec28869179d5`

performance benchmark with full nodes : 6943 TPS, 5706 ms latency, 8700 ms p99 latency,(!) expired 540 out of 2965300 txns
Test Ok

Grafana dashboard
Humio Logs
Test runner output
Test run is land-blocking

Nov 29 '22 01:11 github-actions[bot]

aptos-core aptos-core copied to clipboard

[Executor] Merge sequential & parallel execution flow, refactor, test

Forge is running suite land_blocking on 82968ffdf6e772b7fa26eae01770662473773193

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 82968ffdf6e772b7fa26eae01770662473773193

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 82968ffdf6e772b7fa26eae01770662473773193

:white_check_mark: Forge suite land_blocking success on 82968ffdf6e772b7fa26eae01770662473773193

Forge is running suite land_blocking on d22cc15ba0fa0f3940af008a3d47781803ee266d

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> d22cc15ba0fa0f3940af008a3d47781803ee266d

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> d22cc15ba0fa0f3940af008a3d47781803ee266d

:white_check_mark: Forge suite land_blocking success on d22cc15ba0fa0f3940af008a3d47781803ee266d

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 73bf1812118d3a3960e9441f22b6999a272af7eb

Forge is running suite land_blocking on 73bf1812118d3a3960e9441f22b6999a272af7eb

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 73bf1812118d3a3960e9441f22b6999a272af7eb

:white_check_mark: Forge suite land_blocking success on 73bf1812118d3a3960e9441f22b6999a272af7eb

Forge is running suite land_blocking on e197a64f990b349b888eec4624c1adf945d0ef67

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e197a64f990b349b888eec4624c1adf945d0ef67

:white_check_mark: Forge suite land_blocking success on e197a64f990b349b888eec4624c1adf945d0ef67

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> e197a64f990b349b888eec4624c1adf945d0ef67

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5

Forge is running suite land_blocking on c419c2153ba3336fac166f9b1184ec28869179d5

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> c419c2153ba3336fac166f9b1184ec28869179d5

:white_check_mark: Forge suite land_blocking success on c419c2153ba3336fac166f9b1184ec28869179d5

aptos-core
aptos-core copied to clipboard

Forge is running suite `land_blocking` on `82968ffdf6e772b7fa26eae01770662473773193`

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `82968ffdf6e772b7fa26eae01770662473773193`

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `82968ffdf6e772b7fa26eae01770662473773193`

:white_check_mark: Forge suite `land_blocking` success on `82968ffdf6e772b7fa26eae01770662473773193`

Forge is running suite `land_blocking` on `d22cc15ba0fa0f3940af008a3d47781803ee266d`

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `d22cc15ba0fa0f3940af008a3d47781803ee266d`

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `d22cc15ba0fa0f3940af008a3d47781803ee266d`

:white_check_mark: Forge suite `land_blocking` success on `d22cc15ba0fa0f3940af008a3d47781803ee266d`

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `73bf1812118d3a3960e9441f22b6999a272af7eb`

Forge is running suite `land_blocking` on `73bf1812118d3a3960e9441f22b6999a272af7eb`

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `73bf1812118d3a3960e9441f22b6999a272af7eb`

:white_check_mark: Forge suite `land_blocking` success on `73bf1812118d3a3960e9441f22b6999a272af7eb`

Forge is running suite `land_blocking` on `e197a64f990b349b888eec4624c1adf945d0ef67`

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `e197a64f990b349b888eec4624c1adf945d0ef67`

:white_check_mark: Forge suite `land_blocking` success on `e197a64f990b349b888eec4624c1adf945d0ef67`

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `e197a64f990b349b888eec4624c1adf945d0ef67`

Forge is running suite `compat` on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `c419c2153ba3336fac166f9b1184ec28869179d5`

Forge is running suite `land_blocking` on `c419c2153ba3336fac166f9b1184ec28869179d5`

:white_check_mark: Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `c419c2153ba3336fac166f9b1184ec28869179d5`

:white_check_mark: Forge suite `land_blocking` success on `c419c2153ba3336fac166f9b1184ec28869179d5`