aptos-core icon indicating copy to clipboard operation
aptos-core copied to clipboard

[move] Fix struct depth computation concurrency bug

Open georgemitenkov opened this issue 1 year ago • 6 comments

Description

Recently, VM became shared per-thread (#13091) for validation. As a result, an invariant violation can be triggered on struct depth checks in loader. This PR fixes that by removing the race condition when writing the newly cached depth formula.

Note: removing the paranoid check for recursive types is safe because it is already checked by the bytecode verifier.

Type of Change

  • [ ] New feature
  • [x] Bug fix
  • [ ] Breaking change
  • [ ] Performance improvement
  • [ ] Refactoring
  • [ ] Dependency update
  • [ ] Documentation update
  • [ ] Tests

Which Components or Systems Does This Change Impact?

  • [ ] Validator Node
  • [ ] Full Node (API, Indexer, etc.)
  • [x] Move/Aptos Virtual Machine
  • [ ] Aptos Framework
  • [ ] Aptos CLI/SDK
  • [ ] Developer Infrastructure
  • [ ] Other (specify)

How Has This Been Tested?

Key Areas to Review

Checklist

  • [x] I have read and followed the CONTRIBUTING doc
  • [x] I have performed a self-review of my own code
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [x] I identified and added all stakeholders and component owners affected by this change as reviewers
  • [ ] I tested both happy and unhappy path of the functionality
  • [ ] I have made corresponding changes to the documentation

georgemitenkov avatar Apr 29 '24 09:04 georgemitenkov

Forge is running suite realistic_env_max_load on dde9c9f5bf92e4ee8008954ce117b518764ac83b

github-actions[bot] avatar Apr 29 '24 09:04 github-actions[bot]

Forge is running suite compat on 01b24e7e3548382dd25440b39a0438a993387f12 ==> dde9c9f5bf92e4ee8008954ce117b518764ac83b

github-actions[bot] avatar Apr 29 '24 09:04 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 62.2%. Comparing base (4ba801d) to head (dde9c9f). Report is 772 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##             main   #13097     +/-   ##
=========================================
- Coverage    62.4%    62.2%   -0.2%     
=========================================
  Files         830      827      -3     
  Lines      186498   182950   -3548     
=========================================
- Hits       116389   113831   -2558     
+ Misses      70109    69119    -990     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 29 '24 09:04 codecov[bot]

:x: Forge suite compat failure on 01b24e7e3548382dd25440b39a0438a993387f12 ==> dde9c9f5bf92e4ee8008954ce117b518764ac83b

Compatibility test results for 01b24e7e3548382dd25440b39a0438a993387f12 ==> dde9c9f5bf92e4ee8008954ce117b518764ac83b (PR)
1. Check liveness of validators at old version: 01b24e7e3548382dd25440b39a0438a993387f12
compatibility::simple-validator-upgrade::liveness-check : committed: 6300 txn/s, latency: 5067 ms, (p50: 4800 ms, p90: 9200 ms, p99: 10500 ms), latency samples: 239400
2. Upgrading first Validator to new version: dde9c9f5bf92e4ee8008954ce117b518764ac83b
Test Failed: Tried executing 10 txns, request counters: "success 0, failed submit [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10], failed wait [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10], by client: [(0, 180, 180): http://aptos-node-1-validator.forge-compat-pr-13097.svc:8080/v1/]"

Caused by:
    Unknown error Transaction expired, without being seen in mempool. It is guaranteed it will not be committed on chain.

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.79/src/error.rs:565:25
   1: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1963:27
   2: aptos_transaction_emitter_lib::emitter::transaction_executor::RestApiReliableTransactionSubmitter::submit_check_and_retry::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/transaction_executor.rs:127:28
   3: <futures_util::future::maybe_done::MaybeDone<Fut> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/maybe_done.rs:95:38
   4: <futures_util::future::join_all::JoinAll<F> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/join_all.rs:143:24
   5: <aptos_transaction_emitter_lib::emitter::transaction_executor::RestApiReliableTransactionSubmitter as aptos_transaction_generator_lib::ReliableTransactionSubmitter>::execute_transactions_with_counter::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/transaction_executor.rs:309:10
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
   7: aptos_transaction_emitter_lib::emitter::account_minter::AccountMinter::create_and_fund_seed_accounts::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/account_minter.rs:434:18
   8: aptos_transaction_emitter_lib::emitter::account_minter::AccountMinter::create_and_fund_accounts::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/account_minter.rs:326:14
   9: aptos_transaction_emitter_lib::emitter::create_accounts::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/mod.rs:1195:14
  10: aptos_transaction_emitter_lib::emitter::TxnEmitter::start_job::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/mod.rs:717:10
  11: aptos_transaction_emitter_lib::emitter::TxnEmitter::emit_txn_for_impl::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/mod.rs:827:14
  12: aptos_transaction_emitter_lib::emitter::TxnEmitter::emit_txn_for::{{closure}}
             at ./crates/transaction-emitter-lib/src/emitter/mod.rs:859:14
  13: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/park.rs:282:63
  14: tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/coop.rs:107:5
  15: tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/coop.rs:73:5
  16: tokio::runtime::park::CachedParkThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/park.rs:282:31
  17: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/context/blocking.rs:66:9
  18: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/scheduler/multi_thread/mod.rs:87:13
  19: tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/context/runtime.rs:65:16
  20: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/scheduler/multi_thread/mod.rs:86:9
  21: tokio::runtime::runtime::Runtime::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.1/src/runtime/runtime.rs:350:50
  22: aptos_testcases::generate_traffic
             at ./testsuite/testcases/src/lib.rs:105:17
  23: <aptos_testcases::compatibility_test::SimpleValidatorUpgrade as aptos_forge::interface::network::NetworkTest>::run
             at ./testsuite/testcases/src/compatibility_test.rs:83:24
  24: aptos_forge::runner::Forge<F>::run::{{closure}}
             at ./testsuite/forge/src/runner.rs:598:42
  25: aptos_forge::runner::run_test
             at ./testsuite/forge/src/runner.rs:666:11
  26: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:598:30
  27: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:427:11
  28: forge::main
             at ./testsuite/forge-cli/src/main.rs:353:21
  29: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
  30: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:154:18
  31: std::rt::lang_start::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:167:18
  32: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:284:13
  33: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  34: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  35: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  36: std::rt::lang_start_internal::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:48
  37: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  38: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  39: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  40: std::rt::lang_start_internal
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
  41: std::rt::lang_start
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:166:17
  42: __libc_start_main
  43: _start
Trailing Log Lines:
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  39: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  40: std::rt::lang_start_internal
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
  41: std::rt::lang_start
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:166:17
  42: __libc_start_main
  43: _start


Swarm logs can be found here: See fgi output for more information.
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:292"},"thread_name":"main","hostname":"forge-compat-pr-13097-1714384034-01b24e7e3548382dd25440b39a0438","timestamp":"2024-04-29T09:56:43.349378Z","message":"Deleting namespace forge-compat-pr-13097: Some(NamespaceStatus { conditions: None, phase: Some(\"Terminating\") })"}
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:400"},"thread_name":"main","hostname":"forge-compat-pr-13097-1714384034-01b24e7e3548382dd25440b39a0438","timestamp":"2024-04-29T09:56:43.349406Z","message":"aptos-node resources for Forge removed in namespace: forge-compat-pr-13097"}

failures:
    compatibility::simple-validator-upgrade

test result: FAILED. 0 passed; 1 failed; 0 filtered out

Failed to run tests:
Tests Failed
Error: Tests Failed

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.79/src/error.rs:83:36
   1: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:618:13
   2: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:427:11
   3: forge::main
             at ./testsuite/forge-cli/src/main.rs:353:21
   4: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
   5: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:154:18
   6: std::rt::lang_start::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:167:18
   7: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:284:13
   8: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
   9: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  10: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  11: std::rt::lang_start_internal::{{closure}}
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:48
  12: std::panicking::try::do_call
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:552:40
  13: std::panicking::try
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:516:19
  14: std::panic::catch_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panic.rs:142:14
  15: std::rt::lang_start_internal
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:148:20
  16: std::rt::lang_start
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/rt.rs:166:17
  17: __libc_start_main
  18: _start
Debugging output:
NAME                                    READY   STATUS      RESTARTS   AGE
aptos-node-0-validator-0                1/1     Running     0          6m46s
aptos-node-1-validator-0                1/1     Running     0          3m40s
aptos-node-2-validator-0                1/1     Running     0          6m46s
aptos-node-3-validator-0                1/1     Running     0          6m46s
genesis-aptos-genesis-eforge103-q482v   0/1     Completed   0          7m16s

github-actions[bot] avatar Apr 29 '24 09:04 github-actions[bot]

:white_check_mark: Forge suite realistic_env_max_load success on dde9c9f5bf92e4ee8008954ce117b518764ac83b

two traffics test: inner traffic : committed: 8368 txn/s, latency: 4691 ms, (p50: 4500 ms, p90: 5400 ms, p99: 9900 ms), latency samples: 3606880
two traffics test : committed: 100 txn/s, latency: 1890 ms, (p50: 1800 ms, p90: 2100 ms, p99: 6100 ms), latency samples: 1740
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.208, avg: 0.202", "QsPosToProposal: max: 0.235, avg: 0.217", "ConsensusProposalToOrdered: max: 0.427, avg: 0.404", "ConsensusOrderedToCommit: max: 0.386, avg: 0.370", "ConsensusProposalToCommit: max: 0.785, avg: 0.773"]
Max round gap was 1 [limit 4] at version 1773779. Max no progress secs was 4.863372 [limit 15] at version 1773779.
Test Ok

github-actions[bot] avatar Apr 29 '24 09:04 github-actions[bot]

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

github-actions[bot] avatar Jun 14 '24 01:06 github-actions[bot]

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

github-actions[bot] avatar Jul 30 '24 01:07 github-actions[bot]

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

github-actions[bot] avatar Sep 14 '24 01:09 github-actions[bot]