nearcore
Exclude contract code out of state witness & distribute separately
Relevant discussion
Issue
During a stateless validation forknet test, we observed a node crash with the following error:
2024-04-16T20:21:23.545144Z DEBUG chunk_tracing{chunk_hash=HnFSQEoLMEnMXK2pxnnnbv7GkwFobanyrd7JJbNS2Rrj}:new_chunk{shard_id=3}:apply_chunk{shard_id=3}:process_state_update:apply{protocol_version=84 num_transactions=19}:process_receipt{receipt_id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP predecessor=operator.meta-pool.near receiver=lockup-meta-pool.near id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP}:run{code.hash=EXekfV3kpFHHsTi4JUDh2MVLCKS3hpKdPbXMuRirxrvY vm_kind=NearVm}: vm: close time.busy=49.3µs time.idle=3.42µs
thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, 5FWvfWAJxH1mbCHuzLGwBfL9EYjH8YWVin6Pmp3H8gdM)
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
4: near_store::trie::Trie::internal_retrieve_trie_node
5: near_store::trie::Trie::retrieve_raw_node
6: near_store::trie::Trie::lookup_from_state_column
7: near_store::trie::Trie::get_optimized_ref
8: near_store::trie::Trie::get
9: near_store::trie::update::TrieUpdate::get
10: near_store::get_code
11: node_runtime::actions::execute_function_call
12: node_runtime::Runtime::apply_action
13: node_runtime::Runtime::apply_action_receipt
14: node_runtime::Runtime::apply::{{closure}}
15: node_runtime::Runtime::apply
16: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
17: near_chain::update_shard::apply_new_chunk
18: core::ops::function::FnOnce::call_once{{vtable.shim}}
19: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
20: rayon_core::registry::WorkerThread::wait_until_cold
@Longarithm mentioned that
10: near_store::get_code
is due to contract code missing from the state witness.
From the debug log, @staffik confirmed that this was likely the case and that the crash was happening with different contracts, including lockup-meta-pool.near and pack.promotional.basketball.playible.near.
@Longarithm's understanding of how this can cause a node crash is as follows:
- Chunk producer reads code from cache and doesn't go to trie for the code;
- so trie nodes required for reading contract code are never read and recorded;
- so the chunk validator doesn't know where to get the code from.
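The failure mode described above can be reproduced in a toy model. The following is a minimal sketch, not nearcore code: `RecordingStorage`, `producer_get_code`, `validator_get_code`, and `simulate` are hypothetical names, and a plain `u64` stands in for the real hash type. It only illustrates that a cache hit on the producer side leaves the code out of the recorded witness.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a trie value hash (nearcore uses a real crypto hash).
type Hash = u64;

/// Toy storage that records every value it serves, loosely mirroring how a
/// chunk producer records trie reads to build the state witness.
struct RecordingStorage {
    values: HashMap<Hash, Vec<u8>>,
    recorded: HashMap<Hash, Vec<u8>>,
}

impl RecordingStorage {
    fn get(&mut self, hash: Hash) -> Option<Vec<u8>> {
        let value = self.values.get(&hash)?.clone();
        self.recorded.insert(hash, value.clone());
        Some(value)
    }
}

/// Producer-side lookup: a cache hit returns early and never touches the
/// storage, so nothing is recorded for the witness.
fn producer_get_code(
    cache: &HashMap<Hash, Vec<u8>>,
    storage: &mut RecordingStorage,
    code_hash: Hash,
) -> Option<Vec<u8>> {
    if let Some(code) = cache.get(&code_hash) {
        return Some(code.clone());
    }
    storage.get(code_hash)
}

/// Validator-side lookup: only the recorded values (the witness) are available.
fn validator_get_code(
    witness: &HashMap<Hash, Vec<u8>>,
    code_hash: Hash,
) -> Result<Vec<u8>, String> {
    witness
        .get(&code_hash)
        .cloned()
        .ok_or_else(|| format!("MissingTrieValue({code_hash})"))
}

/// Runs the producer, then the validator; `warm_cache` toggles the cache hit.
fn simulate(warm_cache: bool) -> Result<Vec<u8>, String> {
    let code_hash: Hash = 42;
    let code = b"wasm".to_vec();
    let mut storage = RecordingStorage {
        values: HashMap::from([(code_hash, code.clone())]),
        recorded: HashMap::new(),
    };
    let mut cache = HashMap::new();
    if warm_cache {
        cache.insert(code_hash, code.clone());
    }
    producer_get_code(&cache, &mut storage, code_hash).expect("producer has the code");
    validator_get_code(&storage.recorded, code_hash)
}
```

With a cold cache, `simulate(false)` returns the code because the read was recorded. With a warm cache, `simulate(true)` returns an error analogous to the `MissingTrieValue` panic in the backtrace.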
Timeline
April 17
@Longarithm is preparing a quick patch to bypass the issue in Forknet for now, but we need a proper solution in place before the MainNet launch.
April 18
The team discussed a proper solution and decided to separate contract code from the state witness. When a chunk validator realizes that it does not have the contract code needed to validate an incoming state witness, it will reactively request the missing code from its peers. As a result, a chunk miss may happen, but the chunk validator should then have the compiled contract code ready for future validation.
The project involves, but is not limited to, the following work:
- Introduce a new network message to request contract code
- Saketh's tip on how to do so: link
- Remove contract code from state witness
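One possible shape for the reactive request flow is sketched below. Everything here is an assumption made for illustration: the message names `ContractCodeRequest` and `ContractCodeResponse`, the `ChunkValidator` struct, and the idea that the validator learns the hashes of the called contracts alongside the witness are all hypothetical, and `DefaultHasher` merely stands in for the real code hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical code hash type; nearcore uses a cryptographic hash instead.
type CodeHash = u64;

/// Stand-in hash function, used only so the sketch is self-contained.
fn hash_code(code: &[u8]) -> CodeHash {
    let mut hasher = DefaultHasher::new();
    code.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical network message asking peers for missing contract code.
#[derive(Debug, PartialEq)]
struct ContractCodeRequest {
    missing: Vec<CodeHash>,
}

/// Hypothetical reply carrying raw contract code.
struct ContractCodeResponse {
    codes: Vec<Vec<u8>>,
}

#[derive(Debug, PartialEq)]
enum WitnessOutcome {
    /// All code is available locally; validation can proceed.
    Validate,
    /// Some code is missing; this chunk is skipped and the code is requested.
    SkipAndRequest(ContractCodeRequest),
}

#[derive(Default)]
struct ChunkValidator {
    code_cache: HashMap<CodeHash, Vec<u8>>,
}

impl ChunkValidator {
    /// `called` is assumed to list the hashes of contracts the chunk executes.
    fn on_state_witness(&self, called: &[CodeHash]) -> WitnessOutcome {
        let missing: Vec<CodeHash> = called
            .iter()
            .copied()
            .filter(|h| !self.code_cache.contains_key(h))
            .collect();
        if missing.is_empty() {
            WitnessOutcome::Validate
        } else {
            WitnessOutcome::SkipAndRequest(ContractCodeRequest { missing })
        }
    }

    /// Stores only code whose hash was actually requested; returns the count.
    fn on_code_response(
        &mut self,
        request: &ContractCodeRequest,
        response: ContractCodeResponse,
    ) -> usize {
        let mut accepted = 0;
        for code in response.codes {
            let hash = hash_code(&code);
            if request.missing.contains(&hash) {
                self.code_cache.insert(hash, code);
                accepted += 1;
            }
        }
        accepted
    }
}
```

In this sketch the first witness that references unknown code yields `SkipAndRequest`, which corresponds to the possible chunk miss. Once a matching response has been accepted, a later witness for the same contract yields `Validate`. Re-hashing the received code before caching it is a design choice of the sketch, so that a peer cannot inject code that was never requested.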