gossamer
gossamer copied to clipboard
(lib/runtime): fail to execute block `17088245` allocator out of space
2024-02-24T03:55:37Z INFO imported block 0x2aad31837b6cdf8ce5964cca86b3a39b67ff872aa5eb6c954217872f6683d63d and stored state trie with root 0xe40e9900f63d1539014ce5cc4fa0e4932ae4ed1154c31cd77714d0fdf2d4e8b3 pkg=core
RUNTIME HASH: 0xd1d708433e5d7bcb9fc30a3de67952c1611fa58cd1980146ab07aa438116d352
2024-02-24T03:55:37Z ERROR block data processing for block with hash 0x0feb9c8bb9018bbcf20b467e15636de02397de00c66233c5b4a06f5be3a2f558 failed: processing block data with header and body: handling block: failed to execute block 17088245: running runtime function: bumping: allocator out of space: current pages 65535 greater than max wasm pages 65535 (recovered by wazero)
wasm stack trace:
env.ext_allocator_malloc_version_1(i32) i32
._ZN5sp_io9allocator26extern_host_function_impls6malloc17h175ebffadfcb2d44E(i32) i32
._ZN5alloc7raw_vec11finish_grow17h97bec91bef32a5ebE.llvm.1609301459894010017(i32,i32,i32,i32)
._ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h1ed9c357b734a946E(i32,i32)
._ZN18parity_scale_codec5codec19decode_vec_with_len17h15e2f12c18cd243fE(i32,i32,i32)
._ZN13frame_support7storage8unhashed3get17hb6cef7316d8fee28E(i32,i32,i32)
._ZN353_$LT$pallet_staking..ExposureOf$LT$T$GT$$u20$as$u20$sp_runtime..traits..Convert$LT$$LT$T$u20$as$u20$frame_system..pallet..Config$GT$..AccountId$C$core..option..Option$LT$pallet_staking..Exposure$LT$$LT$T$u20$as$u20$frame_system..pallet..Config$GT$..AccountId$C$$LT$T$u20$as$u20$pallet_staking..pallet..pallet..Config$GT$..CurrencyBalance$GT$$GT$$GT$$GT$7convert17h469fc883609fa8e1E(i32,i32)
._ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h34924d7a99a1d609E(i32,i32)
._ZN14pallet_session10historical7onchain39store_session_validator_set_to_offchain17h0e99a24533159ab9E(i32)
._ZN893_$LT$$LP$TupleElement0$C$TupleElement1$C$TupleElement2$C$TupleElement3$C$TupleElement4$C$TupleElement5$C$TupleElement6$C$TupleElement7$C$TupleElement8$C$TupleElement9$C$TupleElement10$C$TupleElement11$C$TupleElement12$C$TupleElement13$C$TupleElement14$C$TupleElement15$C$TupleElement16$C$TupleElement17$C$TupleElement18$C$TupleElement19$C$TupleElement20$C$TupleElement21$C$TupleElement22$C$TupleElement23$C$TupleElement24$C$TupleElement25$C$TupleElement26$C$TupleElement27$C$TupleElement28$C$TupleElement29$C$TupleElement30$C$TupleElement31$C$TupleElement32$C$TupleElement33$C$TupleElement34$C$TupleElement35$C$TupleElement36$C$TupleElement37$C$TupleElement38$C$TupleElement39$C$TupleElement40$C$TupleElement41$C$TupleElement42$C$TupleElement43$C$TupleElement44$C$TupleElement45$C$TupleElement46$C$TupleElement47$RP$$u20$as$u20$frame_support..traits..hooks..OnInitialize$LT$BlockNumber$GT$$GT$13on_initialize17h252b62108cdb4ff8E(i32,i32)
._ZN15frame_executive104Executive$LT$System$C$Block$C$Context$C$UnsignedValidator$C$AllPalletsWithSystem$C$COnRuntimeUpgrade$GT$16initialize_block17h25c01251da9ce46aE(i32)
._ZN15frame_executive104Executive$LT$System$C$Block$C$Context$C$UnsignedValidator$C$AllPalletsWithSystem$C$COnRuntimeUpgrade$GT$13execute_block17h04bd9202b85e1554E(i32)
.Core_execute_block(i32,i32) i64 pkg=sync
edit After a long debug session with @jimjbrettj™ we found that the allocation problem was related with long running instances and that is not trivial to clean up the runtime memory after a block execution (manually clean up), so here is a suggestion of what can be done to solve and a reference to substrate (https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L170C39-L170C55)
- [ ] Instantiate a runtime for every block execution
Related to the allocator problem I will need to wait for EC2 node to run towards block 17M to validate the changes to the allocator, I cannot reproduce it locally since when I start the node the memory is completely empty when executing the block, differently from the ec2 node where the allocator/memory is being used every imported block. Given that the allocator exports its metrics through Prometheus we can compare the usage now and before (where Gossamer had less one page)
After investigating substrate FreeingBumpHeapAllocator
I notice that Gossamer is not cleaning up the allocator after block execution, instead Gossamer is refreshing the allocator only on every runtime upgrade
Here is the call_impl
method which is called when the host needs to call a runtime func
- https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L163C5-L163C14
- https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L170C39-L170C55
Also the PR https://github.com/ChainSafe/gossamer/pull/3800 introduces the fix
After investigations with @jimjbrettj™ we found that the allocator still facing the same problem, just reinstantiating the allocator is not enough to solve the allocator out of space
problem and we are facing wasm out of bounds access
problems as shown in the image belown.
Same problem happening on westend staging at block #9412261
While syncing to #17m we faced a super slowdown after a runtime upgrade (0.9.25) which I am investigating at https://github.com/ChainSafe/gossamer/issues/3950
The fix was confirmed in our staging nodes given that the latest block finalized was #17272320
fyi @jimjbrettj @P1sar @timwu20