gossamer icon indicating copy to clipboard operation
gossamer copied to clipboard

(lib/runtime): fail to execute block `17088245` allocator out of space

Open EclesioMeloJunior opened this issue 11 months ago • 4 comments

2024-02-24T03:55:37Z INFO     imported block 0x2aad31837b6cdf8ce5964cca86b3a39b67ff872aa5eb6c954217872f6683d63d and stored state trie with root 0xe40e9900f63d1539014ce5cc4fa0e4932ae4ed1154c31cd77714d0fdf2d4e8b3	pkg=core
RUNTIME HASH: 0xd1d708433e5d7bcb9fc30a3de67952c1611fa58cd1980146ab07aa438116d352
2024-02-24T03:55:37Z ERROR    block data processing for block with hash 0x0feb9c8bb9018bbcf20b467e15636de02397de00c66233c5b4a06f5be3a2f558 failed: processing block data with header and body: handling block: failed to execute block 17088245: running runtime function: bumping: allocator out of space: current pages 65535 greater than max wasm pages 65535 (recovered by wazero)
wasm stack trace:
	env.ext_allocator_malloc_version_1(i32) i32
	._ZN5sp_io9allocator26extern_host_function_impls6malloc17h175ebffadfcb2d44E(i32) i32
	._ZN5alloc7raw_vec11finish_grow17h97bec91bef32a5ebE.llvm.1609301459894010017(i32,i32,i32,i32)
	._ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h1ed9c357b734a946E(i32,i32)
	._ZN18parity_scale_codec5codec19decode_vec_with_len17h15e2f12c18cd243fE(i32,i32,i32)
	._ZN13frame_support7storage8unhashed3get17hb6cef7316d8fee28E(i32,i32,i32)
	._ZN353_$LT$pallet_staking..ExposureOf$LT$T$GT$$u20$as$u20$sp_runtime..traits..Convert$LT$$LT$T$u20$as$u20$frame_system..pallet..Config$GT$..AccountId$C$core..option..Option$LT$pallet_staking..Exposure$LT$$LT$T$u20$as$u20$frame_system..pallet..Config$GT$..AccountId$C$$LT$T$u20$as$u20$pallet_staking..pallet..pallet..Config$GT$..CurrencyBalance$GT$$GT$$GT$$GT$7convert17h469fc883609fa8e1E(i32,i32)
	._ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h34924d7a99a1d609E(i32,i32)
	._ZN14pallet_session10historical7onchain39store_session_validator_set_to_offchain17h0e99a24533159ab9E(i32)
	._ZN893_$LT$$LP$TupleElement0$C$TupleElement1$C$TupleElement2$C$TupleElement3$C$TupleElement4$C$TupleElement5$C$TupleElement6$C$TupleElement7$C$TupleElement8$C$TupleElement9$C$TupleElement10$C$TupleElement11$C$TupleElement12$C$TupleElement13$C$TupleElement14$C$TupleElement15$C$TupleElement16$C$TupleElement17$C$TupleElement18$C$TupleElement19$C$TupleElement20$C$TupleElement21$C$TupleElement22$C$TupleElement23$C$TupleElement24$C$TupleElement25$C$TupleElement26$C$TupleElement27$C$TupleElement28$C$TupleElement29$C$TupleElement30$C$TupleElement31$C$TupleElement32$C$TupleElement33$C$TupleElement34$C$TupleElement35$C$TupleElement36$C$TupleElement37$C$TupleElement38$C$TupleElement39$C$TupleElement40$C$TupleElement41$C$TupleElement42$C$TupleElement43$C$TupleElement44$C$TupleElement45$C$TupleElement46$C$TupleElement47$RP$$u20$as$u20$frame_support..traits..hooks..OnInitialize$LT$BlockNumber$GT$$GT$13on_initialize17h252b62108cdb4ff8E(i32,i32)
	._ZN15frame_executive104Executive$LT$System$C$Block$C$Context$C$UnsignedValidator$C$AllPalletsWithSystem$C$COnRuntimeUpgrade$GT$16initialize_block17h25c01251da9ce46aE(i32)
	._ZN15frame_executive104Executive$LT$System$C$Block$C$Context$C$UnsignedValidator$C$AllPalletsWithSystem$C$COnRuntimeUpgrade$GT$13execute_block17h04bd9202b85e1554E(i32)
	.Core_execute_block(i32,i32) i64	pkg=sync

edit After a long debug session with @jimjbrettj™ we found that the allocation problem was related with long running instances and that is not trivial to clean up the runtime memory after a block execution (manually clean up), so here is a suggestion of what can be done to solve and a reference to substrate (https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L170C39-L170C55)

  • [ ] Instantiate a runtime for every block execution

EclesioMeloJunior avatar Feb 26 '24 12:02 EclesioMeloJunior

Related to the allocator problem I will need to wait for EC2 node to run towards block 17M to validate the changes to the allocator, I cannot reproduce it locally since when I start the node the memory is completely empty when executing the block, differently from the ec2 node where the allocator/memory is being used every imported block. Given that the allocator exports its metrics through Prometheus we can compare the usage now and before (where Gossamer had less one page)

EclesioMeloJunior avatar Mar 11 '24 16:03 EclesioMeloJunior

After investigating substrate FreeingBumpHeapAllocator I notice that Gossamer is not cleaning up the allocator after block execution, instead Gossamer is refreshing the allocator only on every runtime upgrade

Here is the call_impl method which is called when the host needs to call a runtime func

  • https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L163C5-L163C14
  • https://github.com/paritytech/polkadot-sdk/blob/0c6c837f689a287583508506e342ba07687e8d26/substrate/client/executor/wasmtime/src/runtime.rs#L170C39-L170C55

Also the PR https://github.com/ChainSafe/gossamer/pull/3800 introduces the fix

EclesioMeloJunior avatar Mar 18 '24 15:03 EclesioMeloJunior

After investigations with @jimjbrettj™ we found that the allocator still facing the same problem, just reinstantiating the allocator is not enough to solve the allocator out of space problem and we are facing wasm out of bounds access problems as shown in the image belown.

image

EclesioMeloJunior avatar Apr 03 '24 21:04 EclesioMeloJunior

Same problem happening on westend staging at block #9412261

image

EclesioMeloJunior avatar Apr 05 '24 13:04 EclesioMeloJunior

While syncing to #17m we faced a super slowdown after a runtime upgrade (0.9.25) which I am investigating at https://github.com/ChainSafe/gossamer/issues/3950

EclesioMeloJunior avatar May 13 '24 15:05 EclesioMeloJunior

The fix was confirmed in our staging nodes given that the latest block finalized was #17272320 image

fyi @jimjbrettj @P1sar @timwu20

EclesioMeloJunior avatar Jun 26 '24 12:06 EclesioMeloJunior