
Rococo validator using paritydb stuck on startup with "Missing table, starting reindex"

Open · PierreBesson opened this issue on Jan 19, 2023 · 10 comments

I have several Rococo validators which remain stuck after startup. The only way to fix the issue is to restore the database from a backup.

Version: 0.9.37
Relevant flags:

--chain=rococo
--validator
--database=paritydb
--pruning=1000
--enable-offchain-indexing=true
2023-01-19 14:08:16.389  INFO main sc_cli::runner: Parity Polkadot    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: ✌️  version 0.9.37-645723987cf    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: ❤️  by Parity Technologies <[email protected]>, 2017-2023    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 📋 Chain specification: Rococo    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 🏷  Node name: rococo-validator-f-node-9    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 👤 Role: AUTHORITY    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: 💾 Database: ParityDb at /chain-data/chains/rococo_v2_2/paritydb/full    
2023-01-19 14:08:16.389  INFO main sc_cli::runner: ⛓  Native runtime: rococo-9370 (parity-rococo-v2.0-0.tx18.au0)    
2023-01-19 14:08:27.060  WARN main parity-db: Missing table 09-19, starting reindex    
2023-01-19 14:08:27.060  INFO main parity-db: Started reindex for 09-18    
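
For completeness, these flags correspond to an invocation along the following lines; the binary name and base path are assumptions inferred from the log above:

polkadot \
  --chain=rococo \
  --validator \
  --database=paritydb \
  --pruning=1000 \
  --enable-offchain-indexing=true \
  --base-path=/chain-data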

I also noticed that just before the node became stuck like this, it experienced huge CPU/memory usage, which caused several OOM kills. Maybe the database was corrupted by the OOM kills?

[screenshot: node CPU/memory usage before it got stuck]
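
One way to confirm the OOM kills is to check the kernel log around that time; a sketch, assuming journald is available on the host (use dmesg -T otherwise):

# Kernel messages mentioning the OOM killer around the incident
journalctl -k --since "2023-01-19" | grep -iE 'out of memory|oom-kill'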

PierreBesson, Jan 19 '23

Stuck in the sense that it did not print anything anymore? Does it still respond to RPC? I remember there was a similar issue about stuck nodes.
cc @bkchr
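
A quick way to check is to query the node's JSON-RPC endpoint directly; a sketch, assuming the default HTTP RPC port for this release:

# system_health returns immediately on a responsive node
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"system_health","params":[]}' \
  http://127.0.0.1:9933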

ggwpez, Jan 19 '23

Yes, it no longer prints anything or responds to RPC.

PierreBesson, Jan 19 '23

Cc @arkpar @cheme

bkchr, Jan 19 '23

Can we get a copy of the database?

arkpar, Jan 19 '23

The reports that I've requested in https://github.com/paritytech/devops/issues/1539 would be useful here.

arkpar, Jan 19 '23

Here is the eu-stack report:

PID 28771 - process
TID 28771:
#0  0x00007f562c3114eb
#1  0x000055a359efe7e5 _$LT$std..io..buffered..bufreader..BufReader$LT$R$GT$$u20$as$u20$std..io..Read$GT$::read_exact::h5ea24cb7b21ce235
#2  0x000055a359f3737f parity_db::index::IndexTable::enact_plan::h776f9c37f4f412db
#3  0x000055a359f36a9e parity_db::column::Column::enact_plan::heb491e23448755eb
#4  0x000055a359f0c707 parity_db::db::DbInner::enact_logs::hdf86a9ce7012f638
#5  0x000055a359f036d0 parity_db::db::Db::open_inner::h339b4797ff8ab9e1
#6  0x000055a35ac92c32 sc_client_db::parity_db::open::h198a5d9b607798de
#7  0x000055a35ac926e5 sc_client_db::utils::open_parity_db::hf4d4efe05aa680f9
#8  0x000055a35ac919d9 sc_client_db::utils::open_database_at::h0df44d5d0dcc1ee0
#9  0x000055a35ac90acc sc_client_db::utils::open_database::hb5792630d9adde8c
#10 0x000055a35ac749eb sc_service::builder::new_db_backend::h784f64f15853cb5b
#11 0x000055a35af44d1f polkadot_service::new_partial_basics::hcd0a94cfdf3ccdcd
#12 0x000055a35a09830e polkadot_service::new_full::h4b35fa300d075633
#13 0x000055a35a091d75 _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h399f21d26f8d9414
#14 0x000055a35a08df38 sc_cli::runner::Runner$LT$C$GT$::run_node_until_exit::hf4c1e2c26018ea20
#15 0x000055a35a065e7d polkadot_cli::command::run::h03cd3a03cb55d5ab
#16 0x000055a359568c11 polkadot::main::hcad9695bb0e3bea5
#17 0x000055a359566b13 std::sys_common::backtrace::__rust_begin_short_backtrace::hc9ae75d99f9dc8d6
#18 0x000055a359566b2e std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h6b9bd7c21bb5d6a8
#19 0x000055a359569317 main
#20 0x00007f562c1d909b __libc_start_main
#21 0x000055a35956672a _start
TID 28774:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c339 parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c536057 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#3  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#4  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#5  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#6  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#7  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#8  0x00007f562c506fa3 start_thread
#9  0x00007f562c2ae06f __clone
TID 28775:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c339 parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c536057 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#3  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#4  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#5  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#6  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#7  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#8  0x00007f562c506fa3 start_thread
#9  0x00007f562c2ae06f __clone
TID 28776:
#0  0x00007f562c2ae38f epoll_wait
#1  0x000055a35c53181c tokio::runtime::io::Driver::turn::h51a482e565093d0a
#2  0x000055a35c53143b tokio::runtime::time::Driver::park_internal::h27a4841415f8b919
#3  0x000055a35c5360cc tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#4  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#5  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#6  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#7  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#8  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#9  0x00007f562c506fa3 start_thread
#10 0x00007f562c2ae06f __clone
TID 28777:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c339 parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c536057 tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout::h2c9e2442c68c8958
#3  0x000055a35c53517a tokio::runtime::scheduler::multi_thread::worker::run::h76fe39a55166c5fb
#4  0x000055a35c53d52e tokio::runtime::task::raw::poll::h55008311e865ed46
#5  0x000055a35c52bb99 std::sys_common::backtrace::__rust_begin_short_backtrace::h68ca8e65dac86410
#6  0x000055a35c52b835 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h344e28d62d391c73
#7  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#8  0x00007f562c506fa3 start_thread
#9  0x00007f562c2ae06f __clone
TID 28778:
#0  0x00007f562c2a8af9 syscall
#1  0x000055a359f4c1ff parking_lot::condvar::Condvar::wait_until_internal::ha30c80197dd3a4d5
#2  0x000055a35c35cc50 std::sys_common::backtrace::__rust_begin_short_backtrace::hebc737661396ee6c
#3  0x000055a35c35ca72 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h52b4ebd31a30fd80
#4  0x000055a35c51a805 std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc
#5  0x00007f562c506fa3 start_thread
#6  0x00007f562c2ae06f __clone
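
For reference, a report like this can be produced with elfutils' eu-stack against the live process; the PID is the one shown in the report header:

# Dump the stack of every thread of the running polkadot process
eu-stack -p 28771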

PierreBesson, Jan 20 '23

Ok, it looks like the database has a lot of unprocessed commits in the queue that it tries to enact on startup. The node should start eventually. Still, a snapshot of the database would be nice.
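
If it is unclear whether the replay is still progressing or the process is wedged, the per-process I/O counters give a rough indication; a sketch, reusing the PID from the eu-stack report above:

# read_bytes growing over time means the startup log replay is still making progress
watch -n 10 cat /proc/28771/io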

~~There seem to be a lot of writes made at some point to the OFFCHAIN column, so I guess this is some offchain worker that decided to insert/update a lot of values at once. @bkchr any ideas what could cause this?~~

arkpar, Jan 20 '23

Here's also a copy of the db, compressed with zstd: https://storage.googleapis.com/rococo-blockstore-backups/rococo-paritydb-archive.tar.zst

aardbol, Jan 20 '23

> The node should start eventually.

@arkpar this particular node has been stuck for 4 days and 6 hours and has minimal CPU usage.

PierreBesson, Jan 20 '23

After examining the database: it seems like there has been a collision in the STATE column, i.e. a lot of similar keys were inserted, which blew up the index to 4 TiB. Normally this should not happen, because the STATE column only contains trie nodes, so the keys should be uniformly distributed. I'm going to sync Rococo to try to reproduce this and will get back in a few days.
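
A quick way to spot an index that has grown out of proportion is to look at the largest files under the column directory; the path is the one from the startup log, and the exact file layout is a ParityDb implementation detail:

# Largest files in the ParityDb directory; a multi-TiB index file stands out immediately
ls -lhS /chain-data/chains/rococo_v2_2/paritydb/full | head -n 20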

arkpar, Jan 20 '23

Fixed in https://github.com/paritytech/substrate/pull/13226. Note that the database linked above is already damaged, so I'd recommend making a new sync from scratch.
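
Resyncing from scratch amounts to purging the damaged database and restarting with the original flags; a sketch, assuming the same base path as above:

# Remove the existing Rococo ParityDb instance (asks for confirmation)
polkadot purge-chain --chain rococo --database paritydb --base-path /chain-data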

arkpar, Jan 25 '23