symbol-bootstrap
Node crashes when broker can't connect to MongoDB
There's an issue with the Symbol node: if the broker can't connect to MongoDB (for whatever reason), it crashes the node.
2021-07-15 07:30:27.362775 0x00007f609d2da400: <info> (plugins::PluginModule.cpp@83) loading plugin from /usr/catapult/lib/libcatapult.mongo.plugins.accountlink.so
2021-07-15 07:30:58.034710 0x00007f609d2da400: <error> (extensions::ProcessBootstrapper.cpp@103) Throwing exception: Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::exception_detail::current_exception_std_exception_wrapper<std::runtime_error> >
std::exception::what: No suitable servers found: `serverSelectionTimeoutMS` expired: [Failed to connect to target host: db:27017]: generic server error
mongocxx::v_noabi::bulk_write_exception
2021-07-15 07:30:58.034785 0x00007f609d2da400: <fatal> (process::ProcessMain.cpp@75)
thread: broker catapult
unhandled exception while running local node!
/catapult-src/src/catapult/extensions/ProcessBootstrapper.cpp(103): Throw in function void catapult::extensions::{anonymous}::LoadExtension(const catapult::plugins::PluginModule&, catapult::extensions::ProcessBootstrapper&)
Dynamic exception type: boost::wrapexcept<catapult::plugins::plugin_load_error>
std::exception::what: Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::exception_detail::current_exception_std_exception_wrapper<std::runtime_error> >
std::exception::what: No suitable servers found: `serverSelectionTimeoutMS` expired: [Failed to connect to target host: db:27017]: generic server error
mongocxx::v_noabi::bulk_write_exception
/symbol-commands/start.sh: line 126: 11 Aborted (core dumped) $catapultAppFolder/bin/$processName "$config"
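For reference, the driver-level failure in the trace above ("No suitable servers found: `serverSelectionTimeoutMS` expired") can be reproduced with a minimal mongocxx probe against an unreachable host. This is only a sketch of the error path, not catapult's actual broker code; the connection URI and the 2s timeout are assumptions:

```cpp
#include <iostream>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/exception/exception.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance instance{};

    // "db:27017" matches the host in the trace; the 2s timeout is an assumption.
    mongocxx::client client{
        mongocxx::uri{"mongodb://db:27017/?serverSelectionTimeoutMS=2000"}};

    try {
        // Any command forces server selection; if MongoDB is unreachable the
        // driver throws once serverSelectionTimeoutMS expires.
        client["catapult"].run_command(make_document(kvp("ping", 1)));
        std::cout << "MongoDB is reachable\n";
    } catch (const mongocxx::exception& e) {
        // e.what() contains "No suitable servers found: `serverSelectionTimeoutMS` expired ..."
        std::cerr << "MongoDB unreachable: " << e.what() << "\n";
        return 1;
    }
}
```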
The only way to recover after that is to run resetData. The broker and the node try to recover on their own, but it doesn't work.
@Jaguar0625 @gimer @Wayonb is this a catapult issue?
@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?
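Since the broker requires MongoDB to be running first, one possible mitigation is a readiness probe that pings MongoDB with a short retry loop before the broker is launched. This is a minimal sketch under assumptions (host, database name, retry budget, and sleep interval are made up here), not symbol-bootstrap's actual startup logic:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/exception/exception.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

// Ping MongoDB until it answers or the retry budget runs out.
bool waitForMongo(const std::string& uriString, int maxAttempts) {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::client client{mongocxx::uri{uriString}};
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            client["catapult"].run_command(make_document(kvp("ping", 1)));
            return true; // server selected and responding
        } catch (const mongocxx::exception& e) {
            std::cerr << "mongo not ready (attempt " << attempt << "): " << e.what() << "\n";
            std::this_thread::sleep_for(std::chrono::seconds(5));
        }
    }
    return false;
}

int main() {
    mongocxx::instance instance{};
    // Fail each attempt fast with a short serverSelectionTimeoutMS (value is an assumption).
    if (!waitForMongo("mongodb://db:27017/?serverSelectionTimeoutMS=2000", 12))
        return 1; // refuse to start the broker instead of aborting later
    // ... start the broker only after MongoDB responds ...
    return 0;
}
```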
@Wayonb
The server lags heavily (for some reason) when the nodes are trying to recover, even though CPU, RAM, and disk resources are more than sufficient.
MongoDB lags (see the slow-query log below) -> the node crashes while querying it (it looks like a timeout) -> the node is dead.
{"t":{"$date":"2021-07-20T02:53:08.803+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":152}}
{"t":{"$date":"2021-07-20T02:53:58.471+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":107}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.blocks","command":{"count":"blocks","lsid":{"id":{"$uuid":"2fcc8a2e-1b89-43bd-ad2d-d9cca9e64cdf"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn23","msg":"Slow query","attr":{"type":"command","ns":"catapult.transactions","command":{"count":"transactions","lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.841+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn25","msg":"Slow query","attr":{"type":"command","ns":"catapult.accounts","command":{"count":"accounts","lsid":{"id":{"$uuid":"e8ae4459-9cda-45b2-8098-f7003ae1a43a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
Here are log entries that appear frequently:
2021-07-20 06:33:52.927090 0x00007f40e3fff700:
- 0ms: 'save the peer chain into storage' (28ms)
- 28ms: 'prune delta cache' (0ms)
- 28ms: 'indicate a state change' (18ms)
- 46ms: 'commit changes to the in-memory cache' (5248ms)
- 5294ms: 'commit changes to the primary block chain storage' (76ms)
- 5370ms: 'update the unconfirmed transactions' (20ms)
@alex-core-0x Thanks for the information above. It's not enough information to determine what's going on. Is it possible to get full logs?
The slow operations detected should not cause the issue with MongoDB.
What is your node's resource setup (CPU, RAM)?
Do you know when MongoDB crashes, i.e. what is going on on the server at that time?