
Node crashes when broker can't connect to MongoDB

alex-core-0x opened this issue 3 years ago • 4 comments

There's an issue with the Symbol node: if the broker can't connect to MongoDB (for whatever reason), the whole node crashes.

2021-07-15 07:30:27.362775 0x00007f609d2da400: <info> (plugins::PluginModule.cpp@83) loading plugin from /usr/catapult/lib/libcatapult.mongo.plugins.accountlink.so
2021-07-15 07:30:58.034710 0x00007f609d2da400: <error> (extensions::ProcessBootstrapper.cpp@103) Throwing exception: Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::exception_detail::current_exception_std_exception_wrapper<std::runtime_error> >
std::exception::what: No suitable servers found: `serverSelectionTimeoutMS` expired: [Failed to connect to target host: db:27017]: generic server error
mongocxx::v_noabi::bulk_write_exception
2021-07-15 07:30:58.034785 0x00007f609d2da400: <fatal> (process::ProcessMain.cpp@75)
thread: broker catapult
unhandled exception while running local node!
/catapult-src/src/catapult/extensions/ProcessBootstrapper.cpp(103): Throw in function void catapult::extensions::{anonymous}::LoadExtension(const catapult::plugins::PluginModule&, catapult::extensions::ProcessBootstrapper&)
Dynamic exception type: boost::wrapexcept<catapult::plugins::plugin_load_error>
std::exception::what: Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::exception_detail::current_exception_std_exception_wrapper<std::runtime_error> >
std::exception::what: No suitable servers found: `serverSelectionTimeoutMS` expired: [Failed to connect to target host: db:27017]: generic server error
mongocxx::v_noabi::bulk_write_exception

/symbol-commands/start.sh: line 126:    11 Aborted                 (core dumped) $catapultAppFolder/bin/$processName "$config"

The only way to recover after that is to run resetData. The broker and node try to recover on their own, but it doesn't work.
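One possible mitigation (a sketch only, not part of symbol-bootstrap's start.sh) is to gate the broker start on MongoDB availability so it does not abort and force a resetData; this assumes the db:27017 host from the error above and the legacy mongo shell being available in the container:

# Hypothetical wait loop: block until MongoDB answers a ping before starting the broker.
until mongo --host db:27017 --quiet --eval 'db.runCommand({ ping: 1 })' >/dev/null 2>&1; do
    echo "waiting for MongoDB at db:27017..."
    sleep 5
done
echo "MongoDB is up; starting the broker"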

alex-core-0x · Jul 15 '21

@Jaguar0625 @gimer @Wayonb is this one a catapult issue?

fboucquez · Jul 18 '21

@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?

Wayonb · Jul 18 '21

@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?

@Wayonb

The server lags heavily (for some reason) when the nodes are trying to recover, even though the CPU, RAM and disk resources are more than sufficient.

MongoDB lags (see the slow-query log below) -> the node crashes while querying it (apparently a timeout) -> the node is dead. A timing check for the same query is sketched after the log lines.

{"t":{"$date":"2021-07-20T02:53:08.803+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":152}}
{"t":{"$date":"2021-07-20T02:53:58.471+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":107}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.blocks","command":{"count":"blocks","lsid":{"id":{"$uuid":"2fcc8a2e-1b89-43bd-ad2d-d9cca9e64cdf"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn23","msg":"Slow query","attr":{"type":"command","ns":"catapult.transactions","command":{"count":"transactions","lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.841+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn25","msg":"Slow query","attr":{"type":"command","ns":"catapult.accounts","command":{"count":"accounts","lsid":{"id":{"$uuid":"e8ae4459-9cda-45b2-8098-f7003ae1a43a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}

Below are the log lines that appear frequently.

2021-07-20 06:33:52.927090 0x00007f40e3fff700: (utils::StackLogger.h@77) slow operation detected: 'BlockChainSyncConsumer::commitAll' (5390ms)

  • 0ms: 'save the peer chain into storage' (28ms)
  • 28ms: 'prune delta cache' (0ms)
  • 28ms: 'indicate a state change' (18ms)
  • 46ms: 'commit changes to the in-memory cache' (5248ms)
  • 5294ms: 'commit changes to the primary block chain storage' (76ms)
  • 5370ms: 'update the unconfirmed transactions' (20ms)

alex-core-0x · Jul 20 '21

@alex-core-0x Thanks for the information above. It's not enough to determine what's going on, though. Is it possible to get the full logs? The "slow operation detected" messages should not cause the MongoDB issue. What is your node's resource setup (CPU, RAM)?
Do you know when MongoDB crashes, i.e. what is going on on the server at the time?
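A rough way to gather what is being asked for (full logs plus host resources) in a docker-compose based symbol-bootstrap deployment might look like the sketch below; the service names db and broker are assumptions and should be adjusted to the actual setup:

# Hypothetical collection commands for attaching diagnostics to this issue.
docker-compose logs --tail=1000 db     > mongodb.log
docker-compose logs --tail=1000 broker > broker.log
nproc    > resources.txt     # CPU core count
free -h >> resources.txt     # RAM
df -h   >> resources.txt     # disk space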

Wayonb · Jul 21 '21