mina icon indicating copy to clipboard operation
mina copied to clipboard

Investigate nodes being out-of-sync for long

Open georgeee opened this issue 1 year ago • 6 comments

During running tests on ITN it was noticed that certain nodes keep being in catchup on the ~same block height for a prolonged time.

This requires a brief investigation to confirm this seems to be a known issue with Super-catchup and not anything else.

P.S. initially cases of nodes often being in long catch up were also on radar, but this seems to be entirely due to stop-and-cleanup tackling from the Orchestrator.

georgeee avatar Jan 17 '24 18:01 georgeee

Findings to the date:

Some nodes shutdown and take a very long time to recover. Reasons for shutdown are:

  1. OOM: in this case there is an abrupt end of logs of one execution continued by logs of the next execution a. In this case upon restart there is (often) a failure to load persistent frontier, supposedly because it wasn't properly closed b. Present only on the nodes with low (~15 GB) RAM
  2. Forceful restart by the Orchestrator a. Distinguished by a special log line, not present when running an experiment without stopping the nodes)
  3. Shutdown of Daemon on a timer a. No special log messages and graceful shutdown

In the case of shutdown on a timer it isn't clear why the Mina node requires bootstrap. In other cases this is easily explained. It isn't clear whether the three reasons of shutdown above explain all of the cases (maybe there is a fourth reason not covered).

After shutdown there is normally an extended period (~ 4 hours) of no blocks reported in the best tip log. Reason for why this happens isn't clear. A possible explanation is problems with fetching data from the network.

After the first block is reported to the best tip log (after the long pause), catchup is performed at a steady pace, though way slower than normally expected with an average time to process a block averaging (average taken over 100 consecutive blocks) from 30 seconds to 3 minutes. See table for more details.

georgeee avatar Jan 19 '24 20:01 georgeee

One log that might be related:

2024-01-19 22:10:05.842323Z] Long async cyc 643.546315777 ["serve_client_rpcs","mina"]
[2024-01-19 22:10:05.844397Z] Long async job 5.247238187 ["dispatch_user_command_verification","build_breadcrumb","handle_super_catchup_jobs","perform_super_catchup","transition_router","mina_lib","mina"]

[2024-01-19 22:15:56.092503Z] Long async cyc 226.315262121 ["manage_verifier_subprocess","mina_lib","mina"]
[2024-01-19 22:15:56.115380Z] Long async job 6.764905498 ["handle_super_catchup_jobs","perform_super_catchup","transition_router","mina_lib","mina"]

[2024-01-19 22:15:59.918508Z] Long async cyc 345.769238609 ["handle_snark_pool_gossip","validate_libp2p_gossip","handle_libp2p_ipc_push_gossip_received","handle_libp2p_ipc_push","handle_libp2p_ipc_incoming","manage_libp2p_helper_subprocess","mina_net2","gossip_net","mina_networking","mina_lib","mina"]
[2024-01-19 22:15:59.918631Z] Long async job 46.398950847 ["transition_handler_processor","process_blocks","transition_router","mina_lib","mina"]
[2024-01-19 22:16:00.490956Z] Long async cyc 3.846963739 ["node_status_service"]
[2024-01-19 22:16:00.491111Z] Long async job 581.142782005 ["process_blocks","transition_router","mina_lib","mina"]
[2024-01-19 22:16:00.606648Z] Long async job 220.67789062 ["persistent_frontier_write_to_disk","transition_handler_processor","process_blocks","transition_router","mina_lib","mina"]
[2024-01-19 22:16:00.799422Z] Long async job 344.454421389 []

Logs: logs2-2024-01-20_02-23-54.tar - Rincahya Maragung.gz

UPD: seems like the Mina node was running only on 4vCPU / 8GB RAM, which is too low

georgeee avatar Feb 05 '24 18:02 georgeee

Another instance, in logs provided by user ak 47:

[2024-01-21 14:41:35.596298Z] Long async cyc 630.380755318 ["node_status_service"]
[2024-01-21 14:41:35.598062Z] Long async job 630.347027474 []


[2024-01-21 14:58:44.374674Z] Long async cyc 667.632535441 ["handle_libp2p_ipc_incoming","manage_libp2p_helper_subprocess","mina_net2","gossip_net","mina_networking","mina_lib","mina"]
[2024-01-21 14:58:44.377330Z] Long async job 667.549390928 []


[2024-01-21 14:41:35.596298Z] Long async cyc 630.380755318 ["node_status_service"]
[2024-01-21 14:41:35.598062Z] Long async job 630.347027474 []

[2024-01-21 15:54:57.558952Z] Long async cyc 1834.989848475 ["manage_verifier_subprocess","mina_lib","mina"]
[2024-01-21 15:54:57.559931Z] Long async job 1834.688906114 []
[2024-01-21 15:57:10.996603Z] Long async cyc 1.477091277 ["manage_verifier_subprocess","mina_lib","mina"]
[2024-01-21 15:57:11.355608Z] Long async job 129.714462255 ["collect_gc_metrics","mina"]
[2024-01-21 15:57:12.302686Z] Long async cyc 0.761158801 ["manage_verifier_subprocess","mina_lib","mina"]
[2024-01-21 15:57:33.500849Z] Long async cyc 131.315848106 ["node_status_service"]
[2024-01-21 15:57:33.531972Z] Long async cyc 0.822287762 ["execute_gossip_net_bans","gossip_net","mina_networking","mina_lib","mina"]
[2024-01-21 15:57:33.538497Z] Long async cyc 0.792545033 ["execute_gossip_net_bans","gossip_net","mina_networking","mina_lib","mina"]
[2024-01-21 15:57:35.841373Z] Long async cyc 20.489299314 ["execute_gossip_net_bans","gossip_net","mina_networking","mina_lib","mina"]
[2024-01-21 15:57:35.842240Z] Long async job 20.409901466 ["manage_prover_subprocess","mina_lib","mina"]
[2024-01-21 15:57:43.489611Z] Long async cyc 9.915455203 ["manage_verifier_subprocess","mina_lib","mina"]
[2024-01-21 15:57:43.489858Z] Long async job 7.385109333 ["manage_prover_subprocess","mina_lib","mina"]

georgeee avatar Feb 05 '24 21:02 georgeee

From niuniu:

[2024-01-21 10:25:49.236056Z] Long async cyc 15.101878393 []
[2024-01-21 10:25:49.236086Z] Long async job 15.101868334 []

georgeee avatar Feb 05 '24 21:02 georgeee

Seems like nearly all of the logs analyzed contain instances of Gc.compact taking 10-30s or more, in one occasion all the way up to 30 minutes.

At least in some occasions, this is without doubt cause of troubles

Log groups from different users:

ak 47

["2024-01-21 12:53:17.995893Z",504.8221447467804]
["2024-01-21 13:08:13.160847Z",528.0134308338165]
["2024-01-21 13:24:43.122416Z",629.7525911331177]
["2024-01-21 13:40:11.672454Z",567.0497577190399]
["2024-01-21 13:54:12.227903Z",464.976078748703]
["2024-01-21 14:09:37.936635Z",565.6846795082092]
["2024-01-21 14:25:05.148347Z",219.71170234680176]
["2024-01-21 14:41:35.517312Z",630.3250269889832]
["2024-01-21 14:58:44.204328Z",667.5209245681763]
["2024-01-21 15:18:21.888480Z",808.6710586547852]
["2024-01-21 15:54:56.736854Z",1834.6435616016388]
["2024-01-21 17:14:33.017003Z",283.88019156455994]
["2024-01-21 17:22:25.127257Z",112.0936028957367]
["2024-01-21 17:31:03.551978Z",156.2179398536682]
["2024-01-21 17:46:48.889551Z",107.29936933517456]
["2024-01-22 07:42:36.127431Z",144.90250945091248]

Ysn

["2024-01-18 04:01:11.576935Z",114.9892680644989]
["2024-01-18 04:31:56.495882Z",144.00046157836914]
["2024-01-18 04:47:16.685208Z",119.0544102191925]
["2024-01-18 05:02:37.598403Z",111.06834268569946]
["2024-01-18 05:10:48.130463Z",130.51965308189392]
["2024-01-18 05:18:55.748143Z",127.60154604911804]
["2024-01-18 05:34:35.291416Z",120.59852647781372]
["2024-01-18 05:42:39.285534Z",123.97581696510315]
["2024-01-18 05:50:44.564362Z",125.27789759635925]
["2024-01-18 05:58:46.654309Z",122.08535671234131]
["2024-01-18 06:07:15.829793Z",145.20384335517883]
["2024-01-18 06:15:13.710123Z",117.87411403656006]
["2024-01-18 06:30:29.673280Z",111.04501104354858]
["2024-01-18 06:38:19.310957Z",109.55625772476196]
["2024-01-18 06:45:59.796708Z",100.48231744766235]
["2024-01-18 07:01:40.042561Z",124.09529113769531]
["2024-01-18 07:16:48.342463Z",111.95005464553833]
["2024-01-18 07:39:31.215536Z",100.43886828422546]
["2024-01-18 07:48:01.972955Z",150.46845698356628]
["2024-01-18 07:56:36.911065Z",154.57601141929626]
["2024-01-18 08:05:30.016071Z",172.62894654273987]

niu-niu

["2024-01-21 16:55:25.148050Z",18.685834646224976]
["2024-01-21 17:01:43.753091Z",18.6048583984375]
["2024-01-21 17:08:02.586397Z",18.833157539367676]
["2024-01-21 15:20:38.185533Z",16.3585524559021]
["2024-01-21 15:26:54.699624Z",16.513938426971436]
["2024-01-21 15:33:14.098525Z",19.39867091178894]
["2024-01-21 15:39:30.225722Z",16.126976251602173]
["2024-01-21 15:45:48.512969Z",18.286989450454712]
["2024-01-21 14:55:31.358576Z",16.216827869415283]
["2024-01-21 15:01:49.438456Z",18.079686403274536]
["2024-01-21 15:08:05.594098Z",16.155375480651855]
["2024-01-21 15:14:21.826820Z",16.232566833496094]
["2024-01-21 10:13:20.988442Z",11.432328224182129]
["2024-01-21 10:19:34.133867Z",13.145203828811646]
["2024-01-21 10:25:49.235771Z",15.101763486862183]
["2024-01-21 10:32:07.929223Z",18.6564519405365]
["2024-01-21 10:38:28.192576Z",20.24768042564392]

Liviu

["2024-01-17 19:27:39.537793Z",23.92294430732727]
["2024-01-17 19:33:54.985830Z",15.44755744934082]
["2024-01-17 19:40:16.583958Z",21.597748279571533]
["2024-01-17 19:46:35.787190Z",19.202601671218872]
["2024-01-17 19:53:03.561374Z",27.440755128860474]
["2024-01-17 19:59:42.867961Z",39.09192252159119]
["2024-01-17 20:06:28.624691Z",45.75595951080322]
["2024-01-17 20:13:37.427498Z",68.80238008499146]
["2024-01-17 20:20:26.372814Z",46.59816002845764]
["2024-01-17 20:27:15.336323Z",48.96280574798584]
["2024-01-17 20:34:02.424156Z",47.08300566673279]
["2024-01-17 20:41:04.949803Z",62.52491998672485]
["2024-01-17 20:47:58.562659Z",53.60913801193237]
["2024-01-17 20:54:38.277768Z",39.71366214752197]
["2024-01-17 21:01:30.714395Z",52.436030626297]
["2024-01-17 21:08:23.250472Z",50.81245136260986]
["2024-01-17 21:15:09.280583Z",46.02927613258362]
["2024-01-17 21:21:55.073071Z",45.79161286354065]
["2024-01-17 21:28:51.901728Z",56.371169328689575]
["2024-01-17 21:35:49.928960Z",58.026756048202515]
["2024-01-17 21:42:12.067204Z",22.137787342071533]
["2024-01-17 21:48:34.426649Z",22.35802698135376]
["2024-01-17 21:55:00.215158Z",25.788029670715332]
["2024-01-17 22:01:20.219640Z",20.004000186920166]
["2024-01-17 22:07:40.573494Z",20.35329532623291]
["2024-01-17 22:14:33.097658Z",47.98686909675598]
["2024-01-17 22:21:45.575707Z",69.74341607093811]
["2024-01-17 22:28:47.945488Z",62.36924982070923]
["2024-01-17 22:35:50.976667Z",62.80559754371643]
["2024-01-17 22:42:46.809643Z",55.827723264694214]
["2024-01-17 22:49:14.001991Z",27.191711902618408]
["2024-01-17 22:55:46.237365Z",32.23403596878052]
["2024-01-17 23:02:09.756402Z",23.518356323242188]
["2024-01-17 23:08:50.906248Z",41.14815354347229]
["2024-01-17 23:15:49.075514Z",58.15772008895874]
["2024-01-17 23:22:55.747966Z",64.9766960144043]
["2024-01-17 23:29:58.201006Z",62.45245885848999]
["2024-01-17 23:37:11.735059Z",72.58862519264221]
["2024-01-17 23:44:24.573516Z",72.8346357345581]
["2024-01-17 23:52:03.544305Z",98.96986317634583]
["2024-01-17 23:58:56.459084Z",52.28652310371399]
["2024-01-18 00:06:20.615374Z",84.15543603897095]
["2024-01-18 00:13:27.141353Z",66.52414989471436]
["2024-01-18 00:20:46.189733Z",76.92704391479492]
["2024-01-18 00:27:47.748258Z",61.54969644546509]
["2024-01-18 00:34:28.548274Z",40.799317598342896]
["2024-01-18 00:41:32.604480Z",64.01076698303223]
["2024-01-18 00:48:43.269559Z",70.64451122283936]
["2024-01-18 00:55:49.594661Z",66.32388639450073]
["2024-01-18 01:03:06.222958Z",75.27819681167603]
["2024-01-18 01:10:00.788479Z",52.98515844345093]
["2024-01-18 01:17:25.233504Z",84.4407172203064]
["2024-01-18 01:25:06.021507Z",100.78377676010132]
["2024-01-18 01:32:39.979243Z",92.04528284072876]
["2024-01-18 01:40:32.225894Z",112.2389485836029]
["2024-01-18 01:48:52.886002Z",140.65778303146362]
["2024-01-18 01:56:10.134610Z",75.36942791938782]
["2024-01-18 02:03:14.727767Z",64.4670922756195]
["2024-01-18 02:10:20.056999Z",65.32743859291077]
["2024-01-18 02:17:23.126941Z",60.836493492126465]
["2024-01-18 02:23:57.418354Z",34.290013790130615]
["2024-01-18 02:30:37.547232Z",40.12626075744629]
["2024-01-18 02:37:19.375234Z",41.44810700416565]
["2024-01-18 02:44:03.675936Z",44.297035217285156]
["2024-01-18 02:50:52.889566Z",49.21195888519287]
["2024-01-18 02:58:10.733491Z",77.84350872039795]
["2024-01-18 03:05:17.275514Z",66.54068732261658]
["2024-01-18 03:12:37.287831Z",80.01174330711365]
["2024-01-18 03:20:28.999256Z",110.8835928440094]
["2024-01-18 03:28:05.892372Z",86.14221620559692]
["2024-01-18 03:35:44.993873Z",99.10082530975342]
["2024-01-18 03:43:28.048062Z",103.05302619934082]
["2024-01-18 03:51:19.611237Z",101.58762264251709]
["2024-01-18 03:59:18.587034Z",118.97065949440002]
["2024-01-18 04:06:51.004445Z",92.41347813606262]
["2024-01-18 04:15:21.420450Z",150.4155125617981]
["2024-01-18 04:22:42.301204Z",80.87987399101257]
["2024-01-18 04:30:12.354346Z",90.05148768424988]
["2024-01-18 04:36:56.043824Z",31.890715837478638]
["2024-01-18 04:43:47.262435Z",51.217018604278564]
["2024-01-18 04:50:20.373271Z",33.11041283607483]
["2024-01-18 04:57:01.181482Z",40.68862795829773]
["2024-01-18 05:03:51.205611Z",50.02344584465027]
["2024-01-18 05:10:22.015936Z",30.80981469154358]
["2024-01-18 05:17:08.275368Z",46.23347783088684]
["2024-01-18 05:23:41.405522Z",33.12946677207947]
["2024-01-18 05:30:16.129038Z",34.72152352333069]
["2024-01-18 05:37:12.562662Z",56.241315603256226]
["2024-01-18 05:44:48.099274Z",95.37126994132996]
["2024-01-18 05:52:29.813174Z",101.7082839012146]
["2024-01-18 06:02:41.246000Z",251.42422008514404]
["2024-01-18 06:10:35.196414Z",59.441434144973755]
["2024-01-18 06:18:30.838566Z",115.64164090156555]
["2024-01-18 06:26:59.100117Z",147.9541871547699]
["2024-01-18 06:35:09.152436Z",126.15866589546204]
["2024-01-18 06:42:59.685029Z",110.5280237197876]
["2024-01-18 06:50:24.324989Z",84.63778614997864]
["2024-01-18 06:57:13.347963Z",49.0224883556366]
["2024-01-18 07:04:13.642830Z",60.29296827316284]
["2024-01-18 07:11:04.574417Z",49.07143044471741]
["2024-01-18 07:17:43.400118Z",38.82521605491638]
["2024-01-18 07:24:42.062926Z",58.66053223609924]
["2024-01-18 07:31:18.334612Z",36.271268367767334]
["2024-01-18 07:38:13.046413Z",54.7070095539093]
["2024-01-18 07:44:53.309155Z",40.26231503486633]
["2024-01-18 07:52:06.390144Z",71.48199677467346]
["2024-01-18 08:00:21.238935Z",134.84619784355164]
["2024-01-18 08:07:55.204526Z",93.83891272544861]
["2024-01-18 08:17:23.592919Z",208.1533181667328]
["2024-01-18 08:24:37.509010Z",72.05350375175476]
["2024-01-18 08:32:44.739150Z",127.2284288406372]
["2024-01-18 08:39:54.361592Z",64.97270202636719]
["2024-01-18 08:49:56.507798Z",242.14460134506226]
["2024-01-18 08:58:41.737311Z",165.22016716003418]
["2024-01-18 09:06:11.787776Z",90.04574346542358]
["2024-01-18 09:15:49.435826Z",201.10381031036377]

georgeee avatar Feb 05 '24 23:02 georgeee

Logs: dir1, dir2

georgeee avatar Feb 05 '24 23:02 georgeee