bsc icon indicating copy to clipboard operation
bsc copied to clipboard

Occasional multi-second lags

Open kawahwookiee opened this issue 3 years ago • 13 comments

Hi. I've just synced a full node from a Feb 2 snapshot. It's been running for several hours now, mostly ok, howerver, there are occasional lags or freezes that violate the ~3 sec period between blocks. The logs look something like this:

.....................
t=2022-02-05T13:02:19+0000 lvl=info msg="Imported new chain segment"             blocks=1  txs=222  mgas=32.900  elapsed=190.457ms mgasps=172.742  number=14,990,583 hash=0x2b354b6470cb72ca41aec5542b58bcfa02bc7ca24d3b37a34ca34928316c1a48 dirty="749.19 MiB"
t=2022-02-05T13:02:22+0000 lvl=info msg="Imported new chain segment"             blocks=1  txs=399  mgas=47.577  elapsed=289.242ms mgasps=164.488  number=14,990,584 hash=0xf084d45206bffd1b2d7d77bebe7b46095094e3e63583a5abd40ec0e52c7f67b0 dirty="751.10 MiB"
t=2022-02-05T13:02:25+0000 lvl=info msg="Imported new chain segment"             blocks=1  txs=248  mgas=37.764  elapsed=210.122ms mgasps=179.724  number=14,990,585 hash=0x35f40af3e96b2a7780b416172e46a183250dcf5e1fb32905d0d3b378be356c37 dirty="752.87 MiB"
t=2022-02-05T13:02:28+0000 lvl=info msg="Imported new chain segment"             blocks=1  txs=289  mgas=40.968  elapsed=239.823ms mgasps=170.826  number=14,990,586 hash=0xd60d54c652691c785558956ca59496bc16d29ef2c72a03ec072d860b56a6749b dirty="754.69 MiB"
t=2022-02-05T13:02:32+0000 lvl=eror msg="Snapshot extension registration failed" peer=1825f0cc err="peer connected on snap without compatible eth support"
t=2022-02-05T13:02:37+0000 lvl=info msg="Deep froze chain segment"               blocks=18 elapsed=8.447ms   number=14,900,586 hash=0xeaace8cd9475c0fad16440e39f51fac8d3e92254383ee180d8abd3d8bdc21777
t=2022-02-05T13:02:38+0000 lvl=info msg="Downloader queue stats"                 receiptTasks=0 blockTasks=0 itemSize="71.10 KiB" throttle=922
t=2022-02-05T13:02:41+0000 lvl=info msg="Imported new chain segment"             blocks=1  txs=165  mgas=33.456  elapsed=10.495s   mgasps=3.188    number=14,990,587 hash=0x35695ac468ccb6c516d1c4500fdc90af196b020cef00737c496c6ecb1f1278d8 dirty="754.85 MiB"
t=2022-02-05T13:02:41+0000 lvl=info msg="do light process success at block"      num=14,990,588
.....................

Has anyone encountered similar problems? How do I solve that? The server has plenty of horsepower, it's a 64 core AMD CPU w/ 64GB and 3.5 TB NVMe

kawahwookiee avatar Feb 05 '22 13:02 kawahwookiee

what does this Downloader queue stats line actually mean?

kawahwookiee avatar Feb 06 '22 09:02 kawahwookiee

for me happens consistently in the "Downloader queue stats" log too.

ascunha avatar Feb 07 '22 07:02 ascunha

I am not sure that Downloader queue stats is the actual culprit. It seems more like the result of the problem. when this happens, the node start lagging several blocks behind. what could the reason be? @j75689 you've helped on several occasions, perhaps, you could share some insight now?

kawahwookiee avatar Feb 07 '22 08:02 kawahwookiee

Can you do profiling when it lags, and attach the profile log here?

keefel avatar Feb 09 '22 06:02 keefel

Can you do profiling when it lags, and attach the profile log here?

Do you mean running the node with --prpof ?

kawahwookiee avatar Feb 09 '22 08:02 kawahwookiee

@KeefeL I have these detailed logs in the vicinity of the lag: here we have the last block before the lag:

...
Feb 09 08:44:15 DS7959 geth[652106]: DEBUG[02-09|08:44:15.362] Snapshot updated                         blockRoot=a0914c..e04fb0                                                                                   
Feb 09 08:44:15 DS7959 geth[652106]: DEBUG[02-09|08:44:15.362] Inserted new block                       number=15,100,522 hash=ab38c6..976a93 uncles=0 txs=154  gas=28,037,036 elapsed=59.387ms  root=a0914c..e04fb
0                                                                                                                                                                                                                  
Feb 09 08:44:15 DS7959 geth[652106]: INFO [02-09|08:44:15.362] Imported new chain segment               blocks=1  txs=154  mgas=28.037  elapsed=60.337ms  mgasps=464.672 number=15,100,522 hash=ab38c6..976a93 dirt
y=151.91MiB

then, a 6+ second gap filled with messages like that:

...
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.165] Dropping transaction announcement        id=fd3697620fd7be4e conn=inbound               count=1                                                     
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.166] Transaction broadcast                    txs=1    announce packs=54 announced hashes=54    tx packs=7  broadcast txs=7                              
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.166] Fetching batch of transactions           id=5728cd052f1995aa conn=inbound               count=6                                                     
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.204] Dropping transaction announcement        id=fd3697620fd7be4e conn=inbound               count=1                                                     
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.204] Transaction broadcast                    txs=1    announce packs=55 announced hashes=55    tx packs=7  broadcast txs=7                              
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.209] Dropping transaction announcement        id=fd3697620fd7be4e conn=inbound               count=1                                                     
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.210] Transaction broadcast                    txs=1    announce packs=60 announced hashes=60    tx packs=8  broadcast txs=8                              
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.213] Dropping transaction announcement        id=fd3697620fd7be4e conn=inbound               count=1                                                     
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.213] Transaction broadcast                    txs=1    announce packs=55 announced hashes=55    tx packs=7  broadcast txs=7                              
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.242] Dropping transaction announcement        id=fd3697620fd7be4e conn=inbound               count=1                                                     
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.242] Transaction broadcast                    txs=1    announce packs=63 announced hashes=63    tx packs=8  broadcast txs=8                              
Feb 09 08:44:16 DS7959 geth[652106]: DEBUG[02-09|08:44:16.281] Dropping transaction announcement        id=fd3697620fd7be4e conn=inbound               count=6                       
...

then another block, after 6-7 sec after the previos one, accompanied by a weird piece of logs:

...
Feb 09 08:44:24 DS7959 geth[652106]: INFO [02-09|08:44:24.325] Imported new chain segment               blocks=1  txs=180  mgas=35.676  elapsed=6.090s    mgasps=5.858   number=15,100,523 hash=88da75..b40cda dirt
y=154.30MiB                                                                                                                                                                                                        
Feb 09 08:44:24 DS7959 geth[652106]: DEBUG[02-09|08:44:24.325] Importing propagated block               peer=70db7bd71bc6abada24e58d4e14f20af96156bfe6893dee00d8ee737b0ee17c4 number=15,100,524 hash=eed9e0..b27198
Feb 09 08:44:24 DS7959 geth[652106]: DEBUG[02-09|08:44:24.325] Ignoring already known block             number=15,100,523 hash=88da75..b40cda                                                                      
Feb 09 08:44:24 DS7959 geth[652106]: DEBUG[02-09|08:44:24.325] Synchronisation terminated               elapsed=5.318s                                                                                             
Feb 09 08:44:24 DS7959 geth[652106]: DEBUG[02-09|08:44:24.325] Synchronising with the network           peer=0b767b43fe0eb590aa5e32ba468f9025393263a3f7f48843be8e6f3a9d7fbddc eth=67 head=eed9e0..b27198 td=30,008,
708 mode=full                                                                                                                                                                                                      
...

what is Synchronisation terminated elapsed=5.318s ?

kawahwookiee avatar Feb 09 '22 09:02 kawahwookiee

happens every 10 minutes

kawahwookiee avatar Feb 09 '22 09:02 kawahwookiee

here is another piece, where Downloader queue message accompanies a 30sec lag:

│Feb 09 10:50:57 DS7959 geth[676510]: INFO [02-09|10:50:57.179] Imported new chain segment               blocks=1  txs=149  mgas=26.322  elapsed=73.092ms  mgasps=360.118  number=15,103,056 hash=f2ef13..1b58c3 dirty=591.94MiB                                                                                                                          ┤
│Feb 09 10:51:04 DS7959 geth[676510]: INFO [02-09|10:51:04.541] Downloader queue stats                   receiptTasks=0 blockTasks=0 itemSize=54.07KiB throttle=1213                                                                                                                                                                                      │
│Feb 09 10:51:30 DS7959 geth[676510]: INFO [02-09|10:51:30.307] Imported new chain segment               blocks=1  txs=236  mgas=38.189  elapsed=30.152s   mgasps=1.267    number=15,103,057 hash=4b0a2e..64154e dirty=592.08MiB                                                                                                                          │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.153] Imported new chain segment               blocks=1  txs=368  mgas=19.991  elapsed=6.845s    mgasps=2.921    number=15,103,058 hash=652336..165ef1 dirty=594.31MiB  ignored=1                                                                                                               │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.277] Imported new chain segment               blocks=1  txs=180  mgas=30.009  elapsed=123.050ms mgasps=243.877  number=15,103,059 hash=2b84d3..0e9f21 dirty=383.99MiB                                                                                                                          │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.375] Imported new chain segment               blocks=1  txs=181  mgas=26.155  elapsed=97.447ms  mgasps=268.396  number=15,103,060 hash=b03b59..be5540 dirty=386.91MiB                                                                                                                          │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.518] Imported new chain segment               blocks=1  txs=194  mgas=30.680  elapsed=143.279ms mgasps=214.129  number=15,103,061 hash=90feae..10ad12 dirty=389.52MiB                                                                                                                          │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.606] Imported new chain segment               blocks=1  txs=238  mgas=32.137  elapsed=87.636ms  mgasps=366.708  number=15,103,062 hash=916894..6c5da1 dirty=389.52MiB                                                                                                                          │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.680] Imported new chain segment               blocks=1  txs=111  mgas=19.710  elapsed=71.140ms  mgasps=277.052  number=15,103,063 hash=991ca7..9a557d dirty=392.29MiB                                                                                                                          │
│Feb 09 10:51:37 DS7959 geth[676510]: INFO [02-09|10:51:37.835] Imported new chain segment               blocks=1  txs=225  mgas=42.187  elapsed=154.925ms mgasps=272.303  number=15,103,064 hash=143f1a..d7e55e dirty=395.50MiB                                                                                                                          │
│Feb 09 10:51:38 DS7959 geth[676510]: INFO [02-09|10:51:38.084] Imported new chain segment               blocks=3  txs=703  mgas=74.883  elapsed=248.235ms mgasps=301.661  number=15,103,067 hash=ad32f5..a2250d dirty=399.49MiB  ignored=1                                                                                                               ┤
│Feb 09 10:51:38 DS7959 geth[676510]: INFO [02-09|10:51:38.436] Imported new chain segment               blocks=1  txs=159  mgas=28.018  elapsed=351.860ms mgasps=79.627   number=15,103,068 hash=627322..32d4ca dirty=400.88MiB                                                                                                                          ┤
│Feb 09 10:51:38 DS7959 geth[676510]: INFO [02-09|10:51:38.501] Imported new chain segment               blocks=1  txs=370  mgas=23.345  elapsed=65.155ms  mgasps=358.301  number=15,103,069 hash=b1292b..7f419b dirty=400.88MiB  ignored=1                                                                                                               │
│Feb 09 10:51:39 DS7959 geth[676510]: INFO [02-09|10:51:39.257] Imported new chain segment               blocks=1  txs=161  mgas=21.479  elapsed=72.903ms  mgasps=294.623  number=15,103,070 hash=53b79e..134d65 dirty=402.17MiB                                                                                                                          │
│Feb 09 10:51:42 DS7959 geth[676510]: INFO [02-09|10:51:42.379] Imported new chain segment               blocks=1  txs=179  mgas=28.278  elapsed=284.098ms mgasps=99.537   number=15,103,071 hash=2d0cce..73c0f9 dirty=403.82MiB                       

kawahwookiee avatar Feb 09 '22 11:02 kawahwookiee

@KeefeL I have 2 identical (hardware-wise) nodes in the US (1.1.8) and in Germany (1.1.7). German node did not have any isuues up until about a week ago, but now they both experiance these lags

kawahwookiee avatar Feb 09 '22 11:02 kawahwookiee

I have had this problem ever since first running a node back in March 2021. It seems to happen far less frequently if I run fewer nodes, but even then it can still happen. The main smoking gun here is that quite often, after a long lag period of getting no blocks, I'll get a "failed to unregister peer" error message, after which I get all the backlogged blocks at once. It almost seems that the client relies on a single peer for block updates (even if it's connected to several), and if that peer drops, then instead of just sourcing the blocks from the multitude of other peers, it just sits there in limbo for several minutes until that peer finally formally times out and it looks to others for block updates.

ktrout1111 avatar Feb 11 '22 12:02 ktrout1111

@KeefeL any comments on the above?

kawahwookiee avatar Feb 28 '22 12:02 kawahwookiee

@KeefeL I am still having this issue. re-syncing from the latest snapshot had no effect

kawahwookiee avatar Mar 10 '22 09:03 kawahwookiee

Having the same issue as @kawahwookiee with similar log output. 24 Core 256GB mem 25Gbps network NVMe

Is there any update on this @KeefeL ?

jacobpake avatar Apr 14 '22 15:04 jacobpake