
State heal phase is very slow (not finished after 2 weeks)

Open xrchz opened this issue 4 years ago • 48 comments

System information

Geth version: geth version

Geth
Version: 1.10.4-stable
Git Commit: aa637fd38a379db6da98df0d520fb1c5139a18ce
Git Commit Date: 20210617
Architecture: amd64
Go Version: go1.16.5
Operating System: linux
GOPATH=
GOROOT=go

OS & Version: Linux

Commit hash : (if develop)

Expected behaviour

geth finishes (default=snap) sync in < 1 day

Actual behaviour

geth is still in 'State heal in progress' after more than 2 weeks

Steps to reproduce the behaviour

Run geth

Logs

https://gist.github.com/xrchz/851eca7ecfb0312422875c90b9a86b2b

xrchz avatar Jul 10 '21 20:07 xrchz

I've been looking through the logs a bit. I wonder if your IO speed is bad?

Also, there's a lot of this:

WARN [07-03|18:00:40.413] Unexpected trienode heal packet          peer=9f0e8fbf         reqid=6,915,308,639,612,522,441

It indicates that requests time out. And there's a lot of it. No idea why, but @karalabe added a mechanism to scale the request size depending on the performance, not sure if that is in 1.10.4-stable or not. In a rush, can't check right now
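If you want to quantify it on your end, a quick sketch (assuming your logs go to a file, here geth.log; the path is just a placeholder):

# count the timeout warnings
grep -c 'Unexpected trienode heal packet' geth.log

# see which peers they come from
grep 'Unexpected trienode heal packet' geth.log | awk '{print $7}' | sort | uniq -c | sort -rn | head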

holiman avatar Jul 16 '21 11:07 holiman

I don't think the IO speed is bad. Here is an indication (taken while geth is also still running):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --directory=<redacted> --aux-path=<redacted>
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=136MiB/s,w=45.2MiB/s][r=34.9k,w=11.6k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=450154: Fri Jul 16 12:37:52 2021
  read: IOPS=37.9k, BW=148MiB/s (155MB/s)(3070MiB/20719msec)
   bw (  KiB/s): min=90808, max=196112, per=100.00%, avg=151902.44, stdev=24036.61, samples=41
   iops        : min=22702, max=49028, avg=37975.61, stdev=6009.16, samples=41
  write: IOPS=12.7k, BW=49.5MiB/s (51.9MB/s)(1026MiB/20719msec); 0 zone resets
   bw (  KiB/s): min=30840, max=66952, per=100.00%, avg=50762.73, stdev=8052.88, samples=41
   iops        : min= 7710, max=16738, avg=12690.68, stdev=2013.22, samples=41
  cpu          : usr=8.53%, sys=45.38%, ctx=105093, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=148MiB/s (155MB/s), 148MiB/s-148MiB/s (155MB/s-155MB/s), io=3070MiB (3219MB), run=20719-20719msec
  WRITE: bw=49.5MiB/s (51.9MB/s), 49.5MiB/s-49.5MiB/s (51.9MB/s-51.9MB/s), io=1026MiB (1076MB), run=20719-20719msec

Disk stats (read/write):
  sda: ios=783034/261889, merge=32/32, ticks=870354/151608, in_queue=1021965, util=99.63%

xrchz avatar Jul 16 '21 12:07 xrchz

What is the machine specs?

begetan avatar Jul 23 '21 11:07 begetan

8GB RAM, 4x 2.2GHz vCPUs (AMD EPYC 7601)

xrchz avatar Jul 23 '21 22:07 xrchz

I have the same issue: 2+ weeks in and still stuck in "State heal in progress". Geth version:

Geth
Version: 1.10.6-stable
Architecture: amd64
Go Version: go1.16.6
Operating System: darwin
GOPATH=
GOROOT=go

log snippet:

WARN [07-28|10:29:46.515] Pivot became stale, moving               old=12,914,818 new=12,914,882
INFO [07-28|10:29:50.427] Imported new block headers               count=1    elapsed=253.434ms number=12,914,945 hash=ee1a08..9ce38a
INFO [07-28|10:30:00.224] Imported new block receipts              count=64   elapsed=13.703s   number=12,914,881 hash=fef964..d789fc age=18m5s     size=7.69MiB
INFO [07-28|10:30:18.658] Imported new block headers               count=1    elapsed=46.715ms  number=12,914,946 hash=7b24c8..2d8006
INFO [07-28|10:30:21.965] State heal in progress                   accounts=169,…@… slots=…@… codes=…@… nodes=43,293,…@… pending=112,626
INFO [07-28|10:30:21.969] State heal in progress                   accounts=169,…@… slots=…@… codes=…@… nodes=43,293,…@… pending=112,626
WARN [07-28|10:30:22.014] Unexpected trienode heal packet          peer=cef48f42 reqid=6,500,665,912,563,462,364
WARN [07-28|10:30:22.014] Unexpected trienode heal packet          peer=df5f16fe reqid=2,588,354,188,352,030,655
WARN [07-28|10:30:22.081] Unexpected trienode heal packet          peer=07527e24 reqid=2,299,545,781,862,247,020
WARN [07-28|10:30:22.097] Unexpected trienode heal packet          peer=06fbcb3d reqid=4,616,793,971,286,455,283
WARN [07-28|10:30:22.117] Unexpected trienode heal packet          peer=f26bae89 reqid=5,492,659,643,994,326,497
WARN [07-28|10:30:22.150] Unexpected trienode heal packet          peer=8fe0dc37 reqid=4,841,696,059,284,633,551
WARN [07-28|10:30:22.166] Unexpected trienode heal packet          peer=d8c491e0 reqid=5,089,512,569,912,806,296
WARN [07-28|10:30:22.214] Unexpected trienode heal packet          peer=ab94eac9 reqid=289,535,681,122,254,882
INFO [07-28|10:30:29.698] Imported new block headers               count=1    elapsed=50.129ms  number=12,914,947 hash=502883..e4a4cf
INFO [07-28|10:30:40.282] Imported new block headers               count=1    elapsed=68.952ms  number=12,914,948 hash=ca7379..feb0e4
INFO [07-28|10:30:42.053] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=246.41KiB throttle=1064
INFO [07-28|10:30:51.755] Imported new block headers               count=1    elapsed=455.293ms number=12,914,949 hash=69f8b7..c0be19
INFO [07-28|10:31:23.127] Imported new block headers               count=1    elapsed=36.240ms  number=12,914,950 hash=1c41d1..32309f
INFO [07-28|10:31:51.161] Imported new block headers               count=2    elapsed=303.134ms number=12,914,952 hash=05fedd..bd977f age=1m4s
INFO [07-28|10:31:53.863] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=251.06KiB throttle=1045
INFO [07-28|10:31:58.571] Imported new block headers               count=1    elapsed=50.721ms  number=12,914,953 hash=ade2c1..0a5bc0
INFO [07-28|10:32:06.890] Imported new block headers               count=1    elapsed=635.477ms number=12,914,954 hash=804d5c..ab7b25
INFO [07-28|10:32:10.395] Imported new block headers               count=2    elapsed=488.349ms number=12,914,956 hash=1e8a8f..3e8b40
INFO [07-28|10:32:48.525] Imported new block headers               count=1    elapsed=80.333ms  number=12,914,957 hash=e45180..9b6220
INFO [07-28|10:32:59.615] Imported new block headers               count=1    elapsed=47.949ms  number=12,914,958 hash=d709af..d8cf88
INFO [07-28|10:33:00.748] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=236.64KiB throttle=1108
INFO [07-28|10:33:07.879] State heal in progress                   accounts=169,…@… slots=…@… codes=…@… nodes=43,293,…@… pending=4369
INFO [07-28|10:33:14.157] Imported new block headers               count=1    elapsed=66.366ms  number=12,914,959 hash=b48f74..bf44f3 age=1m6s
INFO [07-28|10:33:17.898] Imported new block headers               count=1    elapsed=34.263ms  number=12,914,960 hash=0bd389..15919a
INFO [07-28|10:33:22.825] State heal in progress                   accounts=169,…@… slots=…@… codes=…@… nodes=43,293,…@… pending=5004
INFO [07-28|10:33:28.418] Imported new block headers               count=1    elapsed=36.038ms  number=12,914,961 hash=71baf1..463032
INFO [07-28|10:33:35.198] State heal in progress                   accounts=169,…@… slots=…@… codes=…@… nodes=43,293,…@… pending=5484
INFO [07-28|10:33:36.183] Imported new block headers               count=1    elapsed=53.880ms  number=12,914,962 hash=d3a956..4d2c35
INFO [07-28|10:34:04.801] Imported new block headers               count=1    elapsed=49.827ms  number=12,914,963 hash=3a9efb..280236
INFO [07-28|10:34:05.954] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=249.31KiB throttle=1052
INFO [07-28|10:34:11.899] Imported new block headers               count=1    elapsed=51.650ms  number=12,914,964 hash=b03469..eed621

Any clue how much longer, based on the logs? Total number of nodes, etc.?

bFanek avatar Jul 28 '21 14:07 bFanek

The same issue here. IO is more than enough. Geth version:

Geth
Version: 1.10.6-stable
Git Commit: 576681f29b895dd39e559b7ba17fcd89b42e4833
Git Commit Date: 20210722
Architecture: amd64
Go Version: go1.16.4
Operating System: linux
GOPATH=
GOROOT=go

machine

CPU: AMD EPYC 7R32 @ 8x 3.224GHz
RAM: 3485MiB / 15853MiB

eth_syncing

{
  "jsonrpc": "2.0",
  "id": "curltest",
  "result": {
    "currentBlock": "0xc5a113",
    "highestBlock": "0xc5a15d",
    "knownStates": "0x2b8c155a",
    "pulledStates": "0x2b8c155a",
    "startingBlock": "0xc595dd"
  }
}

The difference between currentBlock and highestBlock is always about 50~150.
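For reference, I'm polling it with a plain JSON-RPC call against the default HTTP endpoint (adjust host/port if yours differ):

curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":"curltest"}' \
  http://127.0.0.1:8545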

kraymond37 avatar Aug 03 '21 11:08 kraymond37

@bFanek Your node is downloading 2 trie nodes per second (1 trie node is 500 bytes). On a different log, it says writing 7MB to the database took 14 seconds. Something's very off there. What kind of machine are you running Geth on?

karalabe avatar Aug 03 '21 12:08 karalabe

@kraymond37 You'll need to share some logs. How fast are trie nodes being downloaded?

karalabe avatar Aug 03 '21 12:08 karalabe

@xrchz The logs you shared show the initial snap sync, and even at that, only the beginning. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared.

karalabe avatar Aug 03 '21 12:08 karalabe

@xrchz The logs you shared show the initial snap sync, and even at that, only the beginning. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared.

@karalabe Gist automatically truncates large files. You need to click "view the full file" at the top of the gist page to see it all.

xrchz avatar Aug 03 '21 13:08 xrchz

Oh, doh. Apologies.


karalabe avatar Aug 03 '21 13:08 karalabe

@karalabe Finally finished. But I don't know why it is so slow. Here is the log.

kraymond37 avatar Aug 04 '21 01:08 kraymond37

I've got the same issue. Syncing mainnet and rinkeby, both nodes have been running state heal for 2-3 weeks now.

j4ys0n avatar Aug 09 '21 16:08 j4ys0n

Is there a resolution to this?

j4ys0n avatar Aug 12 '21 18:08 j4ys0n

Similar problem here ...

I'm ~8 days in now ... the state sync took ~2 days (including some downtimes while I was setting up the rest of the server). The other 6 days it's been state healing.

RaspberryPi ARM-64 (8 GiB RAM, overclocked @ 4 x 2,100 MHz), Disk: SSD (200+ MiB/s), Net: 2 Gbit/s

Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing:

WARN [08-22|00:03:07.286] Unexpected trienode heal packet          peer=XYZ reqid=XYZ

The Geth folder currently utilizes 400 GiB (no archive).

xanoni avatar Aug 24 '21 04:08 xanoni

FYI, I posted more detail in the other ticket: https://github.com/ethereum/go-ethereum/issues/23228

EDIT:

I switched to fast sync (maintaining the already downloaded data), but it was still not done after 2-3 days. Thus, I rsync'd the whole geth folder to an aarch64 machine and completed the snap sync there, which took 1-2 days. Then I rsync'd everything back and it's running smoothly on the Pi now.

Not ideal, but worked for me.
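In case it helps anyone, the transfer itself was essentially just this (a rough sketch; the host and paths are placeholders, and geth must be stopped on both machines before copying):

# copy the whole datadir to the faster machine
rsync -a --progress ~/.ethereum/ user@fast-machine:~/.ethereum/
# finish the sync there, stop geth again, then copy it back
rsync -a --progress user@fast-machine:~/.ethereum/ ~/.ethereum/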

xanoni avatar Aug 24 '21 05:08 xanoni

@xrchz Have you solved this problem? I'm facing the same problem. Can you tell me what happened in the following days? Thanks! (Or can I switch to fast mode without deleting the data I have already downloaded?)

lrc123456 avatar Sep 14 '21 03:09 lrc123456

No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead.

xrchz avatar Sep 14 '21 05:09 xrchz

No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead.

sad news :(
I don't know much about full node service providers. Which one did you choose? My goal is to get the full node data of Ethereum by syncing with geth, and then use my modified geth client for scientific tests. Do you think a full node service provider would work for me? Can I get the same "datadir" folder using it? Thanks again!!

lrc123456 avatar Sep 14 '21 13:09 lrc123456

Try the other sync options; maybe one works better.

xanoni avatar Sep 14 '21 18:09 xanoni

Try the other sync options; maybe one works better.

Can I just interrupt the snap sync with Ctrl-C and use fast sync to continue with the same data folder? (I don't want to lose what I've got from the last week.)

lrc123456 avatar Sep 15 '21 01:09 lrc123456

Try the other sync options; maybe one works better.

Can I just interrupt the snap sync with Ctrl-C and use fast sync to continue with the same data folder? (I don't want to lose what I've got from the last week.)

When I did this recently, it re-used the already downloaded chaindata (~470 GiB). Not sure about the rest.
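The restart itself was roughly this (a sketch; the datadir path is a placeholder, and it only applies to geth versions that still support fast sync):

# stop the running geth first (Ctrl-C), then restart against the same datadir
geth --datadir /path/to/datadir --syncmode fast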

xanoni avatar Sep 15 '21 03:09 xanoni

Try the other sync options; maybe one works better.

Can I just interrupt the snap sync with Ctrl-C and use fast sync to continue with the same data folder? (I don't want to lose what I've got from the last week.)

When I did this recently, it re-used the already downloaded chaindata (~470 GiB). Not sure about the rest.

Thanks! BTW, I noticed the resync didn't work for you, right? 😂

lrc123456 avatar Sep 15 '21 03:09 lrc123456

Try the other sync options; maybe one works better.

Can I just interrupt the snap sync with Ctrl-C and use fast sync to continue with the same data folder? (I don't want to lose what I've got from the last week.)

When I did this recently, it re-used the already downloaded chaindata (~470 GiB). Not sure about the rest.

Thanks! BTW, I noticed the resync didn't work for you, right? 😂

I didn't have time to test a lot of things, so I just moved the downloaded chaindata to an x86-64 machine and did a snap sync there ... which worked. Then I transferred it back to the ARM and it's been running smoothly since.

xanoni avatar Sep 16 '21 02:09 xanoni

How can I increase the throttle?

Hailiebaby16 avatar Sep 29 '21 09:09 Hailiebaby16

Hello everyone. When I faced this issue, I had started my geth with an inaccessible ancient store because of faulty permissions.

Snap sync was used by default. While it was running I made another mistake and my system partition ran out of disk space. After that got solved, the 4 GB of RAM was exhausted, the swap failed, and the system halted.

So we can expect a corrupted database at this point?

After this got fixed I observed the behaviour described, a lot of "Unexpected trienode heal packet".

My port forwarding was also missing, but fixing that did not seem to resolve it.

IPv6 was enabled; I disabled it and started with "--syncmode fast".

The "Unexpected trienode heal packet" messages disappeared from the logs and it's syncing.

System: AMD Ryzen in a VirtualBox VM attached to a USB 3.0 NVMe drive, running Ubuntu Server 20 (4 vCPU, 6 GB RAM, 6 GB swap)

ChuckNorrison avatar Nov 21 '21 08:11 ChuckNorrison

Hi.

An 8 GB Raspberry Pi 4 is able to complete a snap sync in less than 2 days. Please see this:

https://twitter.com/EthereumOnARM/status/1463796036103770114

And check these points:

https://twitter.com/EthereumOnARM/status/1463828024080707585

diglos avatar Nov 25 '21 11:11 diglos

Hi, I've also been trying to snap sync a geth node in AWS and have also been stuck in state healing for a few weeks. I've synced nodes locally before within a few days, so this was surprising to me. I'm using NVMe storage on the AWS node, so I would expect it has the IOPS to sync.

A log snippet:

INFO [03-08|21:10:17.620] State heal in progress                   accounts=1,549,…@… slots=4,125,…@… codes=…@… nodes=163,743,…@… pending=85071
WARN [03-08|21:10:17.622] Unexpected trienode heal packet          peer=73755973 reqid=4,134,898,869,596,324,111
WARN [03-08|21:10:17.633] Unexpected trienode heal packet          peer=25f05f19 reqid=7,011,098,481,708,786,349
WARN [03-08|21:10:17.671] Unexpected trienode heal packet          peer=9c8ce4ee reqid=7,218,003,286,183,975,893
WARN [03-08|21:10:17.674] Unexpected trienode heal packet          peer=bcddcc5d reqid=1,911,971,051,676,565,964
INFO [03-08|21:10:22.508] Imported new block headers               count=1    elapsed=8.076ms   number=14,348,482 hash=de4c70..134896
INFO [03-08|21:10:31.658] Imported new block headers               count=1    elapsed=7.743ms   number=14,348,483 hash=809a3e..684cf6
INFO [03-08|21:10:33.379] State heal in progress                   accounts=1,549,…@… slots=4,125,…@… codes=…@… nodes=163,743,…@… pending=85505
INFO [03-08|21:10:47.765] State heal in progress                   accounts=1,550,…@… slots=4,125,…@… codes=…@… nodes=163,743,…@… pending=85906

I have seen that others solved this by switching to fast sync mode, but that is no longer an option on the geth version I'm using (1.10.16-stable).

0xadnauseam avatar Mar 08 '22 21:03 0xadnauseam

Geth on my Raspberry Pi 4 8 GB has also continuously been reporting "State heal in progress" and frequent "Unexpected trienode heal packet" log messages for a few weeks after I deleted the Geth database and restarted the sync from the beginning. What do these messages mean? Is my Geth node still syncing? Should I bother to continue running my Geth node until this issue is resolved?

derekmahar avatar Mar 14 '22 16:03 derekmahar

Had the same and I decided to sync on a different (faster) computer instead ... I then rsync'd the blockchain data to the RPi after the sync was complete. Not an ideal solution, but it worked. Might have mentioned this before somewhere in the tickets here, sorry if that's the case.

xanoni avatar Mar 16 '22 02:03 xanoni

I'm also struggling with this problem, and I'm also running Geth on a Raspberry PI 4 and an external SSD.

As I just found out, my SSD has quite low IOPS (I'm using a WD Blue 1 TB). I'm following these instructions to test it: https://docs.rocketpool.net/guides/node/local/prepare-pi.html#testing-the-ssd-s-performance

Not sure what the cause is. Maybe it's outdated firmware or a bad SATA-to-USB adapter. Anyway, it's not a Geth bug in my case.
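For anyone else wanting to check their disk: the test is essentially the same fio invocation posted earlier in this thread, run against the SSD's mount point (the path is a placeholder):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test --filename=/mnt/ssd/test --bs=4k --iodepth=64 \
    --size=4G --readwrite=randrw --rwmixread=75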

Jeiwan avatar Mar 25 '22 13:03 Jeiwan

I'm also struggling with this problem, and I'm also running Geth on a Raspberry PI 4 and an external SSD.

As I just found out, my SSD has quite low IOPS (I'm using a WD Blue 1 TB). I'm following these instructions to test it: https://docs.rocketpool.net/guides/node/local/prepare-pi.html#testing-the-ssd-s-performance

I'm also using a WD Blue 1 TB with my Raspberry Pi 4 8 GB and its IO performance (read: IOPS=2284, BW=9139KiB/s; write: IOPS=763, BW=3054KiB/s) is much lower than what the Rocketpool guide recommends. However, Geth was running without errors or warnings on this hardware until a few point versions prior to 1.10.16. Why would it encounter these problems now?

derekmahar avatar Mar 25 '22 19:03 derekmahar

My node has finally finished syncing! It took about 5 days, with the state healing phase taking about half of that.

My setup:

  1. Raspberry Pi 4 Model B Rev 1.5, 8 GB.
  2. WD Blue 1 TB connected via a cheap SATA-to-USB adapter. IOPS and bandwidth seem to be low (geth was not running during the test):
test: (groupid=0, jobs=1): err= 0: pid=1820: Fri Mar 25 13:19:16 2022
  read: IOPS=3043, BW=11.9MiB/s (12.5MB/s)(3070MiB/258259msec)
   bw (  KiB/s): min= 4728, max=14672, per=99.99%, avg=12170.61, stdev=1868.41, samples=516
   iops        : min= 1182, max= 3668, avg=3042.62, stdev=467.11, samples=516
  write: IOPS=1017, BW=4068KiB/s (4166kB/s)(1026MiB/258259msec); 0 zone resets
   bw (  KiB/s): min= 1544, max= 5104, per=99.99%, avg=4067.59, stdev=651.52, samples=516
   iops        : min=  386, max= 1276, avg=1016.85, stdev=162.86, samples=516
  cpu          : usr=3.86%, sys=24.14%, ctx=1048480, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=11.9MiB/s (12.5MB/s), 11.9MiB/s-11.9MiB/s (12.5MB/s-12.5MB/s), io=3070MiB (3219MB), run=258259-258259msec
  WRITE: bw=4068KiB/s (4166kB/s), 4068KiB/s-4068KiB/s (4166kB/s-4166kB/s), io=1026MiB (1076MB), run=258259-258259msec

I'd say that the first phase (block downloading) is limited by the network. I didn't have many peers, so it took about 2.5 days to finish. Incoming traffic was between 2 and 4 Mb/s. The state healing phase seems to be limited by CPU and IOPS. My RPI is not overclocked (some recommend overclocking it), and average read IOPS was around 1.3K during the phase (underperforming? maybe because of the CPU?). Read bandwidth was around 11 MiB/s, which matches the number from the test.
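For completeness, the overclock people usually suggest goes in /boot/config.txt on the Pi (example values only, not a recommendation; make sure your cooling can handle it):

# /boot/config.txt on a Raspberry Pi 4
over_voltage=6
arm_freq=2000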

I posted some statistics to my Twitter: https://twitter.com/jeiwan7/status/1507565655079686145

These were last "State heal in progress" lines in logs (the node was restarted a couple of times):

         accounts=470,…@… slots=883,…@… codes=…@… nodes=40,140,…@… pending=10152
         accounts=470,…@… slots=884,…@… codes=…@… nodes=40,141,…@… pending=8943
         accounts=470,…@… slots=884,…@… codes=…@… nodes=40,143,…@… pending=7697
         accounts=470,…@… slots=885,…@… codes=…@… nodes=40,145,…@… pending=6533
         accounts=470,…@… slots=886,…@… codes=…@… nodes=40,148,…@… pending=4126
         accounts=470,…@… slots=886,…@… codes=…@… nodes=40,149,…@… pending=2891
         accounts=470,…@… slots=887,…@… codes=…@… nodes=40,151,…@… pending=149
         accounts=470,…@… slots=887,…@… codes=…@… nodes=40,151,717@… pending=0

Seems that pending needs to get down to 0 for the phase to be finished.
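If you want to track it yourself, a one-liner like this pulls the pending values out of the log (assuming geth logs to geth.log):

grep -o 'pending=[0-9,]*' geth.log | tail -n 20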


Geth
Version: 1.10.16-stable
Git Commit: 20356e57b119b4e70ce47665a71964434e15200d
Git Commit Date: 20220215
Architecture: arm64
Go Version: go1.17.5
Operating System: linux
GOPATH=
GOROOT=go

Jeiwan avatar Mar 26 '22 04:03 Jeiwan

Hi, I'm experiencing the same problem now. I had an unclean shutdown three days ago, the node was offline for two hours, and it seems that it has not yet recovered.

  • Still State heal in progress
  • Lots of Unexpected trienode heal packet
  • pending seems to go up every time

I have been running geth on this machine for more than a year. SSD stats:

   READ: bw=112MiB/s (118MB/s), 112MiB/s-112MiB/s (118MB/s-118MB/s), io=3070MiB (3219MB), run=27321-27321msec
  WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1026MiB (1076MB), run=27321-27321msec

geth version 1.10.16-stable

Is there something I should do to speed it up? Maybe increase bandwidth size?

I'm running geth with default values for everything:

geth --mainnet --datadir ... --http --http.api eth,net,web3

newearthmartin avatar Jul 23 '22 13:07 newearthmartin

You need v1.10.19 or later in order to sync. 

holiman avatar Jul 23 '22 15:07 holiman

Thanks, I upgraded the geth version but the problem still remains.

newearthmartin avatar Jul 23 '22 19:07 newearthmartin

Thanks, I upgraded the geth version but the problem still remains.

@newearthmartin run the iostat -d -x -m 10 command on your Linux box. The column r/s (read requests per second) should be about 1500-2000, and the latency (column r_await, which iostat reports in milliseconds) should be about 80 microseconds, i.e. 0.08. If your disk does worse than that, then you are going to be state-healing for weeks.
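For example (a sketch; replace sda with whatever device holds your geth datadir):

# extended stats for the geth disk, refreshed every 10 seconds;
# watch the r/s and r_await columns
iostat -d -x -m 10 sda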

nuliknol avatar Jul 27 '22 16:07 nuliknol

Thanks, I upgraded the geth version but the problem still remains.

@newearthmartin if it happens that you have a slow SSD, you can accelerate it by buying a smaller but faster SSD and configuring it as a caching device for the slow SSD via the bcache kernel module.
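A rough sketch of that setup (device names are placeholders; make-bcache writes new superblocks and destroys existing data, so back everything up first):

# slow SSD as the backing device, fast SSD as the cache
sudo make-bcache -B /dev/sdb
sudo make-bcache -C /dev/nvme0n1
# attach the cache set to the backing device (UUID comes from the make-bcache output)
echo <cache-set-uuid> | sudo tee /sys/block/bcache0/bcache/attach
# then put the geth datadir filesystem on /dev/bcache0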

nuliknol avatar Jul 27 '22 16:07 nuliknol

@nuliknol Here are some values from iostat while running geth:

r/s         r_await
1980.40     0.32
1953.30     0.33
1732.00     0.32
1654.20     0.35
2095.00     0.33

Does this mean r/s is ok but latency is not? Any way that I can improve this? I'm using a 2TB WD Blue on a 2016 8GB Asus Zenbook laptop running Ubuntu 22.04.

What I don't understand is: I've been running a full node for a year, so why won't geth recover from this? You say weeks... when I originally downloaded the blockchain, it took less than that...

Also, should I be looking into Erigon for this hardware? :) I really would like to stay with Geth.

newearthmartin avatar Jul 27 '22 17:07 newearthmartin

Is there a way to see the progress of the state heal phase and how much more it has to go? Also, if I restart geth, do I lose the progress?

newearthmartin avatar Jul 27 '22 21:07 newearthmartin

Hello, I'm experiencing the same issue.

I ran the Rocketpool test and the results are awful:

test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][99.9%][r=276KiB/s,w=88KiB/s][r=69,w=22 IOPS][eta 00m:05s]
test: (groupid=0, jobs=1): err= 0: pid=384253: Mon Aug 1 13:14:52 2022
  read: IOPS=210, BW=842KiB/s (863kB/s)(3070MiB/3731597msec)
   bw (  KiB/s): min=8, max=81176, per=100.00%, avg=879.52, stdev=3872.75, samples=7137
   iops        : min=2, max=20294, avg=219.84, stdev=968.19, samples=7137
  write: IOPS=70, BW=282KiB/s (288kB/s)(1026MiB/3731597msec); 0 zone resets
   bw (  KiB/s): min=8, max=27040, per=100.00%, avg=294.19, stdev=1298.58, samples=7130
   iops        : min=2, max=6760, avg=73.50, stdev=324.65, samples=7130
  cpu          : usr=0.12%, sys=0.43%, ctx=76826, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=842KiB/s (863kB/s), 842KiB/s-842KiB/s (863kB/s-863kB/s), io=3070MiB (3219MB), run=3731597-3731597msec
  WRITE: bw=282KiB/s (288kB/s), 282KiB/s-282KiB/s (288kB/s-288kB/s), io=1026MiB (1076MB), run=3731597-3731597msec

Disk stats (read/write):
  sda: ios=819211/295271, merge=6718/35856, ticks=151863026/55438750, in_queue=208094988, util=97.34%

This strikes me as a bit strange since I had previously synced geth successfully on this drive.

These are the specs for the SSD:

Device Model: HP SSD S650 1920GB
Serial Number: HBSA51350100117
Firmware Version: T1030A0
User Capacity: 1 920 383 410 176 bytes [1.92 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Aug 1 13:52:08 2022 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

I don't know how to proceed. Maybe there's some erroneous configuration on my part. I'm running geth 1.10.20.

6ilcarlos avatar Aug 01 '22 18:08 6ilcarlos

Hello @Jeiwan !

Thanks for your help. I was happy to see that you could sync using a Raspberry Pi. I have much more powerful equipment and was about to give up.

nodes=40,151,717@…

For me, your message showing the total number of nodes when you finished syncing was the most useful thing here. All I wanted to know is whether my sync was progressing or not.

Can you update us with the total number of nodes today?

I am still syncing. I hope you're not much further than 40,151,717. I am at about 8 million nodes now.

Thanks. :)

machinasapiens avatar Aug 13 '22 11:08 machinasapiens

Hi @machinasapiens!

I don't have access to my RPI at the moment so I cannot check that. I had multiple full re-syncs on that same RPI and SSD and each of them went fine.

As to the number of nodes, I think it gets reset after each restart (but progress is not lost).

As far as I understand the synchronization process, there are two stages: during the first stage, it downloads block headers; during the second stage, it reconstructs the account state. The length of the latter cannot be estimated because the account state is a Patricia trie (which keeps getting updated while your node is syncing). So you just need to wait 🤷‍♂️ If it keeps utilizing CPU and memory, everything's fine.
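If you want reassurance that it's still moving, you can also attach to the node and poll the sync status (default IPC path assumed; adjust for your datadir):

geth attach ~/.ethereum/geth.ipc
> eth.syncing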

Jeiwan avatar Aug 13 '22 18:08 Jeiwan

Hi @machinasapiens!

I don't have access to my RPI at the moment so I cannot check that. I had multiple full re-syncs on that same RPI and SSD and each of them went fine.

As to the number of nodes, I think it gets reset after each restart (but progress is not lost).

As far as I understand the synchronization process, there are two stages: during the first stage, it downloads block headers; during the second stage, it reconstructs the account state. The length of the latter cannot be estimated because the account state is a Patricia trie (which keeps getting updated while your node is syncing). So you just need to wait 🤷‍♂️ If it keeps utilizing CPU and memory, everything's fine.

Excellent information. Thanks a lot. 🙌

machinasapiens avatar Aug 13 '22 18:08 machinasapiens