go-ethereum
FYI - Sync mode SNAP does not work
I run several geth nodes, but I've always used sync mode fast. I am building a new machine, so I figured I would try snap (the new default). This is running on an HP DL380 G5 with 24 GB of RAM and a 1TB SSD.
Using snap sync, geth goes into these modes of compacting the database, which completely halts the sync process, sometimes for over an hour. But I let it continue. Now, this morning, my sync appears to be stuck in "state heal in progress", and it looks like it's been there for hours. I give up; I am going back to fast.
Compacting means geth is writing data faster than leveldb can handle, so leveldb pushes back a bit. It's not a fault; it's the fact that snap sync delivers data a lot faster than 'fast' sync does.
I am having the same issue: ever since the latest update, I cannot sync my node completely.

Should I go back to fast sync, or wipe the db and try again? Is there anything I can do to help gather more info so the geth team can understand why?
The nodes-count is increasing -- it's progressing.
Is the "Unexpected trienode heal packet" warning normal? It is about 90% of my logs at this time. If it is normal, should we make it an "INFO" instead?
No, that's not an expected error (not that many of them)
This is my second time doing a fresh sync, and I am getting many "Unexpected trienode heal packet" warnings. Is there anything incorrect in my setup that would lead to this?
WARN [07-18|16:45:26.500] Unexpected trienode heal packet peer=a0378d3b reqid=4,622,035,716,730,946,788
WARN [07-18|16:45:26.565] Unexpected trienode heal packet peer=d9d3b409 reqid=5,186,876,070,523,757,744
WARN [07-18|16:45:26.784] Unexpected trienode heal packet peer=5532d114 reqid=4,913,062,635,039,342,268
WARN [07-18|16:45:27.002] Unexpected trienode heal packet peer=765491aa reqid=8,537,381,729,530,772,935
WARN [07-18|16:45:27.134] Unexpected trienode heal packet peer=17935a2e reqid=398,999,258,026,622,278
WARN [07-18|16:45:27.285] Unexpected trienode heal packet peer=fb395765 reqid=2,540,630,659,186,956,989
WARN [07-18|16:45:27.394] Unexpected trienode heal packet peer=6d63f63c reqid=3,060,663,354,523,276,049
WARN [07-18|16:45:29.757] Unexpected trienode heal packet peer=a8523f94 reqid=9,026,963,612,786,922,325
WARN [07-18|16:45:29.859] Unexpected trienode heal packet peer=889cc785 reqid=1,566,881,240,675,074,312
WARN [07-18|16:45:30.026] Unexpected trienode heal packet peer=a6d04daf reqid=2,158,928,981,022,095,841
WARN [07-18|16:45:30.152] Unexpected trienode heal packet peer=9e9cac22 reqid=3,592,232,586,991,666,835
WARN [07-18|16:45:31.019] Unexpected trienode heal packet peer=5d8589a4 reqid=393,438,786,560,958,185
WARN [07-18|16:45:31.181] Unexpected trienode heal packet peer=91bbf60e reqid=4,523,948,219,739,688,930
WARN [07-18|16:45:31.372] Unexpected trienode heal packet peer=d775bde9 reqid=4,708,330,904,872,538,623
WARN [07-18|16:45:32.379] Unexpected trienode heal packet peer=c4532ff9 reqid=6,874,370,675,488,852,123
WARN [07-18|16:45:32.380] Unexpected trienode heal packet peer=4fa42e20 reqid=3,267,196,924,822,701,039
WARN [07-18|16:45:32.542] Unexpected trienode heal packet peer=f1b13a02 reqid=884,270,897,609,630,641
INFO [07-18|16:45:43.977] State heal in progress [email protected] [email protected] [email protected] nodes=397,[email protected] pending=123,495
WARN [07-18|16:45:43.978] Unexpected trienode heal packet peer=5d664149 reqid=3,710,167,488,769,797,833
WARN [07-18|16:45:43.981] Unexpected trienode heal packet peer=e3dbfe41 reqid=2,073,374,780,961,819,677
WARN [07-18|16:45:43.983] Unexpected trienode heal packet peer=bb71328b reqid=8,121,458,505,920,535,863
WARN [07-18|16:45:43.985] Unexpected trienode heal packet peer=a1ea248b reqid=267,783,068,321,922,388
WARN [07-18|16:45:43.991] Unexpected trienode heal packet peer=a5201f7c reqid=1,439,199,076,782,710,385
WARN [07-18|16:45:44.056] Unexpected trienode heal packet peer=61551118 reqid=3,545,386,287,666,985,539
WARN [07-18|16:45:44.056] Unexpected trienode heal packet peer=85c41b4d reqid=2,200,301,475,214,792,901
WARN [07-18|16:45:44.057] Unexpected trienode heal packet peer=cb4a957a reqid=151,712,146,799,871,033
WARN [07-18|16:45:44.062] Unexpected trienode heal packet peer=8320ab75 reqid=7,465,864,124,670,898,925
WARN [07-18|16:45:44.064] Unexpected trienode heal packet peer=3319911a reqid=1,676,101,697,579,000,381
WARN [07-18|16:45:44.065] Unexpected trienode heal packet peer=bc0a22b2 reqid=7,641,458,910,862,074,927
WARN [07-18|16:45:44.125] Unexpected trienode heal packet peer=3fe6f503 reqid=7,863,934,179,564,415,364
WARN [07-18|16:45:44.129] Unexpected trienode heal packet peer=22350643 reqid=4,517,462,049,592,695,959
WARN [07-18|16:45:44.130] Unexpected trienode heal packet peer=6db8fffa reqid=3,966,230,954,419,389,835
WARN [07-18|16:45:44.452] Unexpected trienode heal packet peer=b6dc91f8 reqid=1,988,555,966,599,462,679
WARN [07-18|16:45:44.502] Unexpected trienode heal packet peer=dd94e72d reqid=4,450,099,697,928,108,082
WARN [07-18|16:45:44.504] Unexpected trienode heal packet peer=5b858358 reqid=8,925,946,666,813,154,737
WARN [07-18|16:45:44.509] Unexpected trienode heal packet peer=23a85aa5 reqid=3,080,196,199,118,744,386
WARN [07-18|16:45:44.511] Unexpected trienode heal packet peer=eba98804 reqid=8,349,544,363,420,424,856
WARN [07-18|16:45:44.513] Unexpected trienode heal packet peer=ea43a6fe reqid=1,184,655,664,638,635,660
WARN [07-18|16:45:44.516] Unexpected trienode heal packet peer=12b7db81 reqid=5,793,738,489,717,339,364
WARN [07-18|16:45:44.516] Unexpected trienode heal packet peer=e70811b4 reqid=2,177,353,076,724,056,341
WARN [07-18|16:45:44.763] Unexpected trienode heal packet peer=22e17f9c reqid=8,853,990,770,070,346,560
WARN [07-18|16:45:44.811] Unexpected trienode heal packet peer=e6382c70 reqid=739,770,737,457,955,743
INFO [07-18|16:45:44.850] Imported new block headers count=1 elapsed=6.160ms number=12,851,749 hash=7a3243..6dae55 age=1m23s
WARN [07-18|16:45:44.871] Unexpected trienode heal packet peer=cd828ba4 reqid=8,798,951,724,798,149,997
WARN [07-18|16:45:44.873] Unexpected trienode heal packet peer=6e5635a0 reqid=8,705,205,715,673,999,671
WARN [07-18|16:45:44.875] Unexpected trienode heal packet peer=cbdd12a9 reqid=1,750,396,375,655,342,341
WARN [07-18|16:45:44.876] Unexpected trienode heal packet peer=9f0aaa30 reqid=4,072,861,921,597,398,313
WARN [07-18|16:45:44.879] Unexpected trienode heal packet peer=a0378d3b reqid=8,386,183,962,563,361,933
WARN [07-18|16:45:44.879] Unexpected trienode heal packet peer=d9d3b409 reqid=8,295,232,788,244,893,801
WARN [07-18|16:45:44.882] Unexpected trienode heal packet peer=5532d114 reqid=3,623,043,430,561,675,816
WARN [07-18|16:45:44.911] Unexpected trienode heal packet peer=765491aa reqid=7,752,667,725,631,141,893
WARN [07-18|16:45:44.958] Unexpected trienode heal packet peer=17935a2e reqid=4,857,961,192,605,514,856
WARN [07-18|16:45:45.009] Unexpected trienode heal packet peer=fb395765 reqid=2,054,719,662,305,818,389
WARN [07-18|16:45:45.058] Unexpected trienode heal packet peer=6d63f63c reqid=6,657,913,199,915,039,014
WARN [07-18|16:45:45.081] Unexpected trienode heal packet peer=a8523f94 reqid=7,638,890,637,230,011,266
WARN [07-18|16:45:45.114] Unexpected trienode heal packet peer=889cc785 reqid=2,929,668,062,071,838,105
WARN [07-18|16:45:45.169] Unexpected trienode heal packet peer=a6d04daf reqid=8,942,089,650,430,097,371
WARN [07-18|16:45:45.171] Unexpected trienode heal packet peer=9e9cac22 reqid=4,503,892,313,274,969,736
WARN [07-18|16:45:45.229] Unexpected trienode heal packet peer=5d8589a4 reqid=3,594,601,607,111,485,883
WARN [07-18|16:45:45.231] Unexpected trienode heal packet peer=91bbf60e reqid=5,875,472,871,662,902,193
WARN [07-18|16:45:45.259] Unexpected trienode heal packet peer=d775bde9 reqid=1,079,832,331,273,526,425
INFO [07-18|16:45:45.378] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=243.11KiB throttle=1079
I checked again: 93% of my logs are "Unexpected trienode heal packet".
If there is anything I can do to help gather more logs to understand how to fix it, let me know.
@mohamedmansour
The "State heal in progress" stage takes several hours depending on your hardware specs. How many CPU cores do you have? What is your network bandwidth?
I have 4 CPUs, 16GB RAM, 2Gbit network on Azure VM https://azure.microsoft.com/en-us/blog/introducing-b-series-our-new-burstable-vm-size/ (Standard_B4ms) = $140/month
AWS c6g.xlarge. That's 4 high-performance cores and 16GiB of RAM, and I added an EBS volume with 6000 IOPS and 1000MiB/s throughput, on up to 10 Gb networking. One night of snap sync.
WARN [07-23|07:18:10.749] Unexpected trienode heal packet peer=a6d04daf reqid=5,628,752,454,634,351,532
WARN [07-23|07:18:10.749] Unexpected trienode heal packet peer=bfdaf535 reqid=2,129,613,245,800,917,217
WARN [07-23|07:18:10.759] Unexpected trienode heal packet peer=30ac3a13 reqid=2,746,357,763,380,049,576
WARN [07-23|07:18:12.273] Unexpected trienode heal packet peer=e28df4e4 reqid=1,660,082,010,578,520,634
WARN [07-23|07:18:13.454] Unexpected trienode heal packet peer=580f3239 reqid=38,957,340,314,670,213
INFO [07-23|07:18:18.379] State heal in progress accounts=228,[email protected] slots=474,[email protected] [email protected] nodes=2,520,[email protected] pending=22518
INFO [07-23|07:18:18.821] Imported new block headers count=1 elapsed=5.535ms number=12,881,249 hash=1a723b..7d6a91
WARN [07-23|07:18:21.471] Checkpoint challenge timed out, dropping id=4537608e56285b37 conn=inbound addr=65.21.95.47:34618 type=erigon/v2021.07.4/li...
INFO [07-23|07:18:26.424] State heal in progress accounts=228,[email protected] slots=474,[email protected] [email protected] nodes=2,521,[email protected] pending=22194
WARN [07-23|07:18:26.453] Unexpected trienode heal packet peer=e28df4e4 reqid=7,019,785,347,053,342,247
WARN [07-23|07:18:31.475] Unexpected trienode heal packet peer=30ac3a13 reqid=8,104,986,034,277,065,248
WARN [07-23|07:18:31.492] Unexpected trienode heal packet peer=805b1ebf reqid=847,938,211,426,238,144
WARN [07-23|07:18:32.924] Unexpected trienode heal packet peer=436cfb8d reqid=5,826,258,251,133,243,579
INFO [07-23|07:18:35.204] State heal in progress accounts=228,[email protected] slots=474,[email protected] [email protected] nodes=2,523,[email protected] pending=22234
WARN [07-23|07:18:35.205] Unexpected trienode heal packet peer=580f3239 reqid=2,897,966,948,454,458,237
WARN [07-23|07:18:35.231] Unexpected trienode heal packet peer=e28df4e4 reqid=6,338,490,710,152,459,505
WARN [07-23|07:18:35.240] Unexpected trienode heal packet peer=0f8d93b7 reqid=6,558,336,474,454,724,048
WARN [07-23|07:18:36.250] Unexpected trienode heal packet peer=c34194de reqid=4,258,489,952,416,859,887
WARN [07-23|07:18:36.253] Unexpected trienode heal packet peer=61551118 reqid=2,074,155,154,236,670,882
WARN [07-23|07:18:36.255] Unexpected trienode heal packet peer=54080ee5 reqid=6,077,024,708,527,433,002
WARN [07-23|07:18:37.657] Unexpected trienode heal packet peer=01f352da reqid=9,041,030,120,532,686,969
WARN [07-23|07:18:38.767] Unexpected trienode heal packet peer=a6d04daf reqid=9,015,027,624,447,844,791
WARN [07-23|07:18:39.993] Unexpected trienode heal packet peer=30ac3a13 reqid=708,368,138,623,231,300
INFO [07-23|07:18:40.155] Imported new block headers count=1 elapsed=5.390ms number=12,881,250 hash=75e84f..8e55ae
INFO [07-23|07:18:40.196] Imported new block headers count=1 elapsed=5.099ms number=12,881,251 hash=203c55..a9c8ba
INFO [07-23|07:18:40.299] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=243.22KiB throttle=1078
INFO [07-23|07:18:43.248] State heal in progress accounts=229,[email protected] slots=475,[email protected] [email protected] nodes=2,525,[email protected] pending=22579
INFO [07-23|07:18:51.645] State heal in progress accounts=229,[email protected] slots=475,[email protected] [email protected] nodes=2,527,[email protected] pending=22001
INFO [07-23|07:18:59.679] State heal in progress accounts=229,[email protected] slots=475,[email protected] [email protected] nodes=2,529,[email protected] pending=22249
INFO [07-23|07:19:04.571] Imported new block headers count=1 elapsed=5.083ms number=12,881,252 hash=ade0fb..eb0a79
INFO [07-23|07:19:07.654] Imported new block headers count=1 elapsed=5.050ms number=12,881,253 hash=69ff20..5b4202
INFO [07-23|07:19:08.080] State heal in progress accounts=230,[email protected] slots=475,[email protected] [email protected] nodes=2,530,[email protected] pending=22411
INFO [07-23|07:19:10.778] Imported new block headers count=1 elapsed=5.106ms number=12,881,254 hash=ce4e76..b337cd
WARN [07-23|07:19:15.211] Unexpected bytecode heal packet peer=a6d04daf reqid=5,663,218,223,262,397,112
INFO [07-23|07:19:16.520] State heal in progress accounts=230,[email protected] slots=475,[email protected] [email protected] nodes=2,532,[email protected] pending=22693
WARN [07-23|07:19:21.465] Unexpected bytecode heal packet peer=54080ee5 reqid=3,077,866,949,983,522,985
INFO [07-23|07:19:24.589] State heal in progress accounts=230,[email protected] slots=476,[email protected] [email protected] nodes=2,533,[email protected] pending=22916
INFO [07-23|07:19:29.071] Imported new block headers count=1 elapsed=5.203ms number=12,881,255 hash=260e4c..1afdb0
INFO [07-23|07:19:32.599] State heal in progress accounts=230,[email protected] slots=476,[email protected] [email protected] nodes=2,535,[email protected] pending=23256
INFO [07-23|07:19:40.710] State heal in progress accounts=230,[email protected] slots=476,[email protected] [email protected] nodes=2,537,[email protected] pending=23830
INFO [07-23|07:19:41.280] Imported new block headers count=1 elapsed=5.200ms number=12,881,256 hash=9d6a30..18ed89
INFO [07-23|07:19:41.330] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=236.02KiB throttle=1111
INFO [07-23|07:19:48.871] State heal in progress accounts=231,[email protected] slots=476,[email protected] [email protected] nodes=2,539,[email protected] pending=24121
INFO [07-23|07:19:56.999] State heal in progress accounts=231,[email protected] slots=477,[email protected] [email protected] nodes=2,541,[email protected] pending=24316
INFO [07-23|07:20:05.161] State heal in progress accounts=231,[email protected] slots=478,[email protected] [email protected] nodes=2,543,[email protected] pending=23817
INFO [07-23|07:20:13.392] State heal in progress accounts=231,[email protected] slots=478,[email protected] [email protected] nodes=2,545,[email protected] pending=23275
INFO [07-23|07:20:21.912] State heal in progress accounts=231,[email protected] slots=479,[email protected] [email protected] nodes=2,548,[email protected] pending=22951
INFO [07-23|07:20:29.985] State heal in progress accounts=231,[email protected] slots=479,[email protected] [email protected] nodes=2,550,[email protected] pending=22448
INFO [07-23|07:20:29.992] Imported new block headers count=1 elapsed=5.098ms number=12,881,257 hash=830c05..7bd316 age=1m21s
WARN [07-23|07:20:34.268] Unexpected trienode heal packet peer=0f8d93b7 reqid=5,110,344,849,506,036,345
WARN [07-23|07:20:34.624] Unexpected trienode heal packet peer=c34194de reqid=176,407,516,446,904,836
WARN [07-23|07:20:34.875] Unexpected trienode heal packet peer=61551118 reqid=7,745,011,524,738,817,427
WARN [07-23|07:20:36.752] Unexpected trienode heal packet peer=01f352da reqid=8,576,927,767,064,613,316
WARN [07-23|07:20:37.340] Unexpected trienode heal packet peer=a6d04daf reqid=7,790,792,082,625,331,242
INFO [07-23|07:20:38.767] State heal in progress accounts=232,[email protected] slots=480,[email protected] [email protected] nodes=2,551,[email protected] pending=22355
WARN [07-23|07:20:40.802] Unexpected trienode heal packet peer=580f3239 reqid=2,584,556,318,708,171,106
WARN [07-23|07:20:41.044] Unexpected trienode heal packet peer=e28df4e4 reqid=4,683,100,952,475,139,577
WARN [07-23|07:20:41.063] Unexpected trienode heal packet peer=0f8d93b7 reqid=7,292,729,526,860,797,397
INFO [07-23|07:20:46.982] State heal in progress accounts=232,[email protected] slots=480,[email protected] [email protected] nodes=2,553,[email protected] pending=22692
INFO [07-23|07:20:55.447] State heal in progress accounts=232,[email protected] slots=481,[email protected] [email protected] nodes=2,556,[email protected] pending=22977
INFO [07-23|07:21:00.452] Imported new block headers count=1 elapsed=6.045ms number=12,881,258 hash=793478..ba029d age=1m33s
INFO [07-23|07:21:01.368] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=237.05KiB throttle=1106
INFO [07-23|07:21:03.583] State heal in progress accounts=232,[email protected] slots=482,[email protected] [email protected] nodes=2,558,[email protected] pending=22942
INFO [07-23|07:21:12.387] State heal in progress accounts=232,[email protected] slots=483,[email protected] [email protected] nodes=2,560,[email protected] pending=22521
INFO [07-23|07:21:20.939] State heal in progress accounts=232,[email protected] slots=484,[email protected] [email protected] nodes=2,562,[email protected] pending=22119
INFO [07-23|07:21:29.133] State heal in progress accounts=232,[email protected] slots=484,[email protected] [email protected] nodes=2,564,[email protected] pending=21908
Same result as @mohamedmansour (also, hey, love watchtheburn ❤️! I happen to be building something very similar).
If I'm reading the resource use correctly (everything way under capacity, chain data only 376GiB) while the node count keeps going up, then this node is now completely constrained by how quickly it can get the data it needs from other nodes? From the looks of it, I could go down to a machine half this size.
Sounds like what @mohamedmansour and I are seeing is perhaps not entirely expected but unrelated to the bug flagged in this issue. Correct?
Unrelated to the issue, in case it helps anyone: the machine is mostly CPU-constrained during the sync (fascinating!), sits at 50% RAM use, and the IO I reserved was way overkill; it never crossed 3000 IOPS / 180MiB/s.
Heh, doing a fast sync now, the machine above is averaging 80% CPU, 80% memory, 100-250MB/s IO, and 200Mb/s network; it's basically on fire 🔥. Hats off to the Geth team for the efficiency of snap sync 👌.
Let me know if I should open a new issue if fast sync ends up beating snap sync 😬.
I am doubling my cloud volume now to 20K IOPS and will see how state healing does when it gets to that stage. Bare-metal VMs are perhaps going to be the solution. I wish Geth would honestly log that the hardware is not good enough, if it can detect that.
Snap sync done 👌. It looks like 3 days. Surprisingly only 384G of disk use. Fast sync is done too, 355G used. I'm not sure which was faster, but in any case the difference seems quite small for these 'medium power' machines.
Seems to work fine for me with the above machine 🙌 .
Snap sync took 1 day on an AWS c6g.xlarge machine, while it may run 3 days on the AWS a1.xlarge, which is only 20-30% lower in CPU performance.
4 high-speed (not low-end) CPU cores are the bare minimum for the latest Geth version.
What is the "target" number of accounts/slots/codes/nodes, currently? Interested to try to estimate my ETA ;-)
FWIW: My last State heal in progress log line was accounts=803,[email protected] slots=1,760,[email protected] [email protected] nodes=8,983,[email protected] pending=0. After that it was Rebuilding state snapshot (which seems to need another ~ 24 hours).
So is there some metric we can look at to understand (absolute) sync progress? Some target number it needs to hit? Which? Thank you!
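As far as this thread shows, geth publishes no fixed target number for accounts/slots/nodes; the counters in "State heal in progress" (and in eth.syncing, shown later in this thread) only ever move forward. One rough way to gauge movement is to sample the healedTrienodes counter twice and compute a rate. A minimal Python sketch with made-up sample numbers (the helper name heal_rate is mine, not geth's; this gives a rate, not a real ETA, since the total is unknown):

```python
def heal_rate(nodes_t0: int, nodes_t1: int, seconds: float) -> float:
    """Trie nodes healed per second between two manual samples
    of the healedTrienodes counter from geth's eth.syncing output."""
    if seconds <= 0:
        raise ValueError("samples must be taken at different times")
    return (nodes_t1 - nodes_t0) / seconds

# Illustrative numbers only (not from a real node):
rate = heal_rate(2_520_000, 2_548_000, 300)  # two samples 5 minutes apart
print(f"{rate:.1f} nodes/s")  # -> 93.3 nodes/s
```

If the rate trends toward zero while "pending" stays high, the heal is likely losing the race against chain progression rather than converging.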
I believe snap sync requires at least 6 vCPUs and an SSD with high IOPS to work.
This might be related:
https://github.com/ethereum/go-ethereum/issues/23191#issuecomment-904306603
Similar problem here ...
I'm ~8 days in now. The state sync took ~2 days (including some downtime while I was setting up the rest of the server); the other 6 days it's been state healing.
Raspberry Pi ARM64 (8 GiB RAM, overclocked @ 4 x 2,100 MHz), Disk: SSD (200+ MiB/s), Net: 2 Gbit/s
Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing: WARN [08-22|00:03:07.286] Unexpected trienode heal packet peer=XYZ reqid=XYZ
The Geth folder currently utilizes 400 GiB (no archive).
Is the "accounts" number supposed to just keep increasing? I noticed in the logs that it reset itself to 0 several times in the last 6 days, I think whenever the node was restarted.
So, what I have so far:
- 40 thousand "Unexpected trienode heal packets" in 6 days
- Accounts go back to 0 whenever the node is restarted
This is what it shows during shutdown ... I'm not sure why it's not considered a clean shutdown; I send it gentle signals, just the default stuff that systemd sends.
Failed to journal state snapshot err="snapshot [0xYYYYYYYYYYYYY....YYYYYYYYYYYYYYYYY] missing"
Writing clean trie cache to disk path=/mnt/ssd/geth-data/geth/triecache threads=4
Persisted the clean trie cache path=/mnt/ssd/geth-data/geth/triecache elapsed=42.197ms
Blockchain stopped
Once it resumes, I see this:
[...]
Loaded last fast-sync pivot marker number=13,080,857
Snapshot maintenance disabled (syncing)
Loaded local transaction journal transactions=0 dropped=0
Upgrading chain index type=bloombits percentage=99
Regenerated local transaction journal transactions=0 accounts=0
[...]
Unclean shutdown detected booted=2021-08-xxTxx:xx:xx-xxxx age=6d12h9m
and then it restarts here:
State heal in progress [email protected] [email protected] [email protected] nodes=37,526,[email protected] pending= 1
I'm experiencing this same issue. Same setup as #23191 (comment) (Raspberry Pi 8GB system with SSD) running the latest Ethereum on ARM build. Geth version info:
Geth
Version: 1.10.8-stable
Git Commit: 26675454bf93bf904be7a43cce6b3f550115ff90
Git Commit Date: 20210824
Architecture: arm64
Go Version: go1.16.4
Operating System: linux
GOPATH=
GOROOT=go
Using the default snap sync and the log is full of "Unexpected trienode heal packet" entries.
I switched to fast sync (maintaining the already downloaded data), but it was still not done after 2-3 days. Thus, I rsync'd the whole geth folder to an aarch64 machine and completed the snap sync there, which took 1-2 days. Then I rsync'd everything back and it's running smoothly on the Pi now.
Not ideal, but worked for me.
I run several geth nodes, but I've always used Sync mode fast. I am building a new machine, so I figured I would try snap (the new default). This is running on a HP DL 380 G5 with 24 GB of RAM and a 1TB SSD.
Using SNAP sync, geth will go into these modes of compacting the database and it completely halts the sync process. It does this for over an hour sometimes. But, I let it continue. Now, this morning my sync appears to be stuck in "state heal in progress". It looks like it's been at this for hours. I give up. I am going back to Fast.
How did you go back to fast without abandoning the data you had already downloaded? What is the command like? Thanks!
If my state healing looks like t=2021-10-05T13:53:30+0000 lvl=info msg="State heal in progress" [email protected] [email protected] [email protected] nodes=408,[email protected] pending=55412, should nodes eventually reach 8 million+?
@ttibbetts
A Raspberry Pi with 4 cores is probably not capable of doing snap sync. Try disabling this feature and using regular sync.
If this can help someone: I had this issue, and in my case it was because I started geth from the config.toml file, and the node was overriding the cache attribute with the default. Starting it with --cache xxx --config /path/to/config solved it for me.
Trying to do a snap sync on a recent Ryzen machine (not just a Pi; it should be performant enough), and it has been sitting at this for over a day now.
INFO [10-08|17:56:51.721] State heal in progress [email protected] [email protected] [email protected] nodes=1,677,[email protected] pending=1
It seems to have imported receipts and headers to tip (or near to it, at least)
INFO [10-08|17:52:15.561] Imported new block receipts count=27 elapsed=51.550ms number=13,379,325 hash=744aa1..d23c2c age=1h5m56s size=2.88MiB
INFO [10-08|17:52:15.730] Imported new block receipts count=25 elapsed=56.574ms number=13,379,350 hash=56a47b..4a45e8 age=1h2m33s size=2.77MiB
INFO [10-08|17:52:15.887] Imported new block receipts count=19 elapsed=63.894ms number=13,379,369 hash=e15d90..6a1567 age=58m20s size=2.88MiB
INFO [10-08|17:52:16.025] Imported new block receipts count=23 elapsed=52.420ms number=13,379,392 hash=2b93bf..9dc915 age=54m size=2.66MiB
INFO [10-08|17:52:16.302] Imported new block receipts count=79 elapsed=213.982ms number=13,379,471 hash=d8a9e1..1b1d62 age=35m11s size=11.19MiB
The nodes-count is increasing -- it's progressing.
@holiman How high does the nodes-count need to go?
I'm at ".. nodes=86,597,863"
I'd also like to know this. I'm (sadly) only at nodes=23,591,000. Wondering if I should let the state heal complete or switch to fast sync, as it seems others have had success with that. Thank you for all the hard work you folks put in!!
My Geth 1.10.16 node running snap sync mode also reports regular "State heal in progress" and a frequent stream of repeated "Unexpected trienode heal packet" log messages. It also reports regular "Snapshot extension registration failed" messages. I don't recall seeing these messages when I was running Geth 1.10.15. I switched Geth's sync mode from "fast" to "snap" starting with Geth 1.10.15 because "fast" sync mode was no longer available.
After downgrading Geth from 1.10.16 to 1.10.15, Geth still reports frequent "Snapshot extension registration failed" and "State heal in progress" log messages, but reported no "Unexpected trienode heal packet" log messages until after I pressed Ctrl-c to shut down the node.
I have been waiting for more than 1 day after a snap sync, on a beefy machine with a beefy internet connection. It is stuck in "State heal in progress" with no indication of an ETA whatsoever. What's the point of downloading the state from the peers to speed up the sync, just to waste time doing "state heal" later? 👎
Mar 14 15:47:21 host1 geth[75528]: INFO [03-14|15:47:21.853] State heal in progress accounts=103,[email protected] slots=180,[email protected] [email protected] nodes=901,[email protected] pending=104,371
If this "heal" is going to "fix" 175M Ethereum accounts (it is now at 103,774), it will take more than a month to complete. What exactly is it going to do?
I wish Geth would honestly log that the hardware is not good enough, if it can detect that.
Totally concur. If you are providing an app bound to hardware, you must benchmark the hardware before you start the app; otherwise the app won't work, and since you are the one providing the app, you are also implicitly responsible for letting the user run it on slow hardware. The Ethereum team has been giving us low-quality engineering since 2015.
Mar 14 15:47:21 host1 geth[75528]: INFO [03-14|15:47:21.853] State heal in progress accounts=103,[email protected] slots=180,[email protected] [email protected] nodes=901,[email protected] pending=104,371
If this "heal" is going to "fix" 175M Ethereum accounts (it is now at 103,774), it will take more than a month to complete. What exactly is it going to do?
The state heal process is a little deceiving. I was at first worried that my Raspberry Pi 4B + 2TB SSD was having trouble syncing, since during state heal the accounts, slots, and nodes were going up slowly relative to how many accounts/slots there are in total.
My state heal finished in about 21 hours at 1,182,357 accounts, 2,327,247 slots, and 12,186,486 nodes. I also had many instances of "Unexpected trienode heal packet", which didn't seem to affect anything. I have seen other people finish at larger numbers, so I think it's very dependent on your own sync, but you definitely don't need to reach 175M accounts!
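As a back-of-the-envelope check of the numbers in that comment (a rough Python sketch; the figures are copied from the comment above, and the point is only that healing visits far fewer accounts than the full ~175M):

```python
accounts = 1_182_357      # accounts healed at completion (from the comment above)
hours = 21                # reported heal duration

rate = accounts / (hours * 3600)
print(f"~{rate:.1f} accounts/s healed")  # -> ~15.6 accounts/s healed

# Healing only repairs state that went stale while the snapshot was
# being downloaded, so the finish line is a small fraction of all
# accounts -- at this rate, 175M accounts would indeed take months,
# but the heal never needs to touch them all.
```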
That's good to know @leontodd .
But the question remains: how do I get any indication of the progress/ETA of the state heal?
I would like to know this too. On my first attempt, geth suddenly reported that something was not in order and that the blockchain had to be rewound with target=0 (aka "start from scratch"), which I then did, and I am now in state heal. Sync started (again) on August 4th; the sync itself was done in three days, and state heal has been running since August 7th (24 days now).
Output looks like this:
INFO [08-31|08:40:43.441] State heal in progress [email protected] [email protected] [email protected] nodes=12,480,[email protected] pending=18993
INFO [08-31|08:40:43.451] State heal in progress [email protected] [email protected] [email protected] nodes=12,480,[email protected] pending=18993
INFO [08-31|08:40:43.542] Imported new block headers count=1 elapsed=3.267ms number=15,445,101 hash=ee34c1..82f2c5
WARN [08-31|08:40:43.550] Unexpected trienode heal packet peer=4efd3711 reqid=7,328,963,860,793,291,079
WARN [08-31|08:40:43.550] Unexpected trienode heal packet peer=9bf27c44 reqid=934,092,868,083,411,094
WARN [08-31|08:40:43.550] Unexpected trienode heal packet peer=5a070bad reqid=7,559,473,032,996,682,308
WARN [08-31|08:40:43.550] Unexpected trienode heal packet peer=1290f9f1 reqid=7,402,543,419,581,238,141
WARN [08-31|08:40:43.551] Unexpected trienode heal packet peer=a85cfbac reqid=5,676,764,028,462,137,471
WARN [08-31|08:40:43.551] Unexpected trienode heal packet peer=de7b486d reqid=3,356,032,926,609,925,242
WARN [08-31|08:40:43.551] Unexpected trienode heal packet peer=9bd73479 reqid=3,878,834,481,290,902,674
WARN [08-31|08:40:43.551] Unexpected trienode heal packet peer=468924e2 reqid=6,636,285,917,303,600,443
WARN [08-31|08:40:43.553] Unexpected trienode heal packet peer=92adf5a4 reqid=3,533,469,970,467,382,671
WARN [08-31|08:40:43.554] Unexpected trienode heal packet peer=703154b2 reqid=4,629,776,562,703,112,673
WARN [08-31|08:40:43.698] Unexpected trienode heal packet peer=b70c1667 reqid=367,965,809,434,615,351
INFO [08-31|08:40:44.452] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=173.41KiB throttle=1512
WARN [08-31|08:40:44.452] Pivot became stale, moving old=15,444,960 new=15,445,039
INFO [08-31|08:40:44.536] Imported new block receipts count=79 elapsed=83.833ms number=15,445,038 hash=d3de2b..2731b3 age=12m51s size=11.35MiB
INFO [08-31|08:40:44.536] State heal in progress [email protected] [email protected] [email protected] nodes=12,480,[email protected] pending=1
WARN [08-31|08:40:44.812] Unexpected trienode heal packet peer=4883dc3a reqid=3,218,495,347,431,903,237
INFO [08-31|08:40:47.007] Imported new block headers count=1 elapsed=3.586ms number=15,445,102 hash=d8caad..e1f5ef
INFO [08-31|08:40:53.705] Imported new block headers count=1 elapsed=3.250ms number=15,445,103 hash=4eae4e..75c64a
INFO [08-31|08:40:58.106] State heal in progress [email protected] [email protected] [email protected] nodes=12,480,[email protected] pending=273
"pending" normally rises until it is somewhere between 110,000 and 130,000. Then a bunch of "Unexpected trienode heal packet" lines appear, and "pending" restarts at 1. It is rare for this to happen with "pending" below 20k, as above.
I also get lines telling me that geth doesn't like nimbus (which is running and fully synced; the connection to geth works, according to nimbus's output):
WARN [08-31|08:41:47.076] Merge is configured, but no beacon client seen. Please ensure you have one available before the transition arrives!
What really puzzles me is the output accounts=…@… slots=…@… codes=…@… in the state heal progress. I remember that these had much higher numbers on my first try.
And I am really anxious about the disk space used. It goes beyond anything I have read online that an eth1 node should need:
$ LC_ALL=C du -hsc * 2>/dev/null
628G chaindata
5.0K jwtsecret
512 lightchaindata
512 LOCK
4.5K nodekey
5.7M nodes
512 transactions.rlp
2.8M triecache
628G total
(But it has not been growing much in the last 2 weeks.)
This is my current sync state, and while there is some progress, it has been minimal over the past 3 weeks.
Welcome to the Geth JavaScript console!
instance: Geth/v1.10.23-stable/linux-amd64/go1.19
coinbase: <removed>
at block: 0 (Thu Jan 01 1970 01:00:00 GMT+0100 (CET))
datadir: <removed>
modules: admin:1.0 debug:1.0 engine:1.0 eth:1.0 ethash:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
To exit, press ctrl-d or type exit
> eth.syncing
{
currentBlock: 15445102,
healedBytecodeBytes: 213959,
healedBytecodes: 42,
healedTrienodeBytes: 6573144668,
healedTrienodes: 12481207,
healingBytecode: 0,
healingTrienodes: 0,
highestBlock: 15445199,
startingBlock: 15442752,
syncedAccountBytes: 44351333509,
syncedAccounts: 229300287,
syncedBytecodeBytes: 4036242554,
syncedBytecodes: 622973,
syncedStorage: 788616132,
syncedStorageBytes: 168788862624
}
I am really thinking about starting from scratch...
Sync started (again) on August 4th, sync was done in three days, state heal runs since August 7th (24 days now).
If it's been trying to heal for 24 days, then it's not going to succeed. The thing with this phase is that the node has a nearly accurate state trie, with some errors in it, and it starts trying to patch it up. The errors are due to the block progression during the earlier phases, so if the first part was done "in three days", then basically there are three days of state changes that need to be applied to the state trie.
Now, while the state is healing, blocks are also progressed. So the state keeps moving. Therefore, every ~192 blocks or so, we need to move the 'pivot point' along, to target a more recent state.
WARN [08-31|08:40:44.452] Pivot became stale, moving old=15,444,960 new=15,445,039
Whenever we do so, we need to start from the root again, and go downwards while healing the trie. That is why pending becomes zero, so that's fine and as it should be.
But, if the blockchain state changes more rapidly than you are able to heal, then you will never catch up. These are the things that factor in:
- The amount of gas spent in each block (more gas == more changes)
- The speed of your disk (healing the trie is heavy on disk)
- The speed/latency of your internet connection
Now, while the state is healing, blocks are also progressed. So the state keeps moving. Therefore, every ~192 blocks or so, we need to move the 'pivot point' along, to target a more recent state.
WARN [08-31|08:40:44.452] Pivot became stale, moving old=15,444,960 new=15,445,039
Whenever we do so, we need to start from the root again, and go downwards while healing the trie. That is why "pending" becomes zero, so that's fine and as it should be.
Thanks for the explanation!
But, if the blockchain state changes more rapidly than you are able to heal, then you will never catch up. These are the things that factor in:
- The amount of gas spent in each block (more gas == more changes)
- The speed of your disk (healing the trie is heavy on disk)
$ LC_ALL=C dd if=/dev/zero of=.gethdata/test1.img bs=4096 count=262144
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.84334 s, 1.3 GB/s
- The speed/latency of your internet connection
Using github.com/taganaka/SpeedTest, the download speed varies between 100 and 380 Mbit/s. Upstream is always about 52 Mbit/s. Ping varies between 8 and 12 ms, and jitter is always <5 ms (usually 0).
All this makes me think that there must be something wrong within my database that causes the heal to stall and never catch up. Best would be to throw it all away, I guess... :unamused:
It would be nice if geth could detect such a state, though, and inform the user. Whether that's even possible is another question. :wink:
I ran geth removedb and removed both the local database and ancient database. When resyncing, the state heal was done in under 24 hours.
I tried this again but this time only removed the local database, and kept the ancient database. This time when resyncing the state heal went on for days without completing.
I suspect that there are too many states that need healing when resyncing from the ancient database, especially on slower hardware. This fits with your endless state heal @Yamakuzure as you resynced by rewinding the chain. I don't know much about the technical aspects of the state heal process, but resyncing from scratch has definitely helped in my case.
Possible fix: https://github.com/ethereum/go-ethereum/pull/25651 . People who have reported problems -- did you run with any particular settings for the amount of peers?
I suspect that there are too many states that need healing when resyncing from the ancient database, especially on slower hardware. This fits with your endless state heal @Yamakuzure as you resynced by rewinding the chain. I don't know much about the technical aspects of the state heal process, but resyncing from scratch has definitely helped in my case.
Oh, sorry, I aborted the (super-slow) rewinding, as a target of 0 meant an rm -rf * would do the same anyway, just faster.
I started from scratch (aka an empty geth data directory) every time and never kept the nodes or ancient data directories.
Possible fix: #25651 . People who have reported problems -- did you run with any particular settings for the amount of peers?
No.