erigon Losing Sync Since March 13

My node on ETH Mainnet has been getting out of sync repeatedly since March 13. The main unique thing about my setup is that snapshots are on an external SSD (symlinked) to save space on the NVMe.

System information

Erigon version: ./erigon --version

erigon version 2.58.2-125509e4

Updated to 2.59.0 and still losing sync

OS & Version: Windows/Linux/OSX

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal

Commit hash:

Erigon Command (with flags/config):

[Unit]
Description=Erigon Execution Layer Client service
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=dylan
Restart=on-failure
RestartSec=3
KillSignal=SIGINT
TimeoutStopSec=300
ExecStart=/home/dylan/erigon/build/bin/erigon \
  --datadir /var/lib/erigon \
  --chain mainnet \
  --db.size.limit=8TB \
  --metrics \
  --pprof \
  --prune htc \
  --authrpc.jwtsecret=/secrets/jwtsecret \
  --http \
  --http.addr 192.168.0.21 \
  --http.api eth,net,web3

[Install]
WantedBy=multi-user.target

Consensus Layer:

Lighthouse v5.1.1-2a3c709
BLS library: blst-modern
SHA256 hardware acceleration: false
Allocator: jemalloc
Profile: maxperf
Specs: mainnet (true), minimal (false), gnosis (true)

Consensus Layer Command (with flags/config):

[Unit]
Description=eth beacon chain service
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=dylan
Restart=on-failure
ExecStart=/usr/bin/lighthouse bn \
  --network mainnet \
  --staking \
  --validator-monitor-auto \
  --metrics \
  --suggested-fee-recipient <redacted> \
  --checkpoint-sync-url=https://beaconstate.info \
  --execution-endpoint http://127.0.0.1:8551 \
  --execution-jwt /secrets/jwtsecret \
  --monitoring-endpoint https://beaconcha.in/api/v1/client/metrics?apikey=<redacted>&machine=Nuc

[Install]
WantedBy=multi-user.target

Chain/Network: ETH Mainnet

Expected behaviour

Stay synced.

Actual behaviour

Losing sync often

Steps to reproduce the behaviour

Backtrace

[backtrace]

Mar 20 '24 01:03 dylancaponi

Let me know what else I could grep the logs for to help diagnose this issue. I didn't see warning or error.

Mar 20 '24 01:03 dylancaponi

One thing I've noticed is that the execution phase seems very slow:

Mar 19 21:39:02 nuc erigon[986]: [INFO] [03-19|21:39:02.819] [4/12 Execution] Executed blocks         number=19470961 blk/s=0.1 tx/s=27.7 Mgas/s=2.0 gasState=0.00 batch=861.6KB alloc=3.6GB sys=8.0GB
Mar 19 21:39:41 nuc erigon[986]: [INFO] [03-19|21:39:41.942] [4/12 Execution] Executed blocks         number=19470965 blk/s=0.1 tx/s=19.4 Mgas/s=2.1 gasState=0.00 batch=1012.5KB alloc=2.9GB sys=8.0GB
Mar 19 21:40:05 nuc erigon[986]: [INFO] [03-19|21:40:05.355] [4/12 Execution] Executed blocks         number=19470969 blk/s=0.2 tx/s=28.5 Mgas/s=2.2 gasState=0.00 batch=1.1MB alloc=3.1GB sys=8.0GB

Mar 20 '24 01:03 dylancaponi

Ah, here are some errors:

Mar 19 15:19:28 nuc erigon[986]: [WARN] [03-19|15:19:28.865] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 15:25:57 nuc erigon[986]: [WARN] [03-19|15:25:57.855] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 15:41:04 nuc erigon[986]: [WARN] [03-19|15:41:04.040] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:00:14 nuc erigon[986]: [WARN] [03-19|16:00:11.640] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:14:39 nuc erigon[986]: [WARN] [03-19|16:14:39.917] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:30:22 nuc erigon[986]: [WARN] [03-19|16:30:22.702] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:39:22 nuc erigon[986]: [WARN] [03-19|16:39:22.187] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 17:03:45 nuc erigon[986]: [WARN] [03-19|17:03:45.462] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 17:57:16 nuc erigon[986]: [WARN] [03-19|17:57:16.289] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 18:45:31 nuc erigon[986]: [WARN] [03-19|18:45:31.304] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 18:56:29 nuc erigon[986]: [WARN] [03-19|18:56:29.358] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 19:17:38 nuc erigon[986]: [WARN] [03-19|19:17:38.316] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 20:50:50 nuc erigon[986]: [WARN] [03-19|20:50:50.687] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 20 01:06:51 nuc erigon[986]: [WARN] [03-20|01:06:51.827] [fetch] onNewBlock                       err="failed MdbxKV cursor.Seek(): mdbx_cursor_get: MDBX_BAD_VALSIZE: Invalid size or alignment of key or data for target database, either invalid subDB name, bucket: PoolTransaction,  key: ed80609b55bbb9a63618c9c6da1ed8d31d0a49962a98f27d69226dc9a528bcfa"
Mar 20 01:09:54 nuc erigon[986]: [WARN] [03-20|01:09:54.820] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"

Mar 20 '24 01:03 dylancaponi

Remove datadir/txpool
Add —snap.stop until next erigon’s upgrade

Mar 20 '24 05:03 AskAlexSharov

Remove datadir/txpool

Add —snap.stop until next erigon’s upgrade

Thanks ended up upgrading to 2.59 and all good so far. Will reopen if not.

Mar 20 '24 19:03 dylancaponi

Reopening because I’m still losing sync on 2.59.0. Will post logs tonight.

Mar 21 '24 02:03 dylancaponi

Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"
Mar 21 19:32:56 nuc erigon[441457]: [WARN] [03-21|19:32:37.412] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:32:56 nuc erigon[441457]: [WARN] [03-21|19:32:37.412] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:32:56 nuc erigon[441457]: [WARN] [03-21|19:32:37.412] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"
Mar 21 19:35:21 nuc erigon[441457]: [WARN] [03-21|19:34:58.287] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:35:21 nuc erigon[441457]: [WARN] [03-21|19:35:21.201] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:35:21 nuc erigon[441457]: [WARN] [03-21|19:35:21.201] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"

Mar 21 '24 20:03 dylancaponi

hey could try to use --internalcl, just to see if it is an engine API specific issue? CC @somnathb1

Mar 21 '24 22:03 Giulio2002

@somnathb1 Suggested on discord:

Could you also (in this order) check the disk speed/performance in terms of access times - for all symlinks and datadir check the above again after running erigon try removing any competing process accessing the datadir (such as another erigon process, or an integration/mdbx command line) try removing $datadir/chaindata/mdbx.lck and restart the erigon process/container

Mar 22 '24 17:03 dylancaponi

Here are fio results testing snapshots (symlinked) and datadir while Erigon is stopped and a few minutes after starting it.

Snapshots before starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon/snapshots
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Test: Laying out IO file (1 file / 500MiB)

Test: (groupid=0, jobs=1): err= 0: pid=584985: Fri Mar 22 17:38:03 2024
  read: IOPS=151k, BW=590MiB/s (618MB/s)(500MiB/848msec)
    clat (nsec): min=570, max=838099, avg=6410.46, stdev=45685.29
     lat (nsec): min=595, max=838122, avg=6437.37, stdev=45685.45
    clat percentiles (nsec):
     |  1.00th=[   612],  5.00th=[   636], 10.00th=[   644], 20.00th=[   660],
     | 30.00th=[   676], 40.00th=[   684], 50.00th=[   700], 60.00th=[   716],
     | 70.00th=[   740], 80.00th=[   788], 90.00th=[   892], 95.00th=[  1020],
     | 99.00th=[329728], 99.50th=[391168], 99.90th=[473088], 99.95th=[497664],
     | 99.99th=[602112]
   bw (  KiB/s): min=601568, max=601568, per=99.63%, avg=601568.00, stdev= 0.00, samples=1
   iops        : min=150392, max=150392, avg=150392.00, stdev= 0.00, samples=1
  lat (nsec)   : 750=71.79%, 1000=22.81%
  lat (usec)   : 2=2.29%, 4=1.52%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.04%, 500=1.48%, 750=0.05%, 1000=0.01%
  cpu          : usr=5.08%, sys=23.61%, ctx=1999, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=590MiB/s (618MB/s), 590MiB/s-590MiB/s (618MB/s-618MB/s), io=500MiB (524MB), run=848-848msec

Disk stats (read/write):
  sda: ios=1707/0, merge=0/0, ticks=1260/0, in_queue=0, util=88.46%

Datadir before starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Test: Laying out IO file (1 file / 500MiB)

Test: (groupid=0, jobs=1): err= 0: pid=585044: Fri Mar 22 17:41:01 2024
  read: IOPS=204k, BW=796MiB/s (835MB/s)(500MiB/628msec)
    clat (nsec): min=563, max=951461, avg=4532.92, stdev=31085.75
     lat (nsec): min=587, max=951488, avg=4580.54, stdev=31086.00
    clat percentiles (nsec):
     |  1.00th=[   604],  5.00th=[   636], 10.00th=[   668], 20.00th=[   724],
     | 30.00th=[   764], 40.00th=[   804], 50.00th=[   868], 60.00th=[  1020],
     | 70.00th=[  1160], 80.00th=[  1448], 90.00th=[  2008], 95.00th=[  2192],
     | 99.00th=[171008], 99.50th=[261120], 99.90th=[444416], 99.95th=[473088],
     | 99.99th=[514048]
   bw (  KiB/s): min=808432, max=808432, per=99.16%, avg=808432.00, stdev= 0.00, samples=1
   iops        : min=202108, max=202108, avg=202108.00, stdev= 0.00, samples=1
  lat (nsec)   : 750=25.47%, 1000=33.82%
  lat (usec)   : 2=30.21%, 4=8.71%, 10=0.16%, 20=0.04%, 50=0.08%
  lat (usec)   : 100=0.22%, 250=0.73%, 500=0.55%, 750=0.02%, 1000=0.01%
  cpu          : usr=11.00%, sys=47.53%, ctx=1531, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=796MiB/s (835MB/s), 796MiB/s-796MiB/s (835MB/s-835MB/s), io=500MiB (524MB), run=628-628msec

Disk stats (read/write):
    dm-0: ios=1982/72, merge=0/0, ticks=852/0, in_queue=852, util=86.59%, aggrios=2009/68, aggrmerge=0/4, aggrticks=871/35, aggrin_queue=0, aggrutil=84.61%
  nvme0n1: ios=2009/68, merge=0/4, ticks=871/35, in_queue=0, util=84.61%

Snapshots after starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon/snapshots
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=69.0MiB/s][r=17.7k IOPS][eta 00m:00s]
Test: (groupid=0, jobs=1): err= 0: pid=585143: Fri Mar 22 17:44:06 2024
  read: IOPS=16.9k, BW=66.1MiB/s (69.3MB/s)(500MiB/7564msec)
    clat (nsec): min=462, max=191793k, avg=58581.45, stdev=773479.15
     lat (nsec): min=482, max=191793k, avg=58607.67, stdev=773479.12
    clat percentiles (nsec):
     |  1.00th=[     548],  5.00th=[     588], 10.00th=[     604],
     | 20.00th=[     628], 30.00th=[     644], 40.00th=[     668],
     | 50.00th=[     692], 60.00th=[     724], 70.00th=[     756],
     | 80.00th=[     804], 90.00th=[     900], 95.00th=[    1080],
     | 99.00th=[ 1810432], 99.50th=[ 5406720], 99.90th=[ 7438336],
     | 99.95th=[ 8159232], 99.99th=[11075584]
   bw (  KiB/s): min=42464, max=73216, per=98.63%, avg=66762.67, stdev=10164.32, samples=15
   iops        : min=10616, max=18304, avg=16690.67, stdev=2541.08, samples=15
  lat (nsec)   : 500=0.09%, 750=67.39%, 1000=26.13%
  lat (usec)   : 2=4.14%, 4=0.58%, 10=0.09%, 20=0.01%, 50=0.02%
  lat (usec)   : 100=0.05%, 250=0.06%, 500=0.10%, 750=0.06%, 1000=0.08%
  lat (msec)   : 2=0.23%, 4=0.23%, 10=0.72%, 20=0.01%, 250=0.01%
  cpu          : usr=0.70%, sys=5.66%, ctx=1918, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=66.1MiB/s (69.3MB/s), 66.1MiB/s-66.1MiB/s (69.3MB/s-69.3MB/s), io=500MiB (524MB), run=7564-7564msec

Disk stats (read/write):
  sda: ios=23086/8, merge=0/24, ticks=166931/35, in_queue=114592, util=97.68%

Datadir after starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
Test: (groupid=0, jobs=1): err= 0: pid=585208: Fri Mar 22 17:46:07 2024
  read: IOPS=116k, BW=453MiB/s (475MB/s)(500MiB/1104msec)
    clat (nsec): min=453, max=12097k, avg=7512.28, stdev=99586.78
     lat (nsec): min=475, max=12097k, avg=7611.07, stdev=99587.34
    clat percentiles (nsec):
     |  1.00th=[    532],  5.00th=[    572], 10.00th=[    636],
     | 20.00th=[    948], 30.00th=[   1144], 40.00th=[   1672],
     | 50.00th=[   1992], 60.00th=[   2192], 70.00th=[   2288],
     | 80.00th=[   2384], 90.00th=[   2576], 95.00th=[   2768],
     | 99.00th=[ 214016], 99.50th=[ 257024], 99.90th=[ 888832],
     | 99.95th=[1597440], 99.99th=[3751936]
   bw (  KiB/s): min=406496, max=467400, per=94.22%, avg=436948.00, stdev=43065.63, samples=2
   iops        : min=101624, max=116850, avg=109237.00, stdev=10766.41, samples=2
  lat (nsec)   : 500=0.14%, 750=13.74%, 1000=7.99%
  lat (usec)   : 2=28.14%, 4=48.05%, 10=0.26%, 20=0.08%, 50=0.05%
  lat (usec)   : 100=0.11%, 250=0.88%, 500=0.41%, 750=0.03%, 1000=0.03%
  lat (msec)   : 2=0.05%, 4=0.03%, 10=0.01%, 20=0.01%
  cpu          : usr=12.06%, sys=52.49%, ctx=1192, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=453MiB/s (475MB/s), 453MiB/s-453MiB/s (475MB/s-475MB/s), io=500MiB (524MB), run=1104-1104msec

Disk stats (read/write):
    dm-0: ios=1406/1474, merge=0/0, ticks=1012/4020, in_queue=5032, util=99.29%, aggrios=2075/2099, aggrmerge=0/67, aggrticks=1310/9904, aggrin_queue=8056, aggrutil=99.59%
  nvme0n1: ios=2075/2099, merge=0/67, ticks=1310/9904, in_queue=8056, util=99.59%

Mar 22 '24 17:03 dylancaponi

I went back to the time that validation actually started failing to look at logs and the errors right before failure were DumpBlocks: DumpBodies: header missed in db:

Mar 20 14:34:25 nuc erigon[986]: [INFO] [03-20|14:34:25.952] head updated                             hash=0xda49c626629e22ea4181b0990f0d4fc31e21f8cc574dd4d3f08467b466997d98 number=19476536
Mar 20 14:34:26 nuc erigon[986]: [INFO] [03-20|14:34:26.018] [snapshots] Prune Blocks                 to=18500000 limit=100
Mar 20 14:34:26 nuc erigon[986]: [WARN] [03-20|14:34:26.019] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 20 14:34:37 nuc erigon[986]: [INFO] [03-20|14:34:37.741] [NewPayload] Handling new payload        height=19476537 hash=0x37a1c155eaa2d740f2103c42a2465f20fb8be10f11f831ee46885b6378f96fbb
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.363] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.658] RPC Daemon notified of new headers       from=19476536 to=19476537 hash=0x37a1c155eaa2d740f2103c42a2465f20fb8be10f11f831ee46885b6378f96fbb header sending=13.673µs log sending=201ns
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.658] head updated                             hash=0x37a1c155eaa2d740f2103c42a2465f20fb8be10f11f831ee46885b6378f96fbb number=19476537
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.748] [snapshots] Prune Blocks                 to=18500000 limit=100
Mar 20 14:34:38 nuc erigon[986]: [WARN] [03-20|14:34:38.748] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 20 14:34:48 nuc erigon[986]: [INFO] [03-20|14:34:48.675] [NewPayload] Handling new payload        height=19476538 hash=0xb5c6395dd860ec4b8ef314dae67cee46b322bb94e50dbd77a84aad17944de777
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.355] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.643] RPC Daemon notified of new headers       from=19476537 to=19476538 hash=0xb5c6395dd860ec4b8ef314dae67cee46b322bb94e50dbd77a84aad17944de777 header sending=14.43µs log sending=216ns
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.643] head updated                             hash=0xb5c6395dd860ec4b8ef314dae67cee46b322bb94e50dbd77a84aad17944de777 number=19476538
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.727] [snapshots] Prune Blocks                 to=18500000 limit=100
Mar 20 14:34:49 nuc erigon[986]: [WARN] [03-20|14:34:49.728] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"

So there might be multiple issues.

Mar 22 '24 18:03 dylancaponi

Getting this again several hours after restart. Never finished syncing:

Mar 23 00:43:52 nuc erigon[585070]: [INFO] [03-23|00:42:47.572] [6/12 IntermediateHashes] Calculating Merkle root current key=d65e0322
Mar 23 00:43:52 nuc erigon[585070]: [INFO] [03-23|00:43:17.588] [6/12 IntermediateHashes] Calculating Merkle root current key=d66d67ab
Mar 23 00:43:52 nuc erigon[585070]: [INFO] [03-23|00:43:47.604] [6/12 IntermediateHashes] Calculating Merkle root current key=d67f8020
Mar 23 00:43:54 nuc erigon[585070]: [INFO] [03-23|00:43:54.232] [p2p] GoodPeers                          eth68=45 eth67=1
Mar 23 00:43:56 nuc erigon[585070]: [INFO] [03-23|00:43:56.359] [mem] memory stats                       Rss=30.6MB Size=0B Pss=30.6MB SharedClean=14.8KB SharedDirty=0B PrivateClean=25.4MB PrivateDirty=5.2MB Refer>
Mar 23 00:44:02 nuc erigon[585070]: [INFO] [03-23|00:44:02.013] [txpool] stat                            pending=0 baseFee=0 queued=30000 alloc=3.7GB sys=8.3GB
Mar 23 00:44:29 nuc erigon[585070]: [INFO] [03-23|00:44:18.321] [6/12 IntermediateHashes] Calculating Merkle root current key=d68f3ce9
Mar 23 00:44:47 nuc erigon[585070]: [INFO] [03-23|00:44:47.536] [6/12 IntermediateHashes] Calculating Merkle root current key=d697de19
Mar 23 00:45:18 nuc erigon[585070]: [INFO] [03-23|00:45:18.052] [6/12 IntermediateHashes] Calculating Merkle root current key=d6a597f2
Mar 23 00:46:02 nuc erigon[585070]: [INFO] [03-23|00:45:47.567] [6/12 IntermediateHashes] Calculating Merkle root current key=d6b5b4dd
Mar 23 00:46:12 nuc erigon[585070]: [WARN] [03-23|00:46:11.982] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 23 00:46:12 nuc erigon[585070]: [WARN] [03-23|00:46:12.281] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 23 00:46:12 nuc erigon[585070]: [WARN] [03-23|00:46:12.281] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"

Mar 23 '24 03:03 dylancaponi

Upgraded to:

Erigon v2.59.3
Lighthouse v5.1.3-3058b96

Lighthouse hotfix had a note about forkchoice issues and I feel like I saw this in my logs so hopefully that will fix at least part of this issue.

🤞 🤞 🤞

Mar 29 '24 18:03 dylancaponi

@yperbasis happy to see the label. Let me know if I can provide more logs or do any debugging. I've had about 50% attestations vs missed in the last month 😢

Apr 24 '24 18:04 dylancaponi

I'm still unable to stay in sync. Here are stage times from my most recent attempt:

Jun 04 22:34:51 nuc erigon[87450]: [INFO] [06-04|22:34:51.048] [1/12 Snapshots] DONE                    in=6m10.611197956s
Jun 04 22:43:36 nuc erigon[87450]: [INFO] [06-04|22:43:36.394] [3/12 Senders] DONE                      in=3m32.91207169s
Jun 05 02:53:39 nuc erigon[87450]: [INFO] [06-05|02:53:39.319] [4/12 Execution] DONE                    in=4h10m2.924591551s
Jun 05 07:25:06 nuc erigon[87450]: [INFO] [06-05|07:25:06.130] [5/12 HashState] DONE                    in=4h31m26.810970751s
Jun 05 08:12:15 nuc erigon[87450]: [INFO] [06-05|08:12:15.014] [6/12 IntermediateHashes] DONE           in=47m8.883974433s
Jun 05 09:28:53 nuc erigon[87450]: [INFO] [06-05|09:28:53.808] [7/12 CallTraces] DONE                   in=1h16m38.79209474s
Jun 05 10:23:44 nuc erigon[87450]: [INFO] [06-05|10:23:44.496] [8/12 AccountHistoryIndex] DONE          in=54m50.68689461s
Jun 05 11:23:57 nuc erigon[87450]: [INFO] [06-05|11:23:57.046] [9/12 StorageHistoryIndex] DONE          in=1h0m12.549762117s
Jun 05 11:59:11 nuc erigon[87450]: [INFO] [06-05|11:59:11.868] [10/12 LogIndex] DONE                    in=35m14.821925573s
Jun 05 12:26:11 nuc erigon[87450]: [INFO] [06-05|12:26:11.759] [11/12 TxLookup] DONE                    in=26m59.890847703s
Jun 05 12:28:13 nuc erigon[87450]: [INFO] [06-05|12:28:13.157] [9/12 StorageHistoryIndex] Prune done    in=1m59.998599987s
Jun 05 12:50:25 nuc erigon[87450]: [INFO] [06-05|12:50:25.244] [8/12 AccountHistoryIndex] Prune done    in=22m12.087567395s
Jun 05 13:17:39 nuc erigon[87450]: [INFO] [06-05|13:17:39.817] [4/12 Execution] DONE                    in=18m25.532472159s
Jun 05 13:46:02 nuc erigon[87450]: [INFO] [06-05|13:46:02.584] [5/12 HashState] DONE                    in=28m22.767034598s
Jun 05 14:04:09 nuc erigon[87450]: [INFO] [06-05|14:04:09.089] [6/12 IntermediateHashes] DONE           in=18m6.50480355s
Jun 05 14:36:26 nuc erigon[87450]: [INFO] [06-05|14:36:26.094] [7/12 CallTraces] DONE                   in=32m17.004903009s
Jun 05 14:54:17 nuc erigon[87450]: [INFO] [06-05|14:54:17.458] [8/12 AccountHistoryIndex] DONE          in=17m51.364516027s
Jun 05 15:49:33 nuc erigon[87450]: [INFO] [06-05|15:49:33.311] [9/12 StorageHistoryIndex] DONE          in=55m15.853033727s
Jun 05 16:29:01 nuc erigon[87450]: [INFO] [06-05|16:29:01.357] [10/12 LogIndex] DONE                    in=39m28.045902965s
Jun 05 16:34:33 nuc erigon[87450]: [INFO] [06-05|16:34:33.958] [11/12 TxLookup] DONE                    in=5m32.600785363s
Jun 05 16:35:09 nuc erigon[87450]: [INFO] [06-05|16:35:09.270] [9/12 StorageHistoryIndex] Prune done    in=34.612729467s
Jun 05 16:59:06 nuc erigon[87450]: [INFO] [06-05|16:59:06.241] [8/12 AccountHistoryIndex] Prune done    in=23m56.970341548s
Jun 05 17:27:44 nuc erigon[87450]: [INFO] [06-05|17:27:44.016] [4/12 Execution] DONE                    in=21m49.530837755s
Jun 05 17:47:36 nuc erigon[87450]: [INFO] [06-05|17:47:36.035] [5/12 HashState] DONE                    in=19m52.018822307s
Jun 05 19:40:44 nuc erigon[87450]: [INFO] [06-05|19:40:44.396] [6/12 IntermediateHashes] DONE           in=1h53m8.360629727s

Jun 05 '24 20:06 dylancaponi

@dylancaponi show:

curl http://127.0.0.1:6060/debug/metrics/prometheus
du -hsc datadir/*
mdbx_stat -ef datadir/chaindata

Jun 06 '24 03:06 AskAlexSharov

Please re-open this issue if it still happens with v2.61.0 or later

Jan 16 '25 12:01 yperbasis