go-ethereum I got "Head state missing, repairing" and the node does sync blocks any more

Hi, when I restart my geth node, I got the warn message and the node does not sync any more. It has been many hours.

WARN [05-30|13:04:36.330] Head state missing, repairing            number=14,463,337 hash=4f01d4..959961 snaproot=980390..e6434b

The whole message after the node start:

INFO [05-30|13:04:14.312] Starting Geth on Ethereum mainnet...
INFO [05-30|13:04:14.314] Maximum peer count                       ETH=50 LES=0 total=50
INFO [05-30|13:04:14.314] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
INFO [05-30|13:04:14.316] Set global gas cap                       cap=50,000,000
INFO [05-30|13:04:14.330] Allocated trie memory caches             clean=307.00MiB dirty=512.00MiB
INFO [05-30|13:04:14.330] Allocated cache and file handles         database=/data/eth_main/geth/chaindata cache=1024.00MiB handles=32767
INFO [05-30|13:04:23.073] Opened ancient database                  database=/data/eth_main/geth/chaindata/ancient readonly=false
INFO [05-30|13:04:23.249] Persisted trie from memory database      nodes=12356 size=1.78MiB time=25.823995ms gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [05-30|13:04:23.250] Initialised chain configuration          config="{ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Byzantium: 4370000 Constantinople: 7280000 Petersburg: 7280000 Istanbul: 9069000, Muir Glacier: 9200000, Berlin: 12244000, London: 12965000, Arrow Glacier: 13773000, MergeFork: <nil>, Terminal TD: <nil>, Engine: ethash}"
INFO [05-30|13:04:35.570] Disk storage enabled for ethash caches   dir=/data/eth_main/geth/ethash count=3
INFO [05-30|13:04:35.570] Disk storage enabled for ethash DAGs     dir=/home/java-tron/.ethash    count=2
INFO [05-30|13:04:35.574] Initialising Ethereum protocol           network=1 dbversion=8
INFO [05-30|13:04:36.327] Loaded most recent local header          number=14,463,337 hash=4f01d4..959961 td=44,767,534,003,696,042,413,376 age=2mo4d11h
INFO [05-30|13:04:36.327] Loaded most recent local full block      number=14,463,337 hash=4f01d4..959961 td=44,767,534,003,696,042,413,376 age=2mo4d11h
INFO [05-30|13:04:36.327] Loaded most recent local fast block      number=14,463,337 hash=4f01d4..959961 td=44,767,534,003,696,042,413,376 age=2mo4d11h
WARN [05-30|13:04:36.330] Head state missing, repairing            number=14,463,337 hash=4f01d4..959961 snaproot=980390..e6434b

When I stop the node using kill pid , I got the following message:

INFO [05-30|11:45:18.054] Got interrupt, shutting down...
INFO [05-30|11:45:18.059] HTTP server stopped                      endpoint=[::]:8545
INFO [05-30|11:45:18.059] IPC endpoint closed                      url=/data/eth_main/geth.ipc
INFO [05-30|11:45:18.747] Downloader queue stats                   receiptTasks=0 blockTasks=33407 itemSize=82.43KiB throttle=3181
WARN [05-30|11:45:18.747] Synchronisation failed, retrying         err="content processing canceled (requested)"
INFO [05-30|11:45:18.749] Ethereum protocol stopped
INFO [05-30|11:45:18.749] Transaction pool stopped
INFO [05-30|11:45:18.749] Waiting background transaction indexer to exit
WARN [05-30|11:45:18.768] Already shutting down, interrupt more to panic. times=9
ERROR[05-30|11:45:18.791] Failed to journal state snapshot         err="snapshot [0xaae82e4b2b4223051bfa8f75f11434d712e0c1227d6c2bbd395637682fe03833] missing"
INFO [05-30|11:45:18.792] Writing cached state to disk             block=14,463,337 hash=4f01d4..959961 root=aae82e..e03833
WARN [05-30|11:45:19.292] Already shutting down, interrupt more to panic. times=8
WARN [05-30|11:45:25.001] Already shutting down, interrupt more to panic. times=7
WARN [05-30|11:45:25.815] Already shutting down, interrupt more to panic. times=6
WARN [05-30|11:45:26.867] Already shutting down, interrupt more to panic. times=5
WARN [05-30|11:45:27.410] Already shutting down, interrupt more to panic. times=4
WARN [05-30|11:45:27.929] Already shutting down, interrupt more to panic. times=3
WARN [05-30|11:45:28.428] Already shutting down, interrupt more to panic. times=2
WARN [05-30|11:45:28.883] Already shutting down, interrupt more to panic. times=1
INFO [05-30|11:45:29.346] Looking for peers                        peercount=0 tried=114 static=0
panic: boom

Before I restart the node, it also has the repairing log, but it just continue minutes and go on.

WARN [05-26|17:17:29.280] Head state missing, repairing            number=13,773,036 hash=feb273..d9e13d snaproot=a1e813..4813c0
INFO [05-26|17:20:04.406] Loaded most recent local header          number=13,773,036 hash=feb273..d9e13d td=36,206,751,599,115,524,359,527 age=5mo2w2d
INFO [05-26|17:20:04.406] Loaded most recent local full block      number=13,743,348 hash=17974a..337279 td=35,855,860,751,110,670,126,484 age=5mo3w1d
INFO [05-26|17:20:04.406] Loaded most recent local fast block      number=13,773,036 hash=feb273..d9e13d td=36,206,751,599,115,524,359,527 age=5mo2w2d
WARN [05-26|17:20:04.407] Enabling snapshot recovery               chainhead=13,743,348 diskbase=13,772,860
WARN [05-26|17:20:04.546] Snapshot is not continuous with chain    snaproot=52c356..2fac0d chainroot=a3aecc..32d9d7
INFO [05-26|17:20:04.546] Loaded local transaction journal         transactions=0 dropped=0
INFO [05-26|17:20:04.546] Regenerated local transaction journal    transactions=0 accounts=0
INFO [05-26|17:20:04.550] Gasprice oracle is ignoring threshold set threshold=2
WARN [05-26|17:20:04.553] Unclean shutdown detected                booted=2021-11-14T17:37:46+0800 age=6mo1w5d
INFO [05-26|17:20:04.553] Starting peer-to-peer node               instance=Geth/ddblock/v1.10.9-stable-eae3b194/linux-amd64/go1.17
INFO [05-26|17:20:04.835] New local node record                    seq=1,634,206,267,554 id=e541d3a5d36734f6 ip=127.0.0.1 udp=30303 tcp=30303

So, how can I get the node sync again? And why I got the error?

The start command is

nohup geth --mainnet --syncmode "full"  --datadir /data/eth_main --cache 2048 --identity "ddblock"  --http  --http.api eth,net,web3  --http.corsdomain "*" --http.port 8545 --port 30303 --http.addr 0.0.0.0  --http.vhosts "*" >> eth_main.out 2>&1 &

I used the geth-linux-amd64-1.10.17-25c9b49f and also tried to update to geth-linux-amd64-1.10.18-de23cf91, but it does not work.

May 30 '22 07:05 shiziwen

After about 20 hours, the node sync from number 0.

WARN [05-30|11:45:32.608] Head state missing, repairing            number=14,767,004 hash=6371c1..2fb12a sna
proot=1c3393..736d02
INFO [05-31|08:15:42.363] Loaded most recent local header          number=0          hash=d4e567..cb8fa3 td=
17,179,869,184                 age=53y2mo3d
INFO [05-31|08:15:42.363] Loaded most recent local full block      number=0          hash=d4e567..cb8fa3 td=
17,179,869,184                 age=53y2mo3d
INFO [05-31|08:15:42.363] Loaded most recent local fast block      number=0          hash=d4e567..cb8fa3 td=
17,179,869,184                 age=53y2mo3d
WARN [05-31|08:15:42.748] Enabling snapshot recovery               chainhead=0 diskbase=14,771,503
WARN [05-31|08:15:42.969] Snapshot is not continuous with chain    snaproot=60290b..bb096e chainroot=d7f897.
.0f0544
INFO [05-31|08:15:42.977] Loaded local transaction journal         transactions=0 dropped=0
INFO [05-31|08:15:42.978] Regenerated local transaction journal    transactions=0 accounts=0
INFO [05-31|08:15:42.979] Gasprice oracle is ignoring threshold set threshold=2
WARN [05-31|08:15:42.980] Unclean shutdown detected                booted=2021-10-14T18:14:34+0800 age=7mo2w
4d
WARN [05-31|08:15:42.980] Unclean shutdown detected                booted=2022-05-30T11:25:13+0800 age=20h50
m29s
WARN [05-31|08:15:42.980] Unclean shutdown detected                booted=2022-05-30T11:43:15+0800 age=20h32
m27s
INFO [05-31|08:15:42.983] Starting peer-to-peer node               instance=Geth/ddblock/v1.10.17-stable-25c
9b49f/linux-amd64/go1.18
INFO [05-31|08:15:43.173] New local node record                    seq=1,634,206,474,731 id=9bb93ce38e8e2e21
 ip=127.0.0.1 udp=30303 tcp=30303
INFO [05-31|08:15:43.174] Started P2P networking                   self=enode://e39a1c7cccece4511de8962277fb
4bbd9fdc55bf8cc75b23521846633d7fc320065e5a4d1bb85fb12073c510c7e6c9e0f84ad988cb3a84063e32f992ca8fa09b@127.0.0
.1:30303
INFO [05-31|08:15:43.176] IPC endpoint opened                      url=/data/eth_main/geth.ipc
INFO [05-31|08:15:43.176] HTTP server started                      endpoint=[::]:8545 auth=false prefix= cor
s=* vhosts=*
INFO [05-31|08:15:47.023] New local node record                    seq=1,634,206,474,732 id=9bb93ce38e8e2e21
 ip=47.253.61.22 udp=30303 tcp=30303
INFO [05-31|08:15:54.015] Looking for peers                        peercount=0 tried=12 static=0
INFO [05-31|08:16:04.021] Looking for peers                        peercount=0 tried=24 static=0
INFO [05-31|08:16:14.682] Looking for peers                        peercount=0 tried=29 static=0
INFO [05-31|08:16:24.682] Looking for peers                        peercount=0 tried=38 static=0
INFO [05-31|08:16:24.730] Block synchronisation started
INFO [05-31|08:16:25.166] Downloader queue stats                   receiptTasks=0 blockTasks=189 itemSize=11
1.09B throttle=8192
INFO [05-31|08:16:25.983] Imported new chain segment               blocks=2 txs=0 mgas=0.000 elapsed=817.163
ms mgasps=0.000 number=2          hash=b495a1..4698c9 age=6y11mo6d  dirty=4.65KiB
INFO [05-31|08:16:25.986] Indexed transactions                     blocks=3 txs=0 tail=0 elapsed="109.398µs"
INFO [05-31|08:16:30.086] Imported new chain segment               blocks=2048 txs=0 mgas=0.000 elapsed=4.08
8s      mgasps=0.000 number=2050       hash=2c83bc..497041 age=6y11mo6d  dirty=485.39KiB
INFO [05-31|08:16:32.645] Imported new chain segment               blocks=2048 txs=0 mgas=0.000 elapsed=2.54
9s      mgasps=0.000 number=4098       hash=3a497e..040fe3 age=6y11mo6d  dirty=576.57KiB
INFO [05-31|08:16:34.710] Imported new chain segment               blocks=2048 txs=0 mgas=0.000 elapsed=2.05
5s      mgasps=0.000 number=6146       hash=1d6dfe..771ede age=6y11mo6d  dirty=593.28KiB

May 31 '22 02:05 shiziwen

These logs are a bit strange, they are not in order and they seem to be missing Geth runs from in between them. It's extremely hard to see what's wrong when we don't have the full picture of what you've been doing.

All in all Geth really hates it if it's killed forcefully, like you did in one of your logs. Such terminations lead to state losses, which are in general recoverable, but if you kill it again during recovery, things just get worse.

If you have all the logs, starting from a healthy point going all the way down to the last failure of a rewind to genesis, we might be able to make heads or tails of it, but in the current case we're missing too much information.

Jun 16 '22 08:06 karalabe

I am having the same problem when the kubernetes pod needs to restart. I can't even upgrade versions without it having to repair the chain.

I tried to increase the Graceperiod for the pod, but it still kills itself off and has to come up in repair mode.

Feb 28 '24 14:02 Tizull