
Upgrading from erigon 2.48.1 to 2.5x (currently trying 2.54.0) causes node to stop syncing

wwhchung opened this issue 2 years ago • 16 comments

System information

erigon version 2.54.0-aeec5221

OS & Version: Linux

Commit hash: tag - v2.54.0

Erigon Command (with flags/config): /usr/bin/erigon --config=/data/erigon/erigon.yaml

Consensus Layer: lighthouse Lighthouse v4.5.0-441fc16

Consensus Layer Command (with flags/config): /usr/bin/lighthouse bn --datadir /data/lighthouse --network mainnet --http --execution-endpoint http://127.0.0.1:8551 --execution-jwt /data/erigon/jwtsecret

Chain/Network: mainnet

Expected behaviour

Node syncs properly after the version upgrade

Actual behaviour

Node is no longer syncing, and logs messages like:

[WARN] [11-26|17:06:15.372] Served conn=127.0.0.1:26722 method=engine_forkchoiceUpdatedV2 reqid=1 t=68.793491ms err="unexpected chain cap: 18657009"

Steps to reproduce the behaviour

Full sync on v2.48.1, then upgrade to v2.54.0

Backtrace

N/A

wwhchung avatar Nov 26 '23 17:11 wwhchung

show "integration print_stages"

AskAlexSharov avatar Nov 27 '23 06:11 AskAlexSharov

We're seeing the same issue on our nodes.

Something I noticed on startup of erigon:

[INFO] [11-28|21:56:55.656] [db] chaindata                           sizeLimit=3TB pageSize=4096

When running integration print_stages:

INFO[11-28|22:00:48.990] logging to file system                   log dir=/blockchain/logs file prefix=integration log level=info json=false
INFO[11-28|22:00:48.992] [db] chaindata                           sizeLimit=2TB pageSize=4096
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

 				 stage_at 	 prune_at
Snapshots 			 18846895 	 0
Headers 			 18846895 	 0
BorHeimdall 			 0 		 0
BlockHashes 			 18846895 	 0
Bodies 				 18846895 	 0
Senders 			 18846895 	 0
Execution 			 18846895 	 18846895
Translation 			 0 		 0
HashState 			 18846895 	 0
IntermediateHashes 		 18846895 	 18846895
AccountHistoryIndex 		 18846895 	 0
StorageHistoryIndex 		 18846895 	 0
LogIndex 			 18846895 	 0
CallTraces 			 18846895 	 18846895
TxLookup 			 18846895 	 0
Finish 				 18846895 	 0
--
prune distance:

blocks.v2: blocks=0, segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastMaxTxNum=0->1, lastBlockInSnap=0

sequence: EthTx=2063258825, NonCanonicalTx=63761

in db: first header 1, last header 18952054, first body 1, last body 18952054
--
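A quick way to surface this mismatch is to grep both outputs for sizeLimit. A sketch; the log file location and datadir below are guesses based on the paths in the output above:

# Compare the limit the node opened chaindata with vs. the integration tool's.
grep 'sizeLimit' /blockchain/logs/erigon.log
./build/bin/integration print_stages --datadir /blockchain --chain mainnet 2>&1 | grep 'sizeLimit'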

bretep avatar Nov 28 '23 22:11 bretep

I'm observing the same on v2.55.1. As with @bretep, the sizeLimit in the container logs does not match the one reported by integration print_stages:

docker logs 2>&1 eth_erigon | grep chaindata
[INFO] [12-15|23:43:46.812] Opening Database                         label=chaindata path=/mnt/data/eth_erigon/core/chaindata
[INFO] [12-15|23:43:46.820] [db] chaindata                           sizeLimit=8TB pageSize=4096
/ $ integration print_stages --datadir /mnt/data/eth_erigon/core --chain mainnet
INFO[12-15|23:53:32.297] logging to file system                   log dir=/mnt/data/eth_erigon/core/logs file prefix=integration log level=info json=false
INFO[12-15|23:53:32.298] [db] chaindata                           sizeLimit=2TB pageSize=4096
INFO[12-15|23:53:35.581] [snapshots] Blocks Stat                  blocks=18518k indices=18518k alloc=2.2GB sys=2.3GB
INFO[12-15|23:53:35.584] [bor snapshots] Blocks Stat              blocks=0k indices=0k alloc=2.2GB sys=2.3GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                                 stage_at        prune_at
Snapshots                        18608589        0
Headers                          18794467        0
BorHeimdall                      0               0
BlockHashes                      18794467        0
Bodies                           18608589        0
Senders                          18608589        0
Execution                        18608589        18608589
Translation                      0               0
HashState                        18608589        0
IntermediateHashes               18608589        18608589
AccountHistoryIndex              18608589        0
StorageHistoryIndex              18608589        0
LogIndex                         18608589        0
CallTraces                       18608589        18608589
TxLookup                         18608589        18518000
Finish                           18608589        0
--
prune distance: 

blocks.v2: true, blocks=18517999, segments=18517999, indices=18517999

history.v3: false,  idx steps: 0.00, lastMaxTxNum=0->0, lastBlockInSnap=0

sequence: EthTx=2199760036, NonCanonicalTx=1870330

in db: first header 18518000, last header 18794904, first body 18518000, last body 18794904

eldarbogdanov avatar Dec 15 '23 23:12 eldarbogdanov

What's the solution to this?

B1Q avatar Jan 11 '24 16:01 B1Q

Also seeing this recently. Tried updating, but no success. Currently on the latest v2.56.2, using external Lighthouse CL.

[INFO] [01-15|08:56:50.254] [NewPayload] Handling new payload        height=19011236 hash=0xd7ff5431c668a10b655270ccec26a6b1dd0a1bb6731033fdbc1a706447f18150
[WARN] [01-15|08:56:50.752] [rpc] served                             conn=127.0.0.1:35290 method=engine_forkchoiceUpdatedV2 reqid=1 t=156.596579ms err="unexpected chain cap: 19010820"

antonok-edm avatar Jan 15 '24 09:01 antonok-edm

I just upgraded one of my nodes from v2.46.1 to v2.56.2 and got the same "unexpected chain cap" error described above. I believe the cause is that Erigon was updated while the CL was still running. I had to restore Erigon's data from another node and follow this upgrade procedure instead:

- Stop the CL
- Upgrade Erigon (check the logs and wait until everything has settled down)
- Start the CL

This worked for me, and now Erigon's synced up to head again. Hope this helps someone else.
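For anyone scripting this, a minimal sketch of that sequence, assuming systemd units named erigon and lighthouse (the unit names and binary path are assumptions; adapt to your setup):

# 1. Stop the CL first so it cannot drive forkchoice updates mid-upgrade.
systemctl stop lighthouse
# 2. Stop Erigon, install the new binary, restart, and let it settle.
systemctl stop erigon
install -m 0755 ./build/bin/erigon /usr/bin/erigon   # binary from a fresh build of the new tag
systemctl start erigon
journalctl -fu erigon    # watch until the snapshot phase completes (can take a day+)
# 3. Only then bring the CL back.
systemctl start lighthouse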

keithchew avatar Jan 19 '24 21:01 keithchew

I had tried that (CL and Erigon stopped) and still encountered this error.

wwhchung avatar Jan 19 '24 21:01 wwhchung

Did you wait for Erigon to fully settle down? I waited for it to complete the snapshot phase, which took over a day.

keithchew avatar Jan 19 '24 22:01 keithchew

From integration print_stages, I can see the output below. Previously, with the error, all stages were stuck at 19036913 without progress. With the procedure above, Headers and Execution have moved forward (although the Snapshots stage is still at the previous value):

                                 stage_at        prune_at
Snapshots                        19036913        0
Headers                          19036947        0
BorHeimdall                      0               0
BlockHashes                      19036947        0
Bodies                           19036947        0
Senders                          19036947        0
Execution                        19036947        19036947
Translation                      0               0
HashState                        19036947        0
IntermediateHashes               19036947        19036947
AccountHistoryIndex              19036947        0
StorageHistoryIndex              19036947        0
LogIndex                         19036947        0
CallTraces                       19036947        19036947
TxLookup                         19036947        18946000
Finish                           19036947        0

keithchew avatar Jan 19 '24 22:01 keithchew

Just an update on the above, the snapshot stage eventually kicked in and caught up:

                                 stage_at        prune_at
Snapshots                        19058024        0
Headers                          19058025        0
BorHeimdall                      0               0
BlockHashes                      19058025        0
Bodies                           19058025        0
Senders                          19058025        0
Execution                        19058025        19058025
Translation                      0               0
HashState                        19058025        0
IntermediateHashes               19058025        19058025
AccountHistoryIndex              19058025        0
StorageHistoryIndex              19058025        0
LogIndex                         19058025        0
CallTraces                       19058025        19058025
TxLookup                         19058025        18967160
Finish                           19058025        0
--

keithchew avatar Jan 21 '24 22:01 keithchew

The resolution for me was to start Erigon with --db.size.limit=6000000000000
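For reference, that flag goes on the Erigon command line. A sketch reusing the command from the original report (6000000000000 bytes is 6TB; per the v2.55.0 release notes quoted below, the flag also accepts size strings like 8TB):

/usr/bin/erigon --config=/data/erigon/erigon.yaml --db.size.limit=6000000000000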

bretep avatar Jan 21 '24 22:01 bretep

How do you get this output info?

> Just an update on the above, the snapshot stage eventually kicked in and caught up: […]

wwhchung avatar Jan 21 '24 22:01 wwhchung

@bretep I believe --db.size.limit is another issue, i.e. the breaking change introduced in v2.55.0:

> [v2.55.0](https://github.com/ledgerwatch/erigon/releases/tag/v2.55.0)
> Breaking Change
> Existing nodes whose MDBX page size equals 4kb must add --db.size.limit=8TB flag. Otherwise you will get MDBX_TOO_LARGE error. To check the current page size you can use make db-tools && ./build/bin/mdbx_stat datadir/chaindata
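Spelled out, that page-size check runs from an Erigon source checkout (the datadir path below is an example; substitute your own):

make db-tools
./build/bin/mdbx_stat /data/erigon/chaindata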

@wwhchung As stated above, you can use the integration print_stages command to get that output, e.g.:

./build/bin/integration print_stages --datadir /root/.ethereum --chain mainnet

keithchew avatar Jan 21 '24 22:01 keithchew

+1

zformular avatar Mar 11 '24 02:03 zformular

+1

Same issue. I tried state_stages unwind=100; the log shows:

INFO[03-12|01:51:07.379] [snapshots:download] Blocks Stat blocks=19305k indices=19305k alloc=3.3GB sys=4.4GB
DBUG[03-12|01:51:07.379] [1/12 Snapshots] DONE in=2.138136723s
DBUG[03-12|01:51:07.379] [2/12 BlockHashes] DONE in=18.278µs
DBUG[03-12|01:51:07.379] [3/12 Senders] DONE in=100.382µs
INFO[03-12|01:51:07.379] [4/12 Execution] Blocks execution from=19395808 to=19395908
DBUG[03-12|01:51:14.673] [snapshots.webseed] get from HTTP provider err="webseed.http: Get "https://erigon3-v1-snapshots-mainnet.erigon.network/manifest.txt": tls: failed to verify certificate: x509: certificate signed by unknown authority, host=erigon3-v1-snapshots-mainnet.erigon.network, url=" url=
DBUG[03-12|01:51:15.230] [snapshots.webseed] get from HTTP provider err="webseed.http: Get "https://erigon3-v3-snapshots-mainnet.erigon.network/manifest.txt": tls: failed to verify certificate: x509: certificate signed by unknown authority, host=erigon3-v3-snapshots-mainnet.erigon.network, url=" url=
INFO[03-12|01:51:19.427] [4/12 Execution] Completed on block=19395908
DBUG[03-12|01:51:19.427] [4/12 Execution] DONE in=12.048029851s
INFO[03-12|01:51:19.427] [5/12 HashState] Promoting plain state from=19395808 to=19395908
INFO[03-12|01:51:19.427] [5/12 HashState] Incremental promotion from=19395808 to=19395908 codes=true csbucket=AccountChangeSet
INFO[03-12|01:51:19.542] [5/12 HashState] Incremental promotion from=19395808 to=19395908 codes=false csbucket=AccountChangeSet
INFO[03-12|01:51:19.709] [5/12 HashState] Incremental promotion from=19395808 to=19395908 codes=false csbucket=StorageChangeSet
DBUG[03-12|01:51:20.413] [5/12 HashState] DONE in=985.822563ms
INFO[03-12|01:51:20.413] [6/12 IntermediateHashes] Generating intermediate hashes from=19395808 to=19395908
DBUG[03-12|01:51:20.660] [txpool] Commit written_kb=0 in=3.927648ms
DBUG[03-12|01:51:22.959] [6/12 IntermediateHashes] DONE in=2.546475506s
DBUG[03-12|01:51:23.545] [7/12 CallTraces] DONE in=585.147029ms
DBUG[03-12|01:51:23.916] [8/12 AccountHistoryIndex] DONE in=371.681863ms
DBUG[03-12|01:51:24.583] [9/12 StorageHistoryIndex] DONE in=667.066738ms
DBUG[03-12|01:51:24.902] [snapshots] no metadata yet files=9 list=v1-018400-018500-bodies.seg,v1-018400-018500-headers.seg,v1-018300-018400-bodies.seg,v1-018300-018400-transactions.seg,v1-018300-018400-headers.seg,...
DBUG[03-12|01:51:25.183] [10/12 LogIndex] DONE in=599.86347ms
DBUG[03-12|01:51:25.569] [11/12 TxLookup] DONE in=385.539293ms
DBUG[03-12|01:51:25.572] [12/12 Finish] DONE in=3.554721ms
INFO[03-12|01:51:26.001] [snapshots] Prune Blocks to=19305000 limit=100
DBUG[03-12|01:51:36.091] [txpool] Commit written_kb=0 in=10.929402ms
DBUG[03-12|01:51:38.859] [NewPayload] sending block height=19415836 hash=0x374a0620eef1627420b8a5e2c43e6fd41f02a4e281546c4ad09987f354fd3729
INFO[03-12|01:51:38.859] [NewPayload] Handling new payload height=19415836 hash=0x374a0620eef1627420b8a5e2c43e6fd41f02a4e281546c4ad09987f354fd3729
DBUG[03-12|01:51:38.859] [NewPayload] New payload: need to download parent height=19415836 hash=0x374a0620eef1627420b8a5e2c43e6fd41f02a4e281546c4ad09987f354fd3729 parentHash=0xec6a48eaa55491ab9d87749ae9dccf22f1a6c90f9c763fb685b99ffd6ae4e29c
INFO[03-12|01:51:38.859] [EngineBlockDownloader] Downloading PoS headers... hash=0xec6a48eaa55491ab9d87749ae9dccf22f1a6c90f9c763fb685b99ffd6ae4e29c requestId=0
DBUG[03-12|01:51:38.859] [downloader] Set posAnchor blockHeight=1
DBUG[03-12|01:51:39.918] [NewPayload] got reply payloadStatus="&{Status:SYNCING ValidationError: LatestValidHash: CriticalError:}"
DBUG[03-12|01:51:40.409] [ForkChoiceUpdated] Downloading some other PoS stuff hash=0x374a0620eef1627420b8a5e2c43e6fd41f02a4e281546c4ad09987f354fd3729
DBUG[03-12|01:51:44.902] [snapshots] no metadata yet files=9 list=v1-018400-018500-bodies.seg,v1-018400-018500-headers.seg,v1-018300-018400-bodies.seg,v1-018300-018400-transactions.seg,v1-018300-018400-headers.seg,...
DBUG[03-12|01:51:50.293] [NewPayload] Downloading some other PoS blocks hash=0xb1b09809cc492ff632afc58b77a2890c35ec8d84d6e37f58eb04f6cf6558cc28
DBUG[03-12|01:51:51.506] [txpool] Commit written_kb=0 in=974.341µs
DBUG[03-12|01:51:55.170] [ForkChoiceUpdated] Downloading some other PoS stuff hash=0xb1b09809cc492ff632afc58b77a2890c35ec8d84d6e37f58eb04f6cf6558cc28
DBUG[03-12|01:52:01.517] [NewPayload] Downloading some other PoS blocks hash=0x294a477869c9a86091ba033947a95dfc858c52460e5cc03d3ae546555b3be0b2
DBUG[03-12|01:52:01.776] [ForkChoiceUpdated] Downloading some other PoS stuff hash=0x294a477869c9a86091ba033947a95dfc858c52460e5cc03d3ae546555b3be0b2
DBUG[03-12|01:52:04.902] [snapshots] no metadata yet files=8 list=v1-018400-018500-bodies.seg,v1-018300-018400-bodies.seg,v1-018300-018400-transactions.seg,v1-018300-018400-headers.seg,v1-018900-019000-bodies.seg,...
DBUG[03-12|01:52:05.254] [p2p] Server protocol=67 peers=0 trusted=0 inbound=0
DBUG[03-12|01:52:05.254] [p2p] Dial scheduler protocol=67 peers=0/33 tried=22 static=0
DBUG[03-12|01:52:05.254] [p2p] Discovery table protocol=67 version=v4 len=143 live=6 unsol=281 ips=199 db=30 reval=10 RPC timeout=4 unsolicited reply=43
DBUG[03-12|01:52:05.256] [p2p] Discovery table protocol=68 version=v4 len=131 live=7 unsol=0 ips=184 db=30 reval=10 RPC timeout=2
DBUG[03-12|01:52:05.256] [p2p] Server protocol=68 peers=0 trusted=0 inbound=0 EOF=2
DBUG[03-12|01:52:05.256] [p2p] Dial scheduler protocol=68 peers=0/33 tried=21 static=0
DBUG[03-12|01:52:06.929] [txpool] Commit written_kb=0 in=638.965µs
DBUG[03-12|01:52:13.494] [NewPayload] Downloading some other PoS blocks hash=0x1d07375a7f91930877f51528ae584622b425514c76cad48b58d4ca38a56718d3
DBUG[03-12|01:52:13.963] [ForkChoiceUpdated] Downloading some other PoS stuff hash=0x1d07375a7f91930877f51528ae584622b425514c76cad48b58d4ca38a56718d3
DBUG[03-12|01:52:22.354] [txpool] Commit written_kb=0 in=681.311µs
DBUG[03-12|01:52:22.849] [ForkChoiceUpdated] Downloading some other PoS stuff hash=0x1d07375a7f91930877f51528ae584622b425514c76cad48b58d4ca38a56718d3
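For reference, the unwind attempt mentioned above would look something like this (a sketch; the datadir is an assumption and the flag spelling follows the integration tool's --unwind option):

./build/bin/integration state_stages --datadir /data/erigon --chain mainnet --unwind=100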

zformular avatar Mar 12 '24 01:03 zformular

> Did you wait for Erigon to fully settle down? I waited for it to complete the snapshot phase, which took over a day.

That did it. Thanks. Upgrade successful.

wwhchung avatar Mar 12 '24 02:03 wwhchung

Please re-open this issue if it still happens with v2.59.2 or later.

yperbasis avatar Apr 09 '24 10:04 yperbasis