l2geth hangs indefinitely when restarting during initial sync

Open kaladinlight opened this issue 2 years ago • 12 comments

Describe the bug

I have been attempting to sync an archive l2geth node for a long time now and have been unsuccessful due to sync hanging on restart of the node.

To Reproduce

I don't have a way to reliably reproduce this. It seems to happen when shutdown doesn't store the correct state and/or startup doesn't recover state accurately.

Expected behavior

l2geth picks up and continues sync gracefully.

Screenshots

image image image image

System Specs:

  • OS: alpine linux
  • Package Version (or commit hash): ethereumoptimism/l2geth:0.5.31

Additional context

My thought is this has something to do with the fact that the block index was rolled back as described here:

INFO [01-20|16:48:30.706] Found latest index                       index=46973939
INFO [01-20|16:48:30.709] Block not found, resetting index         new=46973938 old=46973939
INFO [01-20|16:48:30.709] Found latest queue index                 queue-index=228857
INFO [01-20|16:48:30.731] Found correct staring queue index        queue-index=228857

There is also a corresponding short circuit in the resultLoop (pictured above) that prevents the block from ever being written to the chain. As a result, the worker mainLoop hangs indefinitely waiting to read from chainHeadCh, as shown by the absence of the "Miner got new head" debug log.

kaladinlight avatar Jan 20 '23 19:01 kaladinlight

Have you solved this problem?

zhaoxiangjunupi avatar Jan 29 '23 10:01 zhaoxiangjunupi

> Have you solved this problem?

Negative. My assumption is that l2geth is all but abandoned in favor of the bedrock upgrade, and I'm just waiting on snapshots for legacy geth when that time comes. Considering op-geth is forked from a more recent version of geth, hopefully there will be more node stability with bedrock.

kaladinlight avatar Jan 30 '23 15:01 kaladinlight

Yeah sorry about that, we're all hands on deck pushing out the Bedrock release. l2geth is based on an extremely old version of Geth and has lots of stability issues. I'm taking a quick look at this to see if I can figure out what's going on.

I think you're correct that likely what's happening is the following:

  • The block was written to DB but not added to the chain
  • The reset logic isn't detecting this case and removing the block from the DB
  • The short-circuit prevents the thing from starting again
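The three steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not l2geth's actual code: the `chain` type, `hasBlock`, `handleResult`, and `newStuckChain` are hypothetical names, and the block number and hash are made up. It shows why a block that is in the DB but not on the chain is skipped as a duplicate, so no chain-head event ever fires, and why dropping the short circuit lets the head advance.

```go
package main

import "fmt"

// block is a toy stand-in for a sealed L2 block (hypothetical type).
type block struct {
	number uint64
	hash   string
}

// chain is a heavily simplified, hypothetical model of node state:
// db holds every persisted block, head is the canonical tip.
type chain struct {
	db   map[string]block
	head uint64
}

// hasBlock reports whether the block is in the DB, regardless of
// whether it ever became part of the canonical chain.
func (c *chain) hasBlock(b block) bool {
	stored, ok := c.db[b.hash]
	return ok && stored.number == b.number
}

// handleResult models one iteration of a result loop. It returns true
// when the block is written and a chain-head event would be emitted.
func (c *chain) handleResult(b block, shortCircuit bool) bool {
	if shortCircuit && c.hasBlock(b) {
		// Treated as a duplicate result: nothing is written,
		// so nothing wakes up a loop waiting on the head event.
		return false
	}
	c.db[b.hash] = b
	c.head = b.number
	return true
}

// newStuckChain builds the suspected post-restart state: block 101 was
// written to the DB before shutdown, but the head is still at 100.
func newStuckChain() (*chain, block) {
	b := block{number: 101, hash: "0xabc"}
	c := &chain{db: map[string]block{b.hash: b}, head: 100}
	return c, b
}

func main() {
	for _, shortCircuit := range []bool{true, false} {
		c, b := newStuckChain()
		emitted := c.handleResult(b, shortCircuit)
		fmt.Printf("shortCircuit=%v headEvent=%v head=%d\n", shortCircuit, emitted, c.head)
	}
}
```

With the short circuit enabled, the model never emits a head event and the head stays at 100. With it disabled, the block is rewritten and the head moves to 101.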

smartcontracts avatar Feb 06 '23 21:02 smartcontracts

I'm going to create a canary release with an experimental fix that removes the short circuit. Just a sec...

smartcontracts avatar Feb 06 '23 22:02 smartcontracts

> I'm going to create a canary release with an experimental fix that removes the short circuit. Just a sec...

Wow, rock on! Was not expecting much response here so I appreciate you taking the time to attempt a quick fix. :crossed_fingers:

And looking forward to the bedrock upgrade!

kaladinlight avatar Feb 06 '23 22:02 kaladinlight

Sorry for the delay here, got caught up with some other things. Can you try out this prerelease image and see if it fixes the deadlock? prerelease-0.0.0-rc-l2g-init-deadlock

smartcontracts avatar Feb 06 '23 23:02 smartcontracts

> Sorry for the delay here, got caught up with some other things. Can you try out this prerelease image and see if it fixes the deadlock? prerelease-0.0.0-rc-l2g-init-deadlock

No worries. I will have to spin things back up, as I had already torn them down to save some money, but I will let you know when I can confirm the fix. Thanks again!

kaladinlight avatar Feb 07 '23 15:02 kaladinlight

Ok!

smartcontracts avatar Feb 07 '23 16:02 smartcontracts

Sure enough, block sync picks back up!

The only thing that catches my eye is Block not found, resetting index new=1547208 old=1547209, but then the first block synced is the old index (not new) Applying transaction to tip index=1547209. This makes sense enough given the change, but I am just wondering if there will be any possible missed data from 1547208. Can we be confident enough that all data was indexed from 1547208 if the stored block index had been incremented to 1547209 before shutdown? I am not familiar enough with geth internals to know if there is any sort of resync/reconfirm latest block on restart.

image

kaladinlight avatar Feb 08 '23 00:02 kaladinlight

Also, looking at go-ethereum v1.10.26, the check is still in the codebase: https://github.com/ethereum/go-ethereum/blob/v1.10.26/miner/worker.go#L692.

This makes me believe there could still be a runtime consensus edge case around block resubmit that this check handles, and that removing it would drop that handling. A potential alternative would be a fix during shutdown cleanup, or more robust recovery logic. The block resubmit edge case may not be relevant with the rollup architecture and syncing from the data-transport-layer, but I figured I would raise the thought.

kaladinlight avatar Feb 08 '23 01:02 kaladinlight

It looks like this issue is solved, right? :)

> It looks like this issue is solved, right? :)

Yes, the release candidate image with the fix, prerelease-0.0.0-rc-l2g-init-deadlock, does fix the deadlock. I just had some follow-up edge case questions, but I haven't noticed any issues at this time.

kaladinlight avatar Feb 10 '23 17:02 kaladinlight

I am having this same issue, in my case caused by not having enough disk space. I have bumped the disk size, but the node doesn't resume syncing. What is the source branch of the image you posted above, @smartcontracts? I am building l2geth from source and need to rebase onto the branch with the fix.

--edit--

Found it here, thank you very much. Using this branch fixed it and l2geth is progressing again.

giovannirco avatar Mar 19 '23 14:03 giovannirco

I am running the docker image prerelease-0.0.0-rc-l2g-init-deadlock on Goerli and the issue seems to persist. The sync gets stuck at block height 0x3df827.

taxmeifyoucan avatar Mar 20 '23 17:03 taxmeifyoucan