erigon Bug on attempting to restart blockchain sync

System information

Erigon version: downloaded 2 days ago

OS & Version: Windows server 2012 R2 in a Hyper-V VM

Commit hash :

Expected behaviour

resumption of blockchain sync

Actual behaviour

INFO[08-15|15:54:39.586] Build info git_branch=dev el git_tag=v2021.10.03-1670-g52987a56f git_commit=52987a56f30f3e784e848fae2445fd 03af903f80 INFO[08-15|15:54:39.596] Starting Erigon on Ethereum mainnet... INFO[08-15|15:54:39.607] Maximum peer count ETH=100 total= 100 INFO[08-15|15:54:39.612] starting HTTP APIs APIs=eth,erigo n,engine INFO[08-15|15:54:39.614] torrent verbosity level=WRN INFO[08-15|15:54:41.747] Set global gas cap cap=50000000 INFO[08-15|15:54:42.052] Opening Database label=chaindat a path=D:\Erigon\chaindata INFO[08-15|15:54:42.278] Initialised chain configuration config="{Chain ID: 1, Homestead: 1150000, DAO: 1920000, DAO Support: true, Tangerine Whistle: 2 463000, Spurious Dragon: 2675000, Byzantium: 4370000, Constantinople: 7280000, P etersburg: 7280000, Istanbul: 9069000, Muir Glacier: 9200000, Berlin: 12244000, London: 12965000, Arrow Glacier: 13773000, Gray Glacier: 15050000, Terminal Tota l Difficulty: 58750000000000000000000, Merge Netsplit: , Engine: ethash}" g enesis=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3 INFO[08-15|15:54:42.283] Effective prune_flags= s napshot_flags="--snapshots=true" history.v2=false INFO[08-15|15:54:52.306] Initialising Ethereum protocol network=1 INFO[08-15|15:54:52.311] Disk storage enabled for ethash DAGs dir=D:\Erigon \ethash-dags count=2 INFO[08-15|15:54:52.473] Starting private RPC server on=127.0.0.1:9 090 INFO[08-15|15:54:52.477] new subscription to logs established INFO[08-15|15:54:52.484] rpc filters: subscribing to Erigon events EROR[08-15|15:54:52.494] catch panic err="runtime e rror: invalid memory address or nil pointer dereference" stack="[main.go:29 pani c.go:884 panic.go:260 signal_windows.go:255 index.go:149 block_snapshots.go:179 config.go:245 block_snapshots.go:286 config.go:243 backend.go:566 node.go:115 ma in.go:58 app.go:526 app.go:286 main.go:34 proc.go:250 asm_amd64.s:1594]"

Steps to reproduce the behaviour

First run was successfully syncing after executing this: C:\Users\Blockchain\Desktop>C:\Users\Blockchain\erigon\build\bin\erigon.exe --datadir D:\Erigon

I had to restart the VM before the sync completed, so I hit CTRL-C to stop Erigon, which blew up spectacularly in a bunch of broken file handles. I restarted the machine and tried to execute the same as above, but now it gives the output I indicated further above.

Removing all the data and starting over worked. I assume that there'd be a similar outcome if the VM crashed. Is there no fault tolerance to Erigon? What's the correct way to stop Erigon?

Backtrace

[backtrace]

Aug 15 '22 20:08 Sean4572435243

it's safe to kill erigon at any time
" bunch of broken file handles. " - what was the error message?
Fixed "nil pointer" problem by: https://github.com/ledgerwatch/erigon/pull/5070

Aug 16 '22 01:08 AskAlexSharov

I had the same problem after starting syncing last night. I was running /build/bin/erigon --prune htrc I had to stop syncing because there was no more seeder to the snapshots

Aug 16 '22 11:08 louisguthmannStarkWare

I didn't think to grab the exact error text, apologies, but when I hit CTRL-C, the cmd window spewed a few dozen 'invalid file handle' errors with no more description.

Thanks for the quick bug fix and for being so interactive. Amazing team :)

Aug 16 '22 13:08 Sean4572435243

In the wmake file, still have a problem with

-tags nosqlite,noboltdb

not recognizing noboltdb as a second parameter. I originally thought having multiple -tags parameters solved it, but I see now that it ignores the first -tags and the build incorporates the sqllite component. Don't have a workaround

Aug 16 '22 13:08 Sean4572435243

@Sean4572435243

I've been able to partially reproduce the issue. In particular hitting CTRL+C during snapshot downloads I also get a bunch of errors related to invalid file handles as :

WARN[08-17|08:26:06.921] "v1-006000-006500-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.928] "v1-012500-013000-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.931] "v1-009500-010000-headers.seg": error closing storage: FlushFileBuffers: Incorrect function.
WARN[08-17|08:26:06.933] "v1-002000-002500-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.942] "v1-009000-009500-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.953] "v1-004000-004500-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.953] "v1-007500-008000-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.953] "v1-000000-000500-transactions.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.953] "v1-010000-010500-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.953] "v1-005000-005500-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.957] "v1-010500-011000-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.957] "v1-003500-004000-transactions.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.960] "v1-002500-003000-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.961] "v1-004500-005000-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.990] "v1-004000-004500-transactions.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.991] "v1-004500-005000-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:06.993] "v1-010500-011000-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.165] "v1-005500-006000-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.176] "v1-011500-012000-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.186] "v1-002500-003000-transactions.seg": error closing storage: FlushFileBuffers: Incorrect function.
WARN[08-17|08:26:07.196] "v1-007000-007500-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.196] "v1-014500-015000-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.212] "v1-001500-002000-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.223] "v1-014000-014500-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.224] "v1-011000-011500-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.
WARN[08-17|08:26:07.226] "v1-008000-008500-headers.seg": error closing storage: FlushFileBuffers: The handle is invalid.

and Erigon seem not close cleanly.

BUT restarting immedately after it does not produce any error and the download resumes correctly.

I'd really appreciate if you could answer some questions:

Your guest OS (Windows 2012 R2) is quite old and is not supported by Microsoft anymore. Any chance you can test on a newer OS ? Or better ...
Can you test on a physical machine (with up-to-date OS) so we can remove one possible point of failure (virtualization) ?
Are you running Erigon from a command prompt or a powershell prompt ? If the latter which is your PowerShell version ? (you can check by $PSVersionTable)
Your virtualized environment relies on directly attached storage or you're using some NAS/SAN ?
Can you fetch the latest devel branch and rebuild erigon (your actual build is a few commits behind)

Thank you

Aug 17 '22 07:08 AndreaLanfranchi

thank you. "error closing storage: FlushFileBuffers: The handle is invalid." is much more useful than "handle is invalid".

Aug 17 '22 09:08 AskAlexSharov

@AndreaLanfranchi "v1-006000-006500-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid. - does this file exists?

Aug 17 '22 09:08 AskAlexSharov

looks similar to https://github.com/edsrzf/mmap-go/issues/22 but we using [email protected] where this problem solved...

Aug 17 '22 09:08 AskAlexSharov

@AndreaLanfranchi "v1-006000-006500-bodies.seg": error closing storage: FlushFileBuffers: The handle is invalid. - does this file exists?

@AskAlexSharov yes it does.

Please also note that after CTRL+C erigon does not seem to close cleanly. I generally used to see log messages about "Closing database ..." but now after the errors we get immediately to the prompt

Aug 17 '22 10:08 AndreaLanfranchi

Getting the same behavior on a resumed initial snapshot mainnet sync on 2022.08.02.

Aug 18 '22 04:08 3nprob

~Seems like erigon is leaving around corrupted files it can't handle.~

~Did not dig deeper but perhaps this helps in debugging and unblocking: Running the process with strace erigon ... | grep -E 'snapshots/' will output which files are being read by the process. Then, if the last line before the process dies is:~

openat(AT_FDCWD, "/mydir/snapshots/v1-007000-007500-transactions.seg.torrent")

~, I would delete /mydir/snapshots/v1-007000-007500-transactions.* and try again.~

~It does also try to open the corresponding .idx files, none of which exist, but I guess that could be expected behavior?~

~After a couple of times (I ended up removing ~10% of the 200GB synced so far, with a mix of headers/bodies/transactions), the process continues startup and snapshot sync pick up again, reporting 95% progress (which seems sus considering that more than that was removed, will see what happens when it completes).~

Aug 18 '22 05:08 3nprob

I solved it but redownloading and installing. The problem is that --prune htrc (I don't recall the ultimate pruning version) stopped working and it worked when I replaced with the full text version

Aug 18 '22 05:08 louisguthmannStarkWare

I solved it but redownloading and installing. The problem is that --prune htrc (I don't recall the ultimate pruning version) stopped working and it worked when I replaced with the full text version

@enriavil1 hi, please help to check if comment above is related to https://github.com/ledgerwatch/erigon/pull/5065 thank you.

Aug 18 '22 05:08 AskAlexSharov

I encountered the same issue with v2022.08.02, rolling back to v2022.07.04 resolved it for me. Let me know if there's any other info I can provide that would be helpful

Aug 18 '22 06:08 alavertu

~Update: I just exited the ongoing resumed sync (^C in the tty, so SIGINT) which yielded the same error (identical stack trace to here and #5096) when starting up again. Downgrading to 202208.01 instead and trying again on the same datadir, it seems to proceed like normal, without any need to delete files like I needed to get 202208.02 to recover.~

~So def looks like a regression in 202208.02.~

~So far we only saw this on mainnet, not on goerli.~

Aug 18 '22 06:08 3nprob

Fixed "nil pointer" problem by: fix nil pointer at startup when not all indices available yet #5070

Apologies for the noise, looks like this is fixed in develop already and should be included in next release.

When switching to current develop, erigon starts up and proceeds fetching the missing ~20GB so looks all good in the hood.

Aug 18 '22 07:08 3nprob

@AskAlexSharov PR #5075 should have solved the pruning problem

Aug 18 '22 09:08 enriavil1

PS if you someone started syncing before my PR and stop erigon and try syncing again then it will show an error saying the prune flag has been changed.

Aug 18 '22 14:08 enriavil1

This issue is stale because it has been open for 40 days with no activity. Remove stale label or comment, or this will be closed in 7 days.

Sep 29 '22 04:09 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

Oct 08 '22 03:10 github-actions[bot]

erigon erigon copied to clipboard

Bug on attempting to restart blockchain sync

System information

Expected behaviour

Actual behaviour

Steps to reproduce the behaviour

Backtrace

erigon
erigon copied to clipboard