
Intent to Implement: Recovery of Missing Logs

Open wjmelements opened this issue 3 years ago • 9 comments

Rationale

Why should this feature exist?

Minor database corruption should be automatically recoverable without resyncing.

What are the use-cases?

There is an 831-block gap between the receipts in my ancients and the receipts in leveldb. This prevents freezing.

Sep 01 17:02:00 pine geth[3250214]: ERROR[09-01|17:02:00.292] Error in block freeze operation          err="block receipts missing, can't freeze block 14191561"

But the gap is from so long ago that I don't want to resync if I don't have to. To my knowledge, geth removedb is the official recommended fix for this problem.

Implementation

Do you have ideas regarding the implementation of this feature?

Yes. I plan to detect the missing logs in ReadReceiptsRLP. When missing logs are detected, a goroutine will be started to initiate recovery. The boundaries of the missing logs will be found with a binary search and then inserted with InsertReceiptChain. I believe I will have to modify Downloader to have an end block or make a separate downloader for this.
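The boundary search described above can be sketched as a pair of binary searches around a block already known to be missing (e.g. from the freezer error). The `hasReceipts` predicate and `findGap` helper below are hypothetical stand-ins for the actual rawdb lookup; this is a minimal sketch, not the real implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// hasReceipts is a hypothetical stand-in for a rawdb receipt lookup
// (e.g. checking whether ReadReceiptsRLP returns non-empty data).
// Here it simulates a gap covering blocks [1000, 1830].
func hasReceipts(n uint64) bool {
	return n < 1000 || n > 1830
}

// findGap locates the boundaries of a contiguous missing-receipt range
// around a block already known to be missing, using two binary searches.
// lo and hi bound the search (e.g. frozen tail and chain head).
func findGap(lo, known, hi uint64) (uint64, uint64) {
	// First missing block in [lo, known]: the predicate !hasReceipts
	// flips from false to true exactly once, so sort.Search applies.
	i := sort.Search(int(known-lo+1), func(i int) bool {
		return !hasReceipts(lo + uint64(i))
	})
	first := lo + uint64(i)

	// Last missing block in [known, hi]: hasReceipts flips from
	// false to true at the first present block past the gap.
	j := sort.Search(int(hi-known+1), func(i int) bool {
		return hasReceipts(known + uint64(i))
	})
	last := known + uint64(j) - 1
	return first, last
}

func main() {
	first, last := findGap(0, 1500, 100000)
	fmt.Println(first, last) // 1000 1830
}
```

Once the boundaries are known, the recovered receipts for [first, last] would be handed to InsertReceiptChain as the issue describes.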

Are you willing to implement this feature?

Yes, and I have started an implementation. I am looking for early feedback on the Downloader approach, so I don't accidentally break my reproducing environment.

wjmelements avatar Aug 26 '22 18:08 wjmelements

Besides ancients being kept behind, the log gap also causes a problem when syncing deposits from Eth1 for Eth2.

wjmelements avatar Aug 26 '22 21:08 wjmelements

The trie healer in snap sync doesn't heal receipts. Yet another approach to solving the issue might augment the healer such that it can receive requests to heal receipts. I am not currently pursuing this approach because it seems that the healer is fairly specialized at healing the trie.

wjmelements avatar Aug 29 '22 18:08 wjmelements

To work around the import dependency graph, the precise error I am detecting in core/rawdb will likely need to be propagated upward to its consumers, some of which will report it to the recovery system. So an error return will need to be added to ReadReceiptsRLP and its consumers. This will slightly increase the size of my changeset, mostly in the number of files touched.

wjmelements avatar Aug 29 '22 22:08 wjmelements

I'm unsure this is a good idea. If the database is corrupted, something went horribly wrong. It should never get corrupted.

Adding mechanisms to detect and correctly fix faults will be a huge effort to create and maintain and they will probably never completely cover all the corner cases. In these cases we really should be spending time to figure out what happened and why the gap exists and fix the underlying issue, not the symptom.

E.g. ReadReceiptsRLP can also be used, AFAIK, by the RPC to serve requests, where you can actually request non-existent data. Or perhaps already pruned sidechains, or perhaps in the future pruned legacy chain segments. In all these cases, you don't want to trigger a fixup. This is just an example showing that correctly detecting that data is missing, and should not be, is already an interesting challenge.

karalabe avatar Aug 30 '22 04:08 karalabe

where you can actually request non existing stuff. Or perhaps already pruned sidechains, or perhaps in the future pruned legacy chain segments. In all these cases, you don't want to trigger a fixup

I think I am handling those cases by only triggering the fixup when the headers in the range are both known and canonical.
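The guard described here can be sketched as a check against two hypothetical lookups, standing in for rawdb's header and canonical-hash reads: a repair is only approved when the header is stored locally and its hash matches the canonical mapping for that number. The maps, names, and `shouldRepair` function are illustrative assumptions, not geth code.

```go
package main

import "fmt"

// Hypothetical stand-ins for rawdb lookups: headers maps block number
// to the locally stored header hash, canonical maps block number to
// the canonical-chain hash for that height.
var headers = map[uint64]string{100: "0xabc", 101: "0xdef"}
var canonical = map[uint64]string{100: "0xabc", 101: "0xbad"}

// shouldRepair approves a fixup only for blocks whose header is both
// known and on the canonical chain, so RPC requests for nonexistent
// blocks or pruned side chains never trigger recovery.
func shouldRepair(number uint64, hash string) bool {
	if stored, ok := headers[number]; !ok || stored != hash {
		return false // header unknown locally
	}
	return canonical[number] == hash // must be canonical, not a side chain
}

func main() {
	fmt.Println(shouldRepair(100, "0xabc")) // true: known and canonical
	fmt.Println(shouldRepair(101, "0xdef")) // false: known side-chain header
	fmt.Println(shouldRepair(102, "0x123")) // false: unknown header
}
```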

But you've steered me in another direction: perhaps the repair should be manual, triggered by a debug_ command. I can still detect and report receipt gaps to the log, but the actual repair would be manual.
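A manually triggered repair would be backed by a debug-namespace service method. The type, method name (following the debug_resyncReceiptRange spelling mentioned later in this thread), and validation logic below are all assumptions; this only sketches the argument checking such a handler might do before starting a repair.

```go
package main

import (
	"errors"
	"fmt"
)

// DebugAPI sketches the service struct that could back a hypothetical
// debug_resyncReceiptRange RPC method (geth maps the exported method
// ResyncReceiptRange to that name under the "debug" namespace).
type DebugAPI struct {
	head uint64 // current local chain head
}

// ResyncReceiptRange validates the requested block range before
// kicking off a manual receipt repair, rejecting inverted ranges and
// anything beyond the known chain head.
func (api *DebugAPI) ResyncReceiptRange(first, last uint64) error {
	if first > last {
		return errors.New("invalid range: first > last")
	}
	if last > api.head {
		return errors.New("range extends past chain head")
	}
	// A real implementation would hand [first, last] to the one-off
	// downloader here and report progress to the log.
	return nil
}

func main() {
	api := &DebugAPI{head: 15000000}
	fmt.Println(api.ResyncReceiptRange(14191000, 14191830)) // <nil>
}
```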

wjmelements avatar Aug 30 '22 16:08 wjmelements

I think I know how the gap can happen; the receipts are downloaded after the headers in SnapSync, and while it may be retried on failure, shutdown can happen before then, and if the header chain is complete after snap sync starts again, snap sync won't go looking for missing receipts. Automated recovery would fix the issue, but I am still pursuing the manual approach, to minimize the risk of such automation further corrupting the database.

wjmelements avatar Aug 30 '22 19:08 wjmelements

I don't believe I should use the receiptTaskPool of downloader.Downloader because the missing receipts are in the stale range, and I don't think a debug_resyncReceiptRange method should change queue.resultOffset, since that has its own lifecycle. Safely reducing the resultOffset within the normal managed lifecycle would not be good either, since the downloader would then sync from that point. So I am favoring a separate downloader (tentatively named Scavenger) specialized for such one-off requests.
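The key property of such a one-off downloader is that it keeps its own cursor over the stale range instead of touching the main downloader's resultOffset. The sketch below batches a fixed range through a hypothetical `fetchReceipts` peer request; the names and the trivially successful fetcher are illustrative assumptions, with no relation to the eth protocol plumbing a real Scavenger would need.

```go
package main

import "fmt"

// fetchReceipts is a hypothetical peer request returning receipt data
// for a contiguous block range; here it just echoes the block numbers.
func fetchReceipts(from, to uint64) ([]uint64, error) {
	out := make([]uint64, 0, to-from+1)
	for n := from; n <= to; n++ {
		out = append(out, n)
	}
	return out, nil
}

// scavenge downloads a one-off receipt range in small batches,
// independently of the main downloader's queue and resultOffset.
func scavenge(first, last, batch uint64) (int, error) {
	fetched := 0
	for from := first; from <= last; from += batch {
		to := from + batch - 1
		if to > last {
			to = last // clamp the final partial batch
		}
		rs, err := fetchReceipts(from, to)
		if err != nil {
			return fetched, err
		}
		fetched += len(rs)
	}
	return fetched, nil
}

func main() {
	// The 831-block gap from the issue would be covered like this.
	n, _ := scavenge(1000, 1830, 128)
	fmt.Println(n) // 831
}
```

In a real implementation each batch would be verified against the canonical headers and written with InsertReceiptChain before advancing the cursor.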

wjmelements avatar Aug 30 '22 22:08 wjmelements

A similar issue I am seeing on another node:

Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.110] Loaded most recent local header          number=15,437,129 hash=6f2521..0d7e54 td=57,516,172,497,790,252,883,625 age=1w22h15m
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.111] Loaded most recent local full block      number=15,437,129 hash=6f2521..0d7e54 td=57,516,172,497,790,252,883,625 age=1w22h15m
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.111] Loaded most recent local fast block      number=15,437,129 hash=6f2521..0d7e54 td=57,516,172,497,790,252,883,625 age=1w22h15m
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.111] Loaded last fast-sync pivot marker       number=15,356,506
Sep 06 15:09:41 adm geth[1213331]: WARN [09-06|15:09:41.116] Head state missing, repairing            number=15,437,129 hash=6f2521..0d7e54 snaproot=fdb3ec..80f225
Sep 06 15:10:13 adm geth[1213331]: ERROR[09-06|15:10:13.190] Error in block freeze operation          err="canonical hash missing, can't freeze block 14793693"

In this case the canonical hash is missing, which is considerably more severe. I don't plan to fix that in my change.

wjmelements avatar Sep 06 '22 22:09 wjmelements

I have a WIP that seemed to work, but later the freezer table ran into another issue where it was missing the canonical hash for some blocks (not the ones previously missing receipts). So it seems the corruption was broader than previously understood.

In these cases we really should be spending time to figure out what happened and why the gap exists and fix the underlying issue, not the symptom.

When I was going through the p2p code the mystery widened, because it seems that for most (maybe all?) of my peers the receipt bandwidth is 1 per request, so the 831-block gap was less likely to be a single chunk that was missed or not retried. More likely, an entire segment was never queued for receipt fetching in the first place. I still suspect this is from bad shutdown timing during the initial sync.

wjmelements avatar Sep 19 '22 17:09 wjmelements

I think I know how the gap can happen; the receipts are downloaded after the headers in SnapSync, and while it may be retried on failure, shutdown can happen before then, and if the header chain is complete after snap sync starts again, snap sync won't go looking for missing receipts

Missing receipts (block bodies) most definitely should be downloaded if they weren't completely finished before shutdown happened. So snap sync should go looking for missing receipts (and block bodies).

But apparently there is something wrong (since we sometimes miss receipts), so if you have a theory of what happens I'm all ears.

holiman avatar Apr 26 '23 16:04 holiman

I ended up nuking my reproducing datadir and resyncing, but I still have the manual recovery function implemented on my branch. I can cut a PR for it if y'all want it, but I previously got the impression that you would not.

wjmelements avatar Apr 26 '23 20:04 wjmelements

I think we're not that interested in the manual recovery feature, but if you have any idea about the root cause, that would be interesting.

The reason you stated above sounds incorrect to me. But maybe I am wrong and we have some error there somewhere. So yeah, fresh ideas are appreciated in solving the root issue.

holiman avatar Apr 26 '23 20:04 holiman