Intent to Implement: Recovery of Missing Logs
Rationale
Why should this feature exist?
Minor database corruption should be automatically recoverable without resyncing.
What are the use-cases?
There is an 831-block gap between the receipts in my ancients and the receipts in LevelDB. This prevents freezing.
Sep 01 17:02:00 pine geth[3250214]: ERROR[09-01|17:02:00.292] Error in block freeze operation err="block receipts missing, can't freeze block 14191561"
But the gap is from so long ago that I don't want to resync if I don't have to. To my knowledge, geth removedb is the official recommended fix for this problem.
Implementation
Do you have ideas regarding the implementation of this feature?
Yes. I plan to detect the missing logs in ReadReceiptsRLP. When missing logs are detected, a goroutine will be started to initiate recovery. The boundaries of the missing logs will be found with a binary search, and the receipts will then be inserted with InsertReceiptChain. I believe I will have to either modify the Downloader to accept an end block or write a separate downloader for this.
Are you willing to implement this feature?
Yes, and I have started an implementation. I am looking for early feedback on the Downloader approach, so I don't accidentally break my reproducing environment.
Besides holding the ancients back, the log gap also causes a problem when syncing deposits from Eth1 for Eth2.
The trie healer in snap sync doesn't heal receipts. Yet another approach to solving the issue might augment the healer such that it can receive requests to heal receipts. I am not currently pursuing this approach because it seems that the healer is fairly specialized at healing the trie.
To work around the import dependency graph, the precise error I am detecting in core/rawdb will likely need to be propagated upward to its consumers, some of which will report it to the recovery system. So an error will need to be added to the return values of ReadReceiptsRLP and its consumers. This will slightly increase the size of my changeset, mostly in the number of files touched.
I'm unsure this is a good idea. If the database is corrupted, something went horribly wrong. It should never get corrupted.
Adding mechanisms to detect and correctly fix faults will be a huge effort to create and maintain and they will probably never completely cover all the corner cases. In these cases we really should be spending time to figure out what happened and why the gap exists and fix the underlying issue, not the symptom.
E.g. ReadReceiptsRLP can also be used, AFAIK, by the RPC, where you can actually request non-existent data. Or perhaps already-pruned sidechains, or, in the future, pruned legacy chain segments. In all these cases, you don't want to trigger a fixup. This is just one example showing that correctly detecting that data is missing and should not be is already an interesting challenge.
where you can actually request non-existent data. Or perhaps already-pruned sidechains, or, in the future, pruned legacy chain segments. In all these cases, you don't want to trigger a fixup
I think I am handling those cases by only triggering the fixup when the headers in the range are both known and canonical.
But you've steered me in another direction: perhaps the repair should be manual, triggered by a debug_ command. I can still detect and report receipt gaps to the log, but the actual repair would be manual.
I think I know how the gap can happen: in snap sync, the receipts are downloaded after the headers, and while a failed fetch may be retried, a shutdown can happen before the retry; if the header chain is already complete when snap sync starts again, it won't go looking for the missing receipts. Automated recovery would fix the issue, but I am still pursuing the manual approach, to minimize the risk of such automation further corrupting the database.
I don't believe I should use the receiptTaskPool of downloader.Downloader, because the missing receipts are in the stale range, and I don't think a debug_resyncReceiptRange method should change queue.resultOffset, since that has its own lifecycle. Safely reducing resultOffset within a normally managed lifecycle would not be good either, since the downloader would then sync from that point. So I am favoring a separate downloader (tentatively named Scavenger) specialized for such one-off requests.
A similar issue I am seeing on another node:
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.110] Loaded most recent local header number=15,437,129 hash=6f2521..0d7e54 td=57,516,172,497,790,252,883,625 age=1w22h15m
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.111] Loaded most recent local full block number=15,437,129 hash=6f2521..0d7e54 td=57,516,172,497,790,252,883,625 age=1w22h15m
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.111] Loaded most recent local fast block number=15,437,129 hash=6f2521..0d7e54 td=57,516,172,497,790,252,883,625 age=1w22h15m
Sep 06 15:09:41 adm geth[1213331]: INFO [09-06|15:09:41.111] Loaded last fast-sync pivot marker number=15,356,506
Sep 06 15:09:41 adm geth[1213331]: WARN [09-06|15:09:41.116] Head state missing, repairing number=15,437,129 hash=6f2521..0d7e54 snaproot=fdb3ec..80f225
Sep 06 15:10:13 adm geth[1213331]: ERROR[09-06|15:10:13.190] Error in block freeze operation err="canonical hash missing, can't freeze block 14793693"
In this case the canonical hash is missing, which is considerably more severe. I don't plan to fix that in my change.
I have a WIP that seemed to work, but later the freezer table ran into another issue: it was missing the canonical hash for some blocks (not the ones previously missing receipts). So it seems the corruption was broader than previously understood.
In these cases we really should be spending time to figure out what happened and why the gap exists and fix the underlying issue, not the symptom.
When I was going through the p2p code, the mystery widened: it seems that for most (maybe all?) of my peers, the receipt bandwidth is 1 per request, so the gap of 831 blocks was less likely to be just one chunk that was missed or not retried. More likely, an entire segment was never queued for receipt fetching in the first place. I still suspect this is from bad shutdown timing during the initial sync.
I think I know how the gap can happen: in snap sync, the receipts are downloaded after the headers, and while a failed fetch may be retried, a shutdown can happen before the retry; if the header chain is already complete when snap sync starts again, it won't go looking for the missing receipts
Missing receipts (and block bodies) most definitely should be downloaded if they weren't completely fetched before the shutdown happened. So snap sync should go looking for missing receipts (and block bodies).
But apparently there is something wrong (since we sometimes miss receipts), so if you have a theory of what happens I'm all ears.
I ended up nuking my reproducing datadir and resyncing, but I still have the manual recovery function implemented on my branch. I can cut a PR for it if y'all want it. But I previously got the impression that you would not.
I think we're not that interested in the manual recovery feature, but if you have any idea about the root cause, that would be interesting.
The reason you stated above does not sound correct to me. But maybe I am wrong and we have some error in there somewhere. So yeah, fresh ideas are appreciated in solving the root issue.