git: sync rewards calculated incorrectly because totalActiveBalance is off by one epoch in EpochData storage
System information
OS & Version: Linux
Commit hash: b66cc690b8002cf54f7dea09dd2b5ce0a232ea63
Erigon Command (with flags/config): full caplin sync
The problem
Sync rewards are calculated slightly incorrectly:
$ syncwrong.sh
Erigon calculated reward:
722976
From beaconcha.in:
'sync_committee_reward': 723008},
The calculated reward is off by 32 because every slot is off by 1, and there are 32 slots in an epoch.
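Spelling out that arithmetic with the two numbers above (a quick sanity check, nothing Erigon-specific):

package main

import "fmt"

func main() {
	const slotsPerEpoch = 32
	erigonReward := 722976    // Erigon's per-slot reward multiplied by 32 (script output above)
	referenceReward := 723008 // sync_committee_reward reported by beaconcha.in for the same epoch

	fmt.Println(referenceReward/slotsPerEpoch - erigonReward/slotsPerEpoch) // 1 Gwei short per slot
	fmt.Println(referenceReward - erigonReward)                             // 32 Gwei short over the epoch
}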
Here is the script called syncwrong.sh if anyone wants to reproduce the issue:
#!/bin/bash
epoch=317934
slot=$((epoch*32))
validator=5380
echo Erigon calculated reward:
curl -s -d '["'$validator'"]' http://localhost:3500/eth/v1/beacon/rewards/sync_committee/$slot | jq '.data[0].reward | tonumber * 32'
echo From beaconcha.in:
tests/caplinrpc/beaconcha.in-query.py $validator $epoch | grep sync_committee_reward
Workaround
I investigated for a while. First of all, if anybody wants a quick fix, here is the patch that I am using now:
modified cl/beacon/handler/rewards.go
@@ -169,7 +169,7 @@ func (a *ApiHandler) PostEthV1BeaconRewardsSyncCommittees(w http.ResponseWriter,
if !isCanonical {
return nil, beaconhttp.NewEndpointError(http.StatusNotFound, errors.New("non-canonical finalized block not found"))
}
- epochData, err := state_accessors.ReadEpochData(tx, a.beaconChainCfg.RoundSlotToEpoch(blk.Block.Slot))
+ epochData, err := state_accessors.ReadEpochData(tx, a.beaconChainCfg.RoundSlotToEpoch(blk.Block.Slot+32))
if err != nil {
return nil, err
}
With this patch applied, I confirmed that the calculation is fixed. The patch itself doesn't make sense, though: why would we need the epochData of the next epoch to do a calculation for this epoch?
Investigation
I added some debug print statements, and my current conclusion is that, unfortunately, the epochData.TotalActiveBalance field is off by one epoch in the database. This happens because cl/antiquary/beacon_states_collector.go:storeEpochData writes to the antiquary DB before cl/phase1/forkchoice/on_block.go:OnBlock actually handles the epoch transition. I'm not 100% sure of this, but fairly confident.
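To make the suspected ordering concrete, here is a minimal sketch. It is not the real Erigon call graph; the types and bodies are invented, and only the storeEpochData / epoch-transition roles come from the files mentioned above:

package sketch

// Illustrative only: the hypothesis is that the EpochData row is written from a
// state that has not yet gone through the epoch transition, so TotalActiveBalance
// still holds the previous epoch's value.

type beaconState struct {
	totalActiveBalance uint64 // recomputed during the epoch transition
}

// Suspected current ordering: snapshot first, transition second.
func antiquateSuspected(s *beaconState) {
	storeEpochData(s)         // EpochData written with the previous epoch's balance
	processEpochTransition(s) // balance updated too late for the snapshot
}

// Ordering that would store the expected value.
func antiquateExpected(s *beaconState) {
	processEpochTransition(s)
	storeEpochData(s) // EpochData written with the current epoch's balance
}

func processEpochTransition(s *beaconState) { /* recompute s.totalActiveBalance, etc. */ }
func storeEpochData(s *beaconState)         { /* write the EpochData row into the antiquary DB */ }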
What is certain is that the epoch data is already wrong in the DB with the current 3.0.0-alpha caplin:
$ curl http://localhost:3500/lighthouse/validator_inclusion/317928/125
{"data":{"total_active_balance":34334260000000000,"total_active_balance_prev":34334516000000000}}
$ sleep 3600
$ curl http://localhost:3500/lighthouse/validator_inclusion/317928/125
{"data":{"total_active_balance":34334516000000000,"total_active_balance_prev":34334260000000000}}
I modified the validator_inclusion endpoint locally to only show total_active_balance instead of the original lighthouse-specific debug values. Then I waited for epoch 317928 to finish and ran the above commands.
Reference data: https://beaconcha.in/epoch/317928 (correct number is 34,334,260 ETH).
As one can see, the first answer from caplin is correct (served from the forkchoice store), but the second one is incorrect (served from the antiquary DB), and it matches my theory that it is off by one epoch: https://beaconcha.in/epoch/317927
I'm happy to work on this and propose pull requests, but I need quite a lot of help, because I have a lot of questions:
- how do we make sure that we only trigger creation of the DB entry a bit later, once OnBlock has already updated the data?
- can we somehow tell whether other fields in EpochData are also off by one or not?
- how do we handle the already existing DBs of existing erigon instances?
- should we just add the +32 hack with a big wall of comments and accept the ugliness?
Can someone from the erigon team kindly help out here? @Giulio2002, can you please take this on, or route me to the right person on the team if you are not the appropriate one? Thanks!
mmmh - Ok - I think I can guess what the issue is... Btw, thanks for noticing... I am about to remake the archival node of caplin so that it can be synced in only a few hours, so this helps.
$ curl http://localhost:3500/lighthouse/validator_inclusion/317928/125
{"data":{"total_active_balance":34334260000000000,"total_active_balance_prev":34334516000000000}}
$ sleep 3600
$ curl http://localhost:3500/lighthouse/validator_inclusion/317928/125
{"data":{"total_active_balance":34334516000000000,"total_active_balance_prev":34334260000000000}}
Can I have some context on what this is? An ongoing epoch that then finished and got processed into the database? Anyway - I think it is unlikely that ReadEpochData is messed up, because if that were the case, the replay of historical states would be significantly broken, as it relies on that value being correct...
Hey could you try branch fix-total-active-arch? it will not fix historical active balances but it will start generating correct ones onwards
Can I have some context on what this is? An ongoing epoch that then finished and got processed into the database?
The context is the following:
- the first curl was executed when the epoch was already finalized but still reachable from the forkchoice store (epoch > forkchoicestore.lowestavailable)
- the second curl was executed when the epoch was already so old that it was only available from the DB (epoch << forkchoicestore.lowestavailable)
The epoch was finalized in both cases, but in the first case it had been finalized just 1-2 minutes earlier, while in the second case it had been finalized (and antiquated) more than an hour earlier. A rough sketch of these two read paths follows.
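Purely illustrative; lowestAvailableEpoch, readFromForkchoice and readFromAntiquaryDB are made-up names, not Erigon identifiers:

package sketch

// Routing of the validator_inclusion query as described above.
func totalActiveBalance(epoch uint64) (uint64, error) {
	if epoch >= lowestAvailableEpoch() {
		// Recently finalized epochs are still held by the in-memory forkchoice
		// store; this path returned the correct value in the first curl.
		return readFromForkchoice(epoch)
	}
	// Once the epoch falls below the forkchoice horizon, the answer comes from
	// the antiquated EpochData in the DB; this path returned the off-by-one value.
	return readFromAntiquaryDB(epoch)
}

// Stubs so the sketch stands on its own.
func lowestAvailableEpoch() uint64                     { return 0 }
func readFromForkchoice(epoch uint64) (uint64, error)  { return 0, nil }
func readFromAntiquaryDB(epoch uint64) (uint64, error) { return 0, nil }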
Hey could you try branch fix-total-active-arch? it will not fix historical active balances but it will start generating correct ones onwards
I currently don't have a btrfs setup, so I can't easily make a copy-on-write backup of my DB, and therefore I can't really try changes that leave the DB half old, half new.
My worry is the following: once I apply this change and start running with it, I can't even write a workaround, because I would have to somehow know at query time whether I'm querying the "old and broken" part or the "new and good" part.
Alternatively, is your change fully isolated to <datadir>/caplin? If I back that up, can I try it out, report back, and restore? Sorry, I'm not familiar enough with the data layout to risk this without asking first.
Also, now that we are talking about this code, I have a follow-up question: what happens if the epoch slot (or, after the fix, the epoch+1 slot) is missed on the beacon chain? Does this code still run? It seems important to do this bookkeeping at the right moment, and I was just wondering about the different error cases.
yes - all good... the slot has to be processed even if empty.
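For context on why an empty slot is not a problem: in the consensus spec, the state is advanced slot by slot and the epoch transition runs at every epoch boundary whether or not a block was proposed for that slot. A simplified paraphrase (not Erigon code, types are illustrative):

package sketch

const slotsPerEpoch = 32

type beaconState struct {
	slot uint64
	// ... balances, validators, etc.
}

// processSlots advances the state to targetSlot one slot at a time.
func processSlots(s *beaconState, targetSlot uint64) {
	for s.slot < targetSlot {
		// per-slot bookkeeping (caching state/block roots) happens here
		if (s.slot+1)%slotsPerEpoch == 0 {
			processEpochTransition(s) // runs even if the boundary slot has no block
		}
		s.slot++
	}
}

func processEpochTransition(s *beaconState) { /* recompute total active balance, rewards, etc. */ }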
Hmm, is there a command to delete caplin's processed states from the DB by slot/epoch while erigon is stopped? Then I would be able to test your code by remembering where caplin was before I tried it and later deleting the epochs/slots that were created by the patched code. Also, I will make my transition to btrfs with the next resync. I really didn't think I would get this deep into erigon development; when I first installed it, I just wanted a full-history ethereum+beacon node and thought ext4 would be faster - I didn't know I would become this involved.
Mh - I can be the one testing it... no worries
for now - use your patch... I have a synced node and will do testing shortly
I fixed it in https://github.com/erigontech/erigon/pull/12314 . I will not include this in the next alpha though. I recommend you use your patch for the time being, and I will leave this ticket open accordingly.
Sounds like a plan. Quick idea: why don't we include my patch in git until then? The hack is fully local to the RPC code and easy to remove at the same time you include #12314 later. We could even include the removal in #12314 already, so you don't forget.
Should I prepare a PR with an appropriate comment saying it's a temporary hack?
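For illustration, the temporary change could simply be the one line from the patch above plus a loud comment; the wording below is only a suggestion:

// TEMPORARY HACK, remove together with https://github.com/erigontech/erigon/pull/12314:
// EpochData.TotalActiveBalance is currently stored off by one epoch (see this issue),
// so read the next epoch's row (+32 slots = +1 epoch) to compensate on the query side.
epochData, err := state_accessors.ReadEpochData(tx, a.beaconChainCfg.RoundSlotToEpoch(blk.Block.Slot+32))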
Oh, and don't forget that I haven't investigated whether the other fields of EpochData are also off by one epoch, which could cause trouble in the future.
no - it is just that we are not resetting a cache. all other fields are fine
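For anyone reading along later, "not resetting a cache" can be pictured with a minimal sketch like the one below (purely illustrative, assuming a memoized total-active-balance keyed by epoch; this is not the actual Erigon cache):

package sketch

// A memoized total-active-balance goes stale by exactly one epoch if the lookup
// forgets to check which epoch the cached value was computed for.
type activeBalanceCache struct {
	epoch uint64
	value uint64
	valid bool
}

func (c *activeBalanceCache) get(epoch uint64, recompute func() uint64) uint64 {
	// Checking only c.valid here (and not c.epoch == epoch) would keep returning
	// the previous epoch's balance after the transition - the off-by-one pattern.
	if c.valid && c.epoch == epoch {
		return c.value
	}
	c.epoch, c.value, c.valid = epoch, recompute(), true
	return c.value
}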