
prysm 3.0.0 beacon data gets corrupted

Open ybstaked opened this issue 2 years ago • 15 comments

🐞 Bug Report

Description

prysm 3.0.0 beacon data gets corrupted while running

🔬 Minimal Reproduction

running prysm beacon 3.0.0

beacon-chain --accept-terms-of-use --datadir /home/prysm/data --rpc-host 0.0.0.0 --rpc-port 4000 --grpc-gateway-host 0.0.0.0 --grpc-gateway-port 4001 --log-file /tmp/ethereum2.log --block-batch-limit=512 --http-web3provider=http://el-geth-mainnet-4-node-svc.eth.svc.cluster.local:8551/ --jwt-secret=/home/prysm/data/jwt/jwt-secret --checkpoint-sync-url=XXX --genesis-beacon-api-url=XXX --mainnet --suggested-fee-recipient=XXX

🔥 Error

Before node is broken

time="2022-08-26 13:22:59" level=info msg="Peer summary" activePeers=45 inbound=0 outbound=45 prefix=p2p
time="2022-08-26 13:23:00" level=warning msg="Recovering state summary for finalized root: 0x6a097ddb6aa3" prefix=db
time="2022-08-26 13:23:00" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0x6a097ddb6aa3: missing state summary for checkpoint root" prefix=init
time="2022-08-26 13:23:00" level=info msg="Processing block batch of size 495 starting from  0x23e46bac... 4554913/4558013 - estimated time remaining 2m5s" blocksPerSecond=24.8 peers=45 prefix=initial-sync

Upon a restart after the previous state

time="2022-08-26 14:11:37" level=info msg="Peer summary" activePeers=43 inbound=0 outbound=43 prefix=p2p
time="2022-08-26 14:11:38" level=info msg="Processing block batch of size 498 starting from  0x4bdc9494... 4556449/4558256 - estimated time remaining 13s" blocksPerSecond=138.8 peers=43 prefix=initial-sync
time="2022-08-26 14:11:38" level=warning msg="Batch is not processed" error="beacon node doesn't have a parent in db with root: 0xf80913bd90d52370de2df89c98cdf9bc4d40ab07365d29ff2c8ce99e45a6b215 (in processBatchedBlocks, slot=4556449)" prefix=initial-sync
time="2022-08-26 14:11:47" level=info msg="Processing block batch of size 495 starting from  0x23e46bac... 4554913/4558257 - estimated time remaining 20s" blocksPerSecond=163.6 peers=46 prefix=initial-sync

Once node is broken

time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main

🌍 Your Environment

Operating System:

Kubernetes

What version of Prysm are you running? (Which release)

v3.0.0

Anything else relevant (validator index / public key)?

Caused missed attestations for:

  • https://beaconcha.in/validator/50226#attestations
  • https://beaconcha.in/validator/50230#attestations
  • https://beaconcha.in/validator/112312#attestations
  • https://beaconcha.in/validator/128525#attestations
  • https://beaconcha.in/validator/128526#attestations
  • https://beaconcha.in/validator/128527#attestations

ybstaked avatar Aug 26 '22 16:08 ybstaked

Confirming I am having similar issues using Image: gcr.io/prysmaticlabs/prysm/beacon-chain:v3.0.0:

time="2022-08-29 15:31:16" level=warning msg="Running on Ethereum Mainnet" prefix=flags
time="2022-08-29 15:31:16" level=info msg="Checking DB" database-path="/var/lib/prysm/beacon-node/beaconchaindata" prefix=node
time="2022-08-29 15:31:16" level=info msg="Opening Bolt DB at /var/lib/prysm/beacon-node/beaconchaindata/beaconchain.db" prefix=db
time="2022-08-29 15:31:16" level=info msg="Deposit contract: 0x00000000219ab540356cbb839cbe05303d7705fa" prefix=node
time="2022-08-29 15:31:16" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main
Stream closed EOF for int/erigon-2 (prysm-beacon)

I run multiple erigon nodes, and this only happens on one of them (each node is a k8s Pod with an erigon container and a prysm-beacon container).

I was able to work around this problem by scaling the erigon replicas down below the one whose prysm-beacon was in the bad state, deleting the PersistentVolumeClaim, the PersistentVolume, and the disk in GCP, then scaling the erigon replicas back up.

beejiujitsu avatar Aug 29 '22 15:08 beejiujitsu

Thanks for the report. I'll admit this error is a bit unhelpful, as there are three places that could have returned this unmarshalling error. I have added some additional context in PR #11342.

We also run in kubernetes with go-ethereum and have never experienced this issue. Would you be willing to share a corrupted database privately?

prestonvanloon avatar Aug 29 '22 15:08 prestonvanloon

Unfortunately, no; we had to clean up the PVCs and restart from checkpoint sync so we wouldn't continue to miss attestations and lose rewards.

ybstaked avatar Aug 29 '22 16:08 ybstaked

Unfortunately, no; we had to clean up the PVCs and restart from checkpoint sync so we wouldn't continue to miss attestations and lose rewards.

OK, no worries. #11342 should help in v3.0.1. We will keep monitoring and keep the investigation open until we can pinpoint the issue.

If anyone stumbles upon this issue and hasn't deleted their beaconchain.db, please save a copy and reach out to me at [email protected] or on Discord. It would greatly help to see which bytes are failing to unmarshal.

prestonvanloon avatar Aug 29 '22 17:08 prestonvanloon

Seeing another set of errors on another node:

time="2022-08-30 15:09:28" level=info msg="Processing block batch of size 199 starting from  0x8042c314... 4586753/4587345 - estimated time remaining 59s" blocksPerSecond=9.9 peers=46 prefix=initial-sync
time="2022-08-30 15:09:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=46 prefix=p2p
time="2022-08-30 15:09:37" level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
time="2022-08-30 15:10:07" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:10:07" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:10:07" level=info msg="Processing block batch of size 199 starting from  0x8042c314... 4586753/4587348 - estimated time remaining 59s" blocksPerSecond=9.9 peers=45 prefix=initial-sync
time="2022-08-30 15:10:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=46 prefix=p2p
time="2022-08-30 15:10:47" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:10:47" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:10:47" level=info msg="Processing block batch of size 507 starting from  0x8042c314... 4586753/4587352 - estimated time remaining 23s" blocksPerSecond=25.4 peers=46 prefix=initial-sync
time="2022-08-30 15:11:29" level=info msg="Peer summary" activePeers=45 inbound=0 outbound=45 prefix=p2p
time="2022-08-30 15:11:54" level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
time="2022-08-30 15:12:27" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db
time="2022-08-30 15:12:27" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync
time="2022-08-30 15:12:27" level=info msg="Processing block batch of size 507 starting from  0x8042c314... 4586753/4587360 - estimated time remaining 23s" blocksPerSecond=25.4 peers=44 prefix=initial-sync
time="2022-08-30 15:12:29" level=info msg="Peer summary" activePeers=46 inbound=0 outbound=44 prefix=p2p
time="2022-08-30 15:13:29" level=info msg="Peer summary" activePeers=44 inbound=0 outbound=44 prefix=p2p

I don't have direct access to this mainnet beaconchain.db, so I cannot send it to you.

ybstaked avatar Aug 30 '22 15:08 ybstaked

Maybe this is an issue when restoring from checkpoint sync?

ybstaked avatar Aug 30 '22 15:08 ybstaked

@ybstaked Does this error occur right away when starting a node with checkpoint sync?

time="2022-08-30 15:10:47" level=warning msg="Batch is not processed" error="could not process block in batch: could not save finalized checkpoint, finalized root: 0xf4af42324d36: missing state summary for checkpoint root" prefix=initial-sync

Edit: Could you share what flags you are using for this node and the node serving the checkpoint?

prestonvanloon avatar Aug 30 '22 15:08 prestonvanloon

No, this error does not occur right away when starting a node with checkpoint sync. I shared our node's flags before, but here they are again. We are currently using Infura as the checkpoint node.

beacon-chain --accept-terms-of-use --datadir /home/prysm/data --rpc-host 0.0.0.0 --rpc-port 4000 --grpc-gateway-host 0.0.0.0 --grpc-gateway-port 4001 --log-file /tmp/ethereum2.log --block-batch-limit=512 --http-web3provider=http://el-geth-mainnet-4-node-svc.eth.svc.cluster.local:8551/ --jwt-secret=/home/prysm/data/jwt/jwt-secret --checkpoint-sync-url=XXX --genesis-beacon-api-url=XXX --mainnet --suggested-fee-recipient=XXX

ybstaked avatar Aug 30 '22 15:08 ybstaked

Thanks for that context. We are still investigating both of these issues. The first one may be related to an improper shutdown of the client. The Kubernetes podSpec default for terminationGracePeriodSeconds is 30, which may be too short for Prysm to shut down properly. I would recommend increasing it to 120 and ensuring that you are using a solid-state disk for faster writes.
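For reference, that setting lives at the top level of the Pod spec, not on the container. A minimal sketch (container name and image tag are illustrative, taken from the image mentioned earlier in this thread):

```yaml
# Illustrative Pod spec fragment; only terminationGracePeriodSeconds is the point here.
spec:
  terminationGracePeriodSeconds: 120   # default is 30, often too short for a clean Prysm shutdown
  containers:
    - name: prysm-beacon
      image: gcr.io/prysmaticlabs/prysm/beacon-chain:v3.0.0
```

On pod deletion, Kubernetes sends SIGTERM, waits up to this many seconds, then sends SIGKILL; a longer grace period gives Prysm time to flush its database before the hard kill.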

prestonvanloon avatar Aug 30 '22 15:08 prestonvanloon

OK, noted, but in this case the first error I sent you was from before any restarts or shutdowns: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231810731

ybstaked avatar Aug 30 '22 15:08 ybstaked

@prestonvanloon I emailed you the beaconchain.db specifically from: https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231810731

ybstaked avatar Aug 30 '22 16:08 ybstaked

Thanks @ybstaked. I was able to reproduce this issue. Still investigating...

prestonvanloon avatar Aug 30 '22 18:08 prestonvanloon

I believe this issue is solved:

time="2022-08-30 15:10:47" level=warning msg="Recovering state summary for finalized root: 0xf4af42324d36" prefix=db

But I do not really know how this other one happens. Would it be possible to check with the current develop images?

time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main

potuz avatar Aug 31 '22 13:08 potuz

Thanks @potuz. For this error, @prestonvanloon mentioned that it may be related to an improper shutdown (https://github.com/prysmaticlabs/prysm/issues/11327#issuecomment-1231843716) and also said:

"Later this week, I'll try some experimentation with SIGKILL to see if I can replicate the state you were in. We may not be able to prevent improper shutdowns, but maybe we can repair damaged databases automatically."

We will increase terminationGracePeriodSeconds on our end, but can the software also be made more resilient to this?

time="2022-08-26 03:21:45" level=error msg="could not get ancestor state: failed to unmarshal encoding: incorrect size" prefix=main

ybstaked avatar Aug 31 '22 13:08 ybstaked

By the way, I just checked and we use a terminationGracePeriodSeconds of 180 seconds (this is what was set when I posted that error, so already more than your recommended 120 seconds).

ybstaked avatar Sep 02 '22 12:09 ybstaked

Hi @ybstaked, wondering if you are still running into this issue post-merge.

rauljordan avatar Jan 05 '23 00:01 rauljordan

@rauljordan This is still relevant, I am currently looking at it

nisdas avatar Jan 05 '23 10:01 nisdas

Having the same error. At first Prysm synced nicely from the provided checkpoint and was following the chain head. I shut it down in order to set up a systemd service, tried starting Prysm with the same command, and am now getting ERROR main: could not get ancestor state: slot 5616191 not in db due to checkpoint sync: cannot retrieve data for slot

leohhhn avatar Jan 20 '23 15:01 leohhhn