
Cannot create attestation for future slot errors

Open ajsutton opened this issue 2 years ago • 20 comments

Description

Investigate why the beacon node time updates lag behind sometimes and cause attestation failures with "Cannot create attestation for future slot" error messages.

ajsutton avatar Aug 04 '22 01:08 ajsutton

I am getting about 5 of these every day now

W1337lt avatar Aug 16 '22 01:08 W1337lt

Worth noting here that the most common causes of this are 1) having -Xmx set too low, which causes Java to spend most of its time running GC (-Xmx5g is generally recommended, but feel free to ask for advice on Discord for your specific setup), or 2) having contention on disk IO - particularly common with the merge coming up and a lot of people syncing an EL on the same machine.
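
For the heap, the exact mechanism depends on how you launch Teku, but on a typical Linux install something along these lines (a rough sketch - adjust the value and paths for your own setup) sets it via the JAVA_OPTS environment variable before starting the beacon node:

export JAVA_OPTS="-Xmx5g"        # cap the JVM heap at 5 GB; size this to your hardware
./bin/teku --network=mainnet ...your usual options...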

So one part of reducing this is moving to async storage updates (#1934), which is a work in progress. There are likely other areas that could be optimised as well. It would be good to get multiple parallel investigations going into the different aspects of this.

ajsutton avatar Sep 12 '22 03:09 ajsutton

Also saw this post-merge. I'm running Besu, Teku BN and Teku Validator as separate processes. The BN has -Xmx6g, there are no old-gen GCs or CPU spikes showing in Prometheus monitoring for any of the processes, and it's running on an NVMe drive so there shouldn't be IO issues (in theory). There's various discussion with logs and screenshots in Discord at around 15th Sept 09:55 UTC. Is there a way of gathering useful debug logs from leveldb2, if that is the likely contender for I/O spikes?

This appears in the leveldb LOG file just after the BN errors; don't know if it's useful or not.

2022/09/15-09:02:10.916789 140535372158720 Compacting 4@0 + 0@1 files

mread avatar Sep 15 '22 10:09 mread

This is happening to me constantly post-merge (but didn't happen at all pre). Happy to provide any logs necessary to help rectify!

EvilJordan avatar Sep 15 '22 15:09 EvilJordan

This is happening to me constantly post-merge (but didn't happen at all pre). Happy to provide any logs necessary to help rectify!

@EvilJordan A user reporting a similar situation seemed to solve it by deleting the db and starting from a checkpoint sync. Can you do that? Let us know if you need instructions.
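
Roughly, assuming a Linux service with the default layout (please double-check the paths and flags for your own install before deleting anything), it would look like:

sudo systemctl stop teku
rm -rf <your-data-path>/beacon     # beacon chain db only; keeps validator keys and slashing-protection data
teku --initial-state=<trusted-checkpoint-url> ...your usual options...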

tbenr avatar Sep 15 '22 16:09 tbenr

Deleting Teku's db, right? Would love some expert-guidance so I don't screw it up!

EvilJordan avatar Sep 15 '22 16:09 EvilJordan

@EvilJordan https://discord.com/channels/697535391594446898/697539289042649190 reach out to me on Discord: tbenr#2593

tbenr avatar Sep 15 '22 16:09 tbenr

I'd be somewhat cautious about starting from a fresh checkpoint sync for a couple of reasons. It will need to download the historic blocks in the background, which increases disk IO (and load on other peers in the network, which can add up if a lot of nodes do this resync at once). It also means that initially you'll have a much smaller database, but it will gradually grow back to the same size, so it may be a temporary solution and then actually make things worse as the database gets close to full size while the historic blocks are still being downloaded.

ajsutton avatar Sep 15 '22 21:09 ajsutton

Running into the same problem, post-merge. No issues prior.

Running: ethereum/client-go:v1.10.23 consensys/teku:22.9.0

arajakul avatar Sep 15 '22 22:09 arajakul

So, post db-re-do and adding --eth1-deposit-contract-max-request-size=1000 to the Teku config, this particular issue has gone away. Not sure if, like Adrian says, it's a temporary fix until the db grows, though.
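
In case it helps anyone else, in the yaml config that flag maps to the key without the leading dashes (my best guess at the minimal line - check it against your own config):

eth1-deposit-contract-max-request-size: 1000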

I have a host of other issues, all stemming from Besu, however, that the team is aware of (timeouts and empty blocks).

EvilJordan avatar Sep 15 '22 22:09 EvilJordan

Not seeing any of the future slot logs nor any other warnings/errors, but am seeing reduced attestation performance. Erigon is my EL client.

RatanRSur avatar Sep 15 '22 22:09 RatanRSur

Tried --eth1-deposit-contract-max-request-size=1000 as above and it made the problem MUCH worse - more than half of attestations hit the "future slot" issue. Edit: now living dangerously, running compiled from HEAD. Will report if I still see "future slot" exceptions. Edit 2: yup, still happening.

back-87 avatar Sep 16 '22 00:09 back-87

@back-87 If you really want to live dangerously, and since you're already running HEAD, we have been working on an optimisation to take database updates off the critical path and we're in the process of evaluating whether that will help in this case. You can enable it with --Xdata-storage-async-enabled. Please make sure you have slashing protection enabled - by not waiting for database updates to be written to disk, it's possible that after an unclean shutdown you'll start up with the beacon chain having effectively rolled back a few blocks, and that may lead to surround votes. Slashing protection will keep you safe and is definitely always required, but it's just that tiny bit more likely to be needed with this flag enabled.
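
For reference, assuming the flag name is unchanged on current HEAD, that's either the CLI flag or the equivalent yaml key:

on the command line:   --Xdata-storage-async-enabled=true
or in the yaml config: Xdata-storage-async-enabled: true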

And for anyone reading on, this is definitely not a recommended thing to do unless you're confident in your ability to rollback Teku and potentially do a fresh resync of the database (checkpoint sync is fine). But if you do try it please do let us know how it works for you.

ajsutton avatar Sep 16 '22 01:09 ajsutton

@ajsutton have Xdata-storage-async-enabled: true in the yaml. Didn't see anything logged to indicate it's active, but assuming it is. Will let you know if I see any further misses due to "future slot". I certainly do have slashing protection enabled.

back-87 avatar Sep 16 '22 02:09 back-87

@ajsutton 3 hours and not a single instance; I was getting misses due to this error more regularly than that before enabling Xdata-storage-async-enabled. I'll update in the morning (ET) to be certain.

back-87 avatar Sep 16 '22 05:09 back-87

Great to hear. Thank you for your help testing. We've been seeing good results on our nodes too and have a couple of other improvements we've just merged as well. Continuing to monitor as we don't want to rush something out that then causes people more issues.

ajsutton avatar Sep 16 '22 05:09 ajsutton

@ajsutton Confirmed, over 12 hours and not a single "future slot" exception. That arg certainly makes a significant positive difference for this issue. I'll leave it active until this issue is resolved and I've updated to a commit containing the fix for it.

Thanks to the Teku team for all the hard work!

As this seems to be related to disk IO, I tried to do some A/B testing (with Xdata-storage-async-enabled and without) using iostat, polling every second, and I didn't see any huge differences between the two. I certainly see a difference in the frequency of this issue (now zero). Disk bottleneck issues would make sense given I'm running an old, cheap Adata SX8200 2TB. This has me considering putting the first fees from proposals towards an upgrade! Cheers
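
For anyone who wants to repeat the comparison, this is the sort of iostat invocation I mean (extended per-device stats at a 1-second interval; pick out your Teku data disk in the output):

iostat -x 1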

back-87 avatar Sep 16 '22 14:09 back-87

@ajsutton I know a fix is in the RC, but just to increase confidence: 24 hours now and not a single miss due to a "future slot" exception. Before, the longest gap between them was about 5 hours, and usually there were a few per hour.

The hypothesis is on point.

I'm thinking of recompiling from HEAD and removing that arg (sounds like RC has the fix based on discord messages).

Well done!

back-87 avatar Sep 17 '22 03:09 back-87

Yep, the latest head has that arg defaulted to on (the arg still exists so it can be toggled off if need be, but we'll remove it soon).

Thanks for your help testing this and giving feedback. Very glad it's helping.

ajsutton avatar Sep 17 '22 03:09 ajsutton

This solved my missed attestations as well. Thanks!

W1337lt avatar Sep 18 '22 17:09 W1337lt

Closing since most of the problems have been solved by async storage.

tbenr avatar Mar 31 '23 07:03 tbenr