Unable to produce sync committee messages
Description
Teku was unable to produce sync committee messages. This was split out from https://github.com/hyperledger/besu/issues/4399
From the related Besu ticket:
Ok, so the key thing is that when the VC was trying to create the sync signature, the beacon node had already imported a block for a later slot (the VC and BN are separate components even when run in the same process), so it gets:
Chain head has advanced beyond slot
It's not entirely clear why that happened, but the logs show the validator client wound up catching up and producing sync messages correctly again. That may be tied up with some of the performance issues we've seen with Teku and are still investigating, though these aren't the normal symptoms we get. We'll have to dig into it in more depth to know exactly what happened, but at this stage I don't think it's a Besu issue and we'll probably pick this up from the Teku side to investigate further.
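For context on the split between the two components, a minimal sketch of the kind of guard being described follows. The class, field, and slot numbers are made up for illustration; this is not Teku's actual implementation.

```java
// Schematic sketch only (not Teku's real code): the duty targets a specific
// slot, and if the beacon node has already imported a block for a later slot
// by the time the validator client tries to sign, the duty is abandoned with
// the error seen in the logs.
public class SyncDutyGuardSketch {

  // Hypothetical beacon-node view of the chain head (slot numbers illustrative).
  static long chainHeadSlot = 4_702_210L;

  static void produceSyncCommitteeMessage(long dutySlot) {
    if (chainHeadSlot > dutySlot) {
      // Corresponds to "Chain head has advanced beyond slot": the VC fell
      // behind and the head legitimately moved past the duty slot.
      throw new IllegalStateException(
          "Chain head has advanced beyond slot " + dutySlot + " (head at slot " + chainHeadSlot + ")");
    }
    // ...otherwise sign the current head block root and submit the message...
  }

  public static void main(String[] args) {
    try {
      produceSyncCommitteeMessage(4_702_208L); // duty slot is already behind the imported head
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```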
Just to be clear on the timeframe from the Teku logs sent (all times Eastern, UTC-04:00):
- ~8:00 AM: restart due to Merge error
- ~10:30 AM: stop Teku & Besu to sync Geth
- ~12:15 PM: start Teku, now using Geth
- ~EOL: currently utilizing Geth
Teku logs: teku_sync_committee_errors.log
What was wrong:
There were sync committee assignments, but no SyncCommitteeMessages were produced for a few hours:
2022-09-15 10:08:23.933-04:00 | SlotEventsChannel-10 | INFO | teku-status-log | Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:14:47.980-04:00 | SlotEventsChannel-10 | INFO | teku-status-log | Sync committee performance: epoch 146945, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:21:12.154-04:00 | SlotEventsChannel-10 | INFO | teku-status-log | Sync committee performance: epoch 146946, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:27:36.029-04:00 | SlotEventsChannel-10 | INFO | teku-status-log | Sync committee performance: epoch 146947, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:33:59.951-04:00 | SlotEventsChannel-10 | INFO | teku-status-log | Sync committee performance: epoch 146948, expected 32, produced 0, correct 0, included 0 (0%)
At the same time there were no errors in the validator log linked with this issue. Attestations and aggregates were correctly produced, so it's not a case of the head being behind. I'm investigating possible reasons for this behavior. My ideas:
- validation in SyncCommitteeMessageValidator.validate() comes to ignore or reject. I've sorted out all checks in validation, it shouldn't fail. Had an idea about a time discrepancy, but the Besu log shows that all blocks are timed correctly with the network
- PostSyncCommittees was called but returned OK while not producing SyncCommitteeMessages - I've not found a way this could happen outside of validation
- PostSyncCommittees was not called for some reason, or was called with an empty list
- ChainHeadTracker is used only for SyncCommitteeMessages - I don't see where it could fail silently
- forkInfo in SyncCommitteeProductionDuty is from a previous fork. It should fail for attestations too in this case
- assignments were not empty, as the client posted expected
Here I'm stuck. I'm out of ideas.
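To keep the hypotheses above straight, here is a very simplified sketch of the path they refer to (assignments -> duty -> chain head -> messages -> POST). All types and method signatures below are schematic stand-ins, not Teku's real classes or its beacon API client.

```java
// Schematic view of the production path the hypotheses walk through.
// If assignments are non-empty but production returns an empty list, or the
// POST is never reached, the result is exactly the silent
// "expected 32, produced 0" picture with no errors logged.
import java.util.List;
import java.util.Optional;

public class SyncCommitteeFlowSketch {

  record Assignment(long validatorIndex, int committeeIndex) {}
  record SyncCommitteeMessage(long slot, String beaconBlockRoot, long validatorIndex) {}

  interface ChainHeadSource {            // stand-in for the "ChainHeadTracker" role
    Optional<String> getHeadBlockRoot(long slot);
  }

  interface BeaconNodeClient {           // stand-in for the PostSyncCommittees call
    void postSyncCommitteeMessages(List<SyncCommitteeMessage> messages);
  }

  static List<SyncCommitteeMessage> produce(
      long slot, List<Assignment> assignments, ChainHeadSource head) {
    Optional<String> root = head.getHeadBlockRoot(slot);
    if (root.isEmpty()) {
      return List.of(); // a possible silent failure point: no head, nothing produced, no error
    }
    return assignments.stream()
        .map(a -> new SyncCommitteeMessage(slot, root.get(), a.validatorIndex()))
        .toList();
  }

  public static void main(String[] args) {
    ChainHeadSource head = slot -> Optional.of("0xabc"); // stubbed head source
    List<SyncCommitteeMessage> msgs =
        produce(4_702_208L, List.of(new Assignment(123, 0)), head);
    System.out.println("produced " + msgs.size() + " message(s)"); // prints 1
  }
}
```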
Referencing #6345 in case it might be related.
Yeah, I saw this one but it wasn't silent. We have client logs for this issue and there are no warnings in them.
Seeing this issue after upgrading to v22.10.2 and using Teku <> Geth
@tiagocmachado no errors and similar performance?
Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)
No errors, but also not 0% performance. Around 70%, although before upgrading it was around 100%.
It looks like we have 2 issues here, following the initial report and the follow-up:
- sync committee duties are assigned but messages are not produced with 0 errors
- sync committee messages are not produced with error "Chain head has advanced beyond slot"
The latter happened when the EL was changed from Besu to Geth, and there could have been some initial CL-EL misalignment which settled by itself within an epoch, so I think the first one is more critical.
@tiagocmachado no errors and similar performance?
Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)
No errors, but also not 0% performance. Around 70%, although before upgrading it was around 100%.
Ah ok, so it's definitely not expected to have 100% sync committee performance and it can be quite variable. Sync committee messages can only be included in the next block and only if they have the correct head, so it's the equivalent of requiring absolutely perfect attestation performance. If a block is published late or there's an empty slot you're 100% guaranteed to miss that sync committee message and there's nothing you can do about it. You also produce a sync committee message every slot so the performance is going to be lower in the epoch after you restart and may be particularly variable for a while after that while your node finds the best peers etc.
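A quick back-of-the-envelope example of that point, with made-up miss counts purely to show the arithmetic:

```java
// Illustration of the point above: a sync committee member produces one
// message per slot (32 per epoch), and each message can only land in the very
// next block, with the correct head. The miss counts below are hypothetical.
public class SyncPerformanceSketch {
  public static void main(String[] args) {
    int expectedPerEpoch = 32;      // one message per slot of the epoch
    int emptyOrLateNextSlots = 3;   // hypothetical: 3 next-blocks missing or published late
    int wrongHeadMessages = 2;      // hypothetical: 2 messages signed a head that was reorged out

    int included = expectedPerEpoch - emptyOrLateNextSlots - wrongHeadMessages;
    double performance = 100.0 * included / expectedPerEpoch;

    // Prints: included 27 of 32 -> 84.4%, even though the validator did everything right
    System.out.printf("included %d of %d -> %.1f%%%n", included, expectedPerEpoch, performance);
  }
}
```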
This issue is specifically about having no sync committee messages even produced which suggests the validator client's duty scheduling has gone wrong somewhere.
And looking at the initial logs where we get the "Chain head has advanced beyond slot" messages, those messages are in fact being produced after the slot in question has completed. So something was happening that caused the VC to fall behind processing its duties, and the chain head had legitimately advanced before the duty could be performed. It gradually caught up and eventually resumed processing duties successfully. Certainly not ideal to fall behind like that, but given the performance issues that were happening just after the merge it's not entirely unexpected, and significant optimisations have gone in since.