
Unable to produce sync committee messages

Open • jframe opened this issue 2 years ago • 2 comments

Description

Teku was unable to produce sync committee messages. This was split out from https://github.com/hyperledger/besu/issues/4399

From the related Besu ticket

Ok, so the key thing is that when the VC was trying to create the sync signature, the beacon node had already imported a block for a later slot (the VC and BN are separate components even when run in the same process), so it gets:

Chain head has advanced beyond slot

It's not entirely clear why that happened, but the logs show the validator client wound up catching up and producing sync messages correctly again. That may be tied up with some of the performance issues we've seen with Teku and are still investigating, though it's not the normal symptoms we get. We'll have to dig into it in more depth to know exactly what happened, but at this stage I don't think it's a Besu issue and we'll probably pick this up from the Teku side to investigate further.
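
To illustrate the failure mode, here is a minimal sketch (hypothetical class and method names, not Teku's actual code) of the kind of guard that produces this error: the beacon node compares the slot the validator client wants to sign for against its current chain head slot and bails out if the head has already moved past it.

```java
import java.util.Optional;

// Hypothetical sketch (not Teku's actual classes): how a beacon node might
// refuse a sync committee message request once its head has moved past the
// requested slot, producing an error like "Chain head has advanced beyond slot".
final class SyncMessageRequestGuard {

  /** Returns the block root to sign, or empty if the head is already ahead. */
  static Optional<String> blockRootToSign(long requestedSlot, long chainHeadSlot, String headRoot) {
    if (chainHeadSlot > requestedSlot) {
      // The VC asked for slot N, but the BN has already imported a block for a later slot.
      System.err.println("Chain head has advanced beyond slot " + requestedSlot
          + " (head is at slot " + chainHeadSlot + ")");
      return Optional.empty();
    }
    return Optional.of(headRoot);
  }

  public static void main(String[] args) {
    // The VC is late: it wants to sign for slot 4_702_208 but the head is at 4_702_209.
    System.out.println(blockRootToSign(4_702_208L, 4_702_209L, "0xabc..."));
  }
}
```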

jframe avatar Sep 16 '22 01:09 jframe

Just to be clear on the timeframe from the Teku logs sent:

All times Eastern (UTC-04:00)

  • ~8:00 AM: restart due to Merge error
  • ~10:30 AM: stop Teku & Besu to sync Geth
  • ~12:15 PM: start Teku, now using Geth
  • ~EOL: currently utilizing Geth

Propulsions avatar Sep 16 '22 01:09 Propulsions

What was wrong: there were sync committee assignments, but no SyncCommitteeMessages were produced for a few hours:

2022-09-15 10:08:23.933-04:00 | SlotEventsChannel-10 | INFO  | teku-status-log | Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:14:47.980-04:00 | SlotEventsChannel-10 | INFO  | teku-status-log | Sync committee performance: epoch 146945, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:21:12.154-04:00 | SlotEventsChannel-10 | INFO  | teku-status-log | Sync committee performance: epoch 146946, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:27:36.029-04:00 | SlotEventsChannel-10 | INFO  | teku-status-log | Sync committee performance: epoch 146947, expected 32, produced 0, correct 0, included 0 (0%)
2022-09-15 10:33:59.951-04:00 | SlotEventsChannel-10 | INFO  | teku-status-log | Sync committee performance: epoch 146948, expected 32, produced 0, correct 0, included 0 (0%)
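
For reference, a minimal sketch (hypothetical, not Teku's implementation) of how such a summary line could be derived from per-epoch counters, assuming the reported percentage is included/expected:

```java
// Minimal sketch (hypothetical, not Teku's code): deriving the summary line
// "expected 32, produced 0, correct 0, included 0 (0%)" from per-epoch counters.
record SyncCommitteePerformance(long epoch, int expected, int produced, int correct, int included) {

  String summary() {
    // Assumption: the reported percentage is included/expected; with 0 produced it is always 0%.
    int percent = expected == 0 ? 0 : (int) Math.round(100.0 * included / expected);
    return String.format(
        "Sync committee performance: epoch %d, expected %d, produced %d, correct %d, included %d (%d%%)",
        epoch, expected, produced, correct, included, percent);
  }

  public static void main(String[] args) {
    System.out.println(new SyncCommitteePerformance(146944, 32, 0, 0, 0).summary());
  }
}
```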

At the same time there were no errors in the validator log linked with this issue. Attestations and aggregates were correctly produced, so it's not a case of the head being behind. Investigating possible reasons for this behaviour. My ideas:

  • validation in SyncCommitteeMessageValidator.validate() comes back as ignore or reject. I've gone through all the checks in validation and it shouldn't fail. Had an idea about a time discrepancy, but the Besu log shows that all blocks are timed correctly with the network
  • PostSyncCommittees was called and returned OK, but no SyncCommitteeMessages were produced - I haven't found a way this could happen outside of validation
  • PostSyncCommittees was not called for some reason, or was called with an empty list (see the sketch after this list)
    • ChainHeadTracker is used only for SyncCommitteeMessages - I don't see where it could fail silently
    • forkInfo in SyncCommitteeProductionDuty is from the previous fork. In that case it should fail for attestations too
    • assignments were not empty, as the client posted the expected duties
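
For context on the PostSyncCommittees hypothesis above, here is a sketch of the validator-client-side call using the standard Beacon API endpoint POST /eth/v1/beacon/pool/sync_committees. The URL, slot, validator index, root, and signature values are placeholders, and Teku's default REST port 5051 is assumed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of the VC-side call the "PostSyncCommittees" hypothesis refers to,
// using the public beacon-APIs endpoint POST /eth/v1/beacon/pool/sync_committees.
// All concrete values below are placeholders.
public class PostSyncCommitteesExample {
  public static void main(String[] args) throws Exception {
    // If the VC posts an empty list (the third hypothesis above), the BN can
    // return 200 OK even though no messages ever reach the gossip network.
    String body = """
        [
          {
            "slot": "4702208",
            "beacon_block_root": "0x...",
            "validator_index": "123456",
            "signature": "0x..."
          }
        ]
        """;

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:5051/eth/v1/beacon/pool/sync_committees"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```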

Here I'm stuck. I'm out of ideas.

zilm13 avatar Oct 26 '22 14:10 zilm13

Referencing #6345 in case it might be related.

benjaminion avatar Oct 27 '22 07:10 benjaminion

Yeah, I saw this one, but it wasn't silent. We have client logs for this issue and there are no warnings in them.

zilm13 avatar Oct 27 '22 12:10 zilm13

Seeing this issue after upgrading to v22.10.2 and using Teku <> Geth

tiagocmachado avatar Oct 28 '22 13:10 tiagocmachado

@tiagocmachado no errors and similar performance? Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)

zilm13 avatar Oct 28 '22 13:10 zilm13

@tiagocmachado no errors and similar performance? Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)

No errors, but not 0% performance either. Around 70%, although before upgrading it was around 100%.

tiagocmachado avatar Oct 28 '22 13:10 tiagocmachado

It looks like we have two issues here, following the initial report and the follow-up:

  • sync committee duties are assigned but messages are not produced with 0 errors
  • sync committee messages are not produced with error "Chain head has advanced beyond slot"

The latter happened when the EL was changed from Besu to Geth, and there could have been some initial CL-EL misalignment that settled by itself within an epoch, so I think the first one is more critical.

zilm13 avatar Oct 28 '22 16:10 zilm13

@tiagocmachado no errors and similar performance? Sync committee performance: epoch 146944, expected 32, produced 0, correct 0, included 0 (0%)

No errors, but not 0% performance either. Around 70%, although before upgrading it was around 100%.

Ah ok, so it's definitely not expected to have 100% sync committee performance, and it can be quite variable. Sync committee messages can only be included in the next block, and only if they have the correct head, so it's the equivalent of requiring absolutely perfect attestation performance. If a block is published late or there's an empty slot, you're 100% guaranteed to miss that sync committee message and there's nothing you can do about it. You also produce a sync committee message every slot, so performance is going to be lower in the epoch after you restart and may be particularly variable for a while after that while your node finds the best peers, etc.
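
A toy sketch (not Teku's code) of that scoring condition, assuming a message for slot N only counts if the block at slot N+1 exists and builds on the root the message signed:

```java
// Toy sketch (not Teku's code) of why a sync committee message can be "produced"
// but never "correct"/"included": the message signs the head root seen at slot N,
// and it only scores if the block at slot N+1 exists and builds on that same root.
record SyncMessage(long slot, String signedBlockRoot) {}
record Block(long slot, String root, String parentRoot) {}

class SyncMessageScoring {
  /** True only if a block exists at slot+1 and its parent is the root we signed. */
  static boolean countsAsCorrect(SyncMessage msg, Block nextBlock) {
    return nextBlock != null
        && nextBlock.slot() == msg.slot() + 1
        && nextBlock.parentRoot().equals(msg.signedBlockRoot());
  }

  public static void main(String[] args) {
    SyncMessage msg = new SyncMessage(100, "0xaaa");
    // Empty slot 101: guaranteed miss, nothing the validator could have done.
    System.out.println(countsAsCorrect(msg, null));                              // false
    // Block 101 built on a different (late-arriving) head: also a miss.
    System.out.println(countsAsCorrect(msg, new Block(101, "0xccc", "0xbbb")));  // false
    // Block 101 built on the root we signed: counted.
    System.out.println(countsAsCorrect(msg, new Block(101, "0xbbb", "0xaaa")));  // true
  }
}
```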

This issue is specifically about having no sync committee messages even produced which suggests the validator client's duty scheduling has gone wrong somewhere.

ajsutton avatar Nov 14 '22 01:11 ajsutton

And looking at the initial logs where we get the "Chain head has advanced beyond slot" messages, those messages are in fact being produced after the slot in question has completed. So something was happening that caused the VC to fall behind processing its duties, and the chain head had legitimately advanced before the duty could be performed. It gradually caught up and eventually resumed processing duties successfully. It's certainly not ideal to fall behind like that, but given the performance issues that were happening just after the merge it's not entirely unexpected, and significant optimisations have gone in since.

ajsutton avatar Nov 14 '22 01:11 ajsutton