nimbus-eth2

Low sync committee performance with Vouch

Open yorickdowne opened this issue 1 year ago • 14 comments

Describe the bug

With Nimbus 23.4.0 and Vouch 1.7.5, sync committee performance on mainnet was consistently in the 90% range. Replacing Nimbus with Teku resolved it.

Testing this on mainnet is a no-go, but: I have 1k validators on Goerli with Vouch. While Goerli has its own issues, maybe that's good enough to find the cause? This one runs Nimbus 23.6.1.

If so, we can use this ticket to sleuth a bit.

To Reproduce

Run Vouch with multiple CL:EL pairs, one of them being Nimbus:something.
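For illustration, a minimal sketch of what such a Vouch setup might look like in its YAML config; `beacon-node-addresses` is, to my knowledge, Vouch's key for listing multiple consensus nodes, and the hostnames and ports here are hypothetical placeholders:

```yaml
# Hypothetical .vouch.yml excerpt: Vouch fans duties out across
# several consensus clients, one of which is Nimbus.
beacon-node-addresses:
  - teku:5051        # Teku CL (placeholder address)
  - lighthouse:5052  # Lighthouse CL (placeholder address)
  - nimbus:5052      # Nimbus CL (placeholder address)
```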

eth-goerli-consensus-1  | INF 2023-08-19 16:45:48.883+00:00 Sync committee message sent                message="(slot: 6329629, beacon_block_root: \"fa7b0eca\", validator_index: 394114, signature: \"a400b7c2\")" delay=-3s117ms56us567ns
eth-goerli-consensus-1  | INF 2023-08-19 16:45:48.883+00:00 Sync committee message sent                message="(slot: 6329629, beacon_block_root: \"fa7b0eca\", validator_index: 473561, signature: \"b555d5cb\")" delay=-3s116ms861us507ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.946+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 394012, signature: \"b20739e2\")" delay=-3s53ms631us52ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.946+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 480741, signature: \"b0d1da70\")" delay=-3s53ms185us191ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.947+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 394114, signature: \"94059343\")" delay=-3s52ms976us593ns
eth-goerli-consensus-1  | INF 2023-08-19 16:46:00.947+00:00 Sync committee message sent                message="(slot: 6329630, beacon_block_root: \"407728aa\", validator_index: 473561, signature: \"b7239556\")" delay=-3s52ms748us968ns

https://goerli.beaconcha.in/validator/473561, 76% or so
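(As an aside, the `delay` values in these logs use Nimbus's compound duration format. A minimal sketch of a hypothetical Python helper to convert them to seconds, for easier comparison across log lines:)

```python
import re

def parse_nimbus_delay(s: str) -> float:
    """Convert a Nimbus delay string like '-3s117ms56us567ns' to seconds."""
    sign = -1.0 if s.startswith("-") else 1.0
    # Each component is a number followed by a unit: s, ms, us, or ns.
    scale = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
    return sign * sum(int(v) * scale[u] for v, u in re.findall(r"(\d+)(ms|us|ns|s)", s))

print(parse_nimbus_delay("-3s117ms56us567ns"))  # ≈ -3.117056567
```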

yorickdowne avatar Aug 19 '23 16:08 yorickdowne

Will update Nimbus and turn on debug logs. Hopefully that sheds more light. If there's no clue, we can go to trace.

yorickdowne avatar Aug 19 '23 16:08 yorickdowne

Delay is better by a good second after restart

eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.695+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 394012, signature: \"8a5be233\")" delay=-2s304ms129us161ns
eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.696+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 480741, signature: \"a6c194f7\")" delay=-2s304ms54us679ns
eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.696+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 394114, signature: \"b1b8976f\")" delay=-2s303ms971us629ns
eth-goerli-consensus-1  | INF 2023-08-19 16:53:49.696+00:00 Sync committee message sent                message="(slot: 6329669, beacon_block_root: \"b8c7ba25\", validator_index: 473561, signature: \"ab2de91d\")" delay=-2s303ms921us808ns

yorickdowne avatar Aug 19 '23 16:08 yorickdowne

Initial debug logs after restart on 23.7.0

nimbus-debug.log.gz

yorickdowne avatar Aug 19 '23 16:08 yorickdowne

Checking participation on Goerli overall ... with this many blocks missing, we may not find a cause. 76% is in line with what you'd expect on a network with <80% participation.

Let me know whether you're keen to find the root cause. If so, I can run something by Izzy, suggest using 1,000 keys on mainnet as a sacrificial lamb for one sync committee period and no longer, and get you either debug or trace logs from it, whichever you prefer.

yorickdowne avatar Aug 19 '23 17:08 yorickdowne

What validator client are you using?

etan-status avatar Aug 19 '23 17:08 etan-status

Vouch 1.7.5. See first post :)

Fellow NOs opine that Vouch in particular causes Nimbus to struggle with sync committee duties, whereas Vouch causes no such issues with Teku, Lighthouse, Prysm, or Lodestar.

yorickdowne avatar Aug 19 '23 17:08 yorickdowne

Going through your validator history, this doesn't seem to be a Nimbus-specific issue.

https://goerli.beaconcha.in/validator/473561#sync

  • Yellow "Missed" entries mean the block proposer failed to propose a block. Nothing you can do.
  • Red "Missed" entries have a block, but it failed to include your signature specifically. Those are the important ones.

Red are:

  1. 6329660 --> The BN was not running, as you had just restarted it to collect debug logs
  2. 6329661 --> You had just restarted the BN; your logs show it was still looking for peers
  3. 6329662 --> Same, "No peers for topic, skipping publish"
  4. 6329726 --> Your signature was missed, but the block has only ~200 sync participation overall, compared to the usual ~400, suggesting a block proposer problem, not yours (see the API sketch below)
  5. 6329739 --> The block proposer created a block but did not include any sync committee messages; not your fault
  6. 6329761 --> Similar to the fourth one, but only ~150 participation in this block. Not specific to you.
  7. 6329799 --> Same as the fifth one; the block proposer did not include any sync messages

All the others were simply missed blocks (yellow "missed"), meaning it was not possible to get your signature in.
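(For anyone wanting to reproduce this analysis: a block's sync committee participation can be counted via the standard Beacon API. A minimal Python sketch; the node URL and slot are placeholders:)

```python
import json
import urllib.request

def sync_participation(beacon_url: str, slot: int) -> int:
    """Count set bits in a block's sync_committee_bits (out of 512 on mainnet/Goerli)."""
    with urllib.request.urlopen(f"{beacon_url}/eth/v2/beacon/blocks/{slot}") as resp:
        block = json.load(resp)
    bits = block["data"]["message"]["body"]["sync_aggregate"]["sync_committee_bits"]
    return bin(int(bits, 16)).count("1")

# e.g. sync_participation("http://localhost:5052", 6329726) should show
# ~200 for the weak block above vs ~400 for a typical block at the time.
```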

Could you please elaborate on "Replacing Nimbus with Teku resolved it."?

etan-status avatar Aug 19 '23 17:08 etan-status

You’re looking at the Goerli validator. Let’s get back to the original mainnet use case.

I have validators running for Lido. Vouch is the VC, and behind it there are three CL:EL pairs: Lighthouse:Erigon, Teku:Besu, and Nimbus:Nethermind. Sync participation was suboptimal, and I'd get alerts for it from monitoring. I asked other NOs about it and heard that while Nimbus VC to Nimbus CL works perfectly, Vouch to Nimbus CL caused issues for one other NO as well. They did not gather debug logs, however, and instead solved the issue by replacing Nimbus.

I did likewise and replaced all instances of Nimbus with Teku; all the alerts stopped and everything is working well.

This GitHub issue is me offering to work with you on figuring out the root cause. You are right that we can't do it on Goerli. So I'd need to reintroduce Nimbus on mainnet for a while, gather either debug or trace logs, and see whether the issue is still around. If so, I'd remove Nimbus again and upload the logs.

Messing with sync performance in production is a little iffy, but Lido is very interested in client diversity, so I think for a time limited test, we can do this.

yorickdowne avatar Aug 19 '23 22:08 yorickdowne

One difference in behaviour between the Nimbus VC and Vouch is that the Nimbus VC tends to wait until 4s into the slot before sending the sync committee message, which brings the delay close to 0s.

In the case of Vouch, it seems to send sync committee messages immediately when a new block is received, which the negative delays in your logs confirm. Delays are measured relative to the 4s mark: early messages show a negative delay, late ones a positive one.

Since you suggest that the Nimbus VC works fine, maybe Vouch is simply a bit too aggressive in sending its sync committee messages, while Teku perhaps delays the message a bit internally before sending it on the network. It is difficult to reason about that without logs, though. From your Goerli logs, it doesn't seem that we are too early, but it is earlier than what happens with the Nimbus VC.
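(To make the timing concrete, a minimal sketch of how such a delay works out, using mainnet constants; the slot and send time below are hypothetical:)

```python
GENESIS_TIME = 1606824023  # mainnet genesis timestamp (Goerli differs)
SECONDS_PER_SLOT = 12
DEADLINE = 4               # sync committee messages are due 4s into the slot

def sync_message_delay(slot: int, send_time: float) -> float:
    """Seconds relative to the 4s mark: negative = early, positive = late."""
    slot_start = GENESIS_TIME + slot * SECONDS_PER_SLOT
    return send_time - (slot_start + DEADLINE)

# A VC that broadcasts ~0.9s into the slot, right after the block arrives,
# shows a delay of about -3.1s, matching the Vouch logs above.
slot_start = GENESIS_TIME + 7_000_000 * SECONDS_PER_SLOT
print(sync_message_delay(7_000_000, slot_start + 0.9))  # ≈ -3.1
```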

etan-status avatar Aug 19 '23 22:08 etan-status

OK. I'll give this a whirl in the coming week and we can see how it behaves.

yorickdowne avatar Aug 19 '23 22:08 yorickdowne

Getting a temperature check from the Lido community before I kick this off: https://research.lido.fi/t/mainnet-sync-committee-performance-test/5253

yorickdowne avatar Aug 21 '23 14:08 yorickdowne

We ran into another issue instead, which tbh may be the same underlying issue. Once #5446 is fixed we will test again and update both issues.

yorickdowne avatar Oct 03 '23 15:10 yorickdowne

@yorickdowne Do you maybe have an update on this issue? Thank you.

catwith1hat avatar May 31 '24 08:05 catwith1hat

We haven't seen issues recently. I believe this is resolved.

yorickdowne avatar May 31 '24 08:05 yorickdowne