nimbus-eth2
poor validator performance when bad or viable nodes present
When multiple beacon nodes are configured for the validator client, validator performance drops noticeably if one of the nodes is 'bad' (offline) or 'viable' (syncing)
To Reproduce
Steps to reproduce the behavior:
- Platform details (OS, architecture): x86_64 Ubuntu with Docker
- Branch/commit used: statusim/nimbus-validator-client:multiarch-v25.3.1
- Commands being executed: '...'
- Relevant log lines: '...'
Could you please elaborate more on this issue? Is it possible to get some metrics and/or logs before and after?
I believe the big dip shown below was from 1 of 3 nodes connected to the validator going offline.
I've also seen similar issues when resyncing 1 of 3 connected nodes. In that case, removing the resyncing node and running with the remaining 2 nodes until the resync completed restored performance.
I'm seeing a similar issue with the Nimbus VC (25.5.0) connected to 2 BNs, Nimbus (25.5.0) and the new Teku BN (also 25.5.0 :sweat_smile:)
Teku changed how its BN communicates in the new version, and it killed the performance of my VC because the VC was constantly trying to connect to the Teku BN, even though the Nimbus BN was working fine.
I'm seeing something really similar here, though I'm not entirely confident it's the exact same root cause. My Nimbus setup involves Nimbus running on both the main and fallback nodes. When I reboot the fallback node, my main node misses attestations, even though its beacon is good.
https://github.com/status-im/nimbus-eth2/releases/tag/v25.7.0 restores Teku BN compatibility.
Also, https://github.com/status-im/nimbus-eth2/pull/7276 might improve this in general.
If one of two all-roles BNs has a high-latency connection to the VC, performance is worsened compared to using a single low latency BN.
This makes a "fallback" BN detrimental in some circumstances - such as a user with both a home-based BN and a hosted BN elsewhere on the globe.
This is easy to verify in the VC logs, and it still occurs with 25.7.0.
Steps to reproduce:
- Configure the Nimbus VC with two BNs, one of which has high connection latency (e.g. >250 ms)
- Check the VC logs: both BNs will now be reported as having significant time skew associated with them (probably due to the Nimbus VC waiting for responses from the high-latency BN before proceeding with duties)
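To illustrate why the hypothesized behavior would hurt: if a client waits for responses from *all* configured upstreams before proceeding, its effective latency is that of the slowest one, whereas racing them and taking the first response is bounded by the fastest. This is a self-contained simulation with made-up delays, not Nimbus code:

```python
import asyncio
import time

# Simulated BN query: just sleeps for the given latency.
# The numbers below (10 ms local, 250 ms remote) are hypothetical.
async def query_bn(latency_s: float) -> float:
    await asyncio.sleep(latency_s)
    return latency_s

# Strategy 1: wait for every BN before proceeding (slowest wins).
async def wait_for_all(latencies):
    start = time.monotonic()
    await asyncio.gather(*(query_bn(l) for l in latencies))
    return time.monotonic() - start

# Strategy 2: race the BNs and take the first response (fastest wins).
async def race_first(latencies):
    start = time.monotonic()
    tasks = [asyncio.create_task(query_bn(l)) for l in latencies]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return time.monotonic() - start

latencies = [0.01, 0.25]  # local BN vs. high-latency remote BN
slow = asyncio.run(wait_for_all(latencies))
fast = asyncio.run(race_first(latencies))
print(f"wait-for-all: {slow:.2f}s, first-response: {fast:.2f}s")
```

With a 10 ms and a 250 ms BN, the wait-for-all strategy takes roughly 250 ms per round trip while the racing strategy stays near 10 ms, matching the observation that adding a high-latency fallback makes things worse than a single low-latency BN.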
Suggested change: implement an additional BN role that allows the user to specify "use this BN for publish roles only, unless no other BN is usable".
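For context, my understanding from the Nimbus book is that the VC already supports restricting per-BN roles via the URL anchor of `--beacon-node`; the suggestion above would add a conditional "only if nothing else is usable" behavior on top of that. A hedged sketch of the existing syntax (role names per my reading of the docs; please verify against the current Nimbus book before relying on it):

```shell
# Hypothetical example: keep the local BN for all roles, and restrict
# the remote high-latency BN to publish duties via the URL anchor.
# The anchor/role syntax here is my recollection of the Nimbus VC docs,
# not verified against this exact release.
nimbus_validator_client \
  --beacon-node=http://127.0.0.1:5052 \
  --beacon-node=http://remote.example.org:5052/#publish
```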
I think the issue of bad performance with a 'viable' (syncing) node is fixed in 25.7.0
> I think the issue of bad performance with a 'viable' (syncing) node is fixed in 25.7.0
I take that back