lighthouse
lighthouse copied to clipboard
Consider reducing the frequency of pending validator indices queries from validator client
Description
Currently the Validator Client (VC) polls the validator states for all inactive validators from the Beacon Node (for indices retrieval), once every slot (12 seconds on mainnet)
https://github.com/sigp/lighthouse/blob/f16795183564a8487c32890f511a24d6abac82e4/validator_client/src/duties_service.rs#L279-L292
This is mostly not an issue until the number of inactive validator validators reaches a large number ~1000, which is probably quite rare. However we've recently seen some performance issues when the endpoint beacon/states/{state_id}/validators/{validator_id} is called repeatedly in a short period of time. Here's a script created by @michaelsproul to spam this endpoint aggressively, and it turns out this could cause an OOM on the beacon node.
There has been some discussions on how to improve this, potentially queuing the requests, however they might take a while to implement. In the mean time, we can probably reduce the frequency of this query to once or twice per epoch to reduce the performance impact on the node, as validator activation only happens once every epoch, and it may not be necessary to query the indices so often.
One way to reduce the frequency of polling would be to make an assumption that the VC only needs to know a validator index for an active validator.
Validators are added to the BeaconState via Deposit objects in a BeaconBlock. This happens on a per slot basis, so new validators can land in the BeaconState (i.e. be assigned an index) each slot. However, validators are activated on a per epoch basis (process_epoch > process_registry_updates).
So, it seems that it would be safe for us to:
- Poll for all validator indices on startup.
- When we've received a
Noneresponse for a validator, poll each epoch.- I think we should consider not polling in the first slot of the epoch. That slot is known to be a weak point for us. Validators are activated at least
1 + MAX_SEED_LOOKAHEAD == 5epochs after they first appear in theBeaconStateso there's no rush to discover the validator.
- I think we should consider not polling in the first slot of the epoch. That slot is known to be a weak point for us. Validators are activated at least
There's an edge-case here where a syncing BN might return None, then import several epochs of the chain and then return Some and that validator is already active. In per-slot polling we would resolve this within a slot, but with per-epoch processing it takes an epoch to resolve. I'm not sure how I feel about this edge-case, it's hard to weigh off the costs of optimising for weird sync cases vs optimising for the case of big operators running lots of undeposited validators.
Thanks Paul! As discussed the above edge case is quite unlikely and it's a much more common scenario for validators to get their infrastructure running before even making deposits. So the proposal above sounds like a good way to go 👍
What's the UX benefit of discovering the deposit inclusion immediately on the next slot? You won't be activated until your deposit is finalized anyway, there's nothing useful the VC can do besides logging
What's the UX benefit of discovering the deposit inclusion immediately on the next slot?
That's my argument here: https://github.com/sigp/lighthouse/issues/4388#issuecomment-2054233516
I think per-slot discovery has some benefits in very rare edge cases (e.g., syncing several epochs of the canonical chain and then switching to it). However that's very low impact for the operator and insignificant for the chain as a whole, so I'd say we let the edge-case suffer.
Closed by #5628