
Consider reducing the frequency of pending validator indices queries from validator client

Open jimmygchen opened this issue 2 years ago • 4 comments

Description

Currently, the Validator Client (VC) polls the Beacon Node once every slot (12 seconds on mainnet) for the validator states of all inactive validators, in order to retrieve their indices:

https://github.com/sigp/lighthouse/blob/f16795183564a8487c32890f511a24d6abac82e4/validator_client/src/duties_service.rs#L279-L292

This is mostly not an issue until the number of inactive validators becomes large (~1000), which is probably quite rare. However, we've recently seen some performance issues when the endpoint beacon/states/{state_id}/validators/{validator_id} is called repeatedly in a short period of time. Here's a script created by @michaelsproul to spam this endpoint aggressively, and it turns out this could cause an OOM on the beacon node.

There have been some discussions on how to improve this, potentially by queuing the requests, but those changes might take a while to implement. In the meantime, we can probably reduce the frequency of this query to once or twice per epoch to reduce the performance impact on the node: validator activation only happens once every epoch, so it may not be necessary to query the indices so often.
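To put rough numbers on the saving, here is a minimal sketch. It assumes one request per inactive validator per poll, which is a simplification for illustration; `index_queries_per_epoch` is a hypothetical helper, not a function from the Lighthouse codebase.

```rust
/// Slots per epoch on mainnet.
const SLOTS_PER_EPOCH: u64 = 32;

/// Number of index-lookup requests sent to the beacon node in one epoch.
/// Assumption (for illustration only): each poll issues one request per
/// inactive validator.
fn index_queries_per_epoch(inactive_validators: u64, polls_per_epoch: u64) -> u64 {
    inactive_validators * polls_per_epoch
}
```

Under that assumption, 1000 inactive validators polled every slot cost `index_queries_per_epoch(1000, SLOTS_PER_EPOCH)` = 32,000 requests per epoch, against 1,000 or 2,000 when polling once or twice per epoch.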

jimmygchen avatar Jun 09 '23 13:06 jimmygchen

One way to reduce the frequency of polling would be to make an assumption that the VC only needs to know a validator index for an active validator.

Validators are added to the BeaconState via Deposit objects in a BeaconBlock. This happens on a per slot basis, so new validators can land in the BeaconState (i.e. be assigned an index) each slot. However, validators are activated on a per epoch basis (process_epoch > process_registry_updates).

So, it seems that it would be safe for us to:

  1. Poll for all validator indices on startup.
  2. When we've received a None response for a validator, poll each epoch.
    • I think we should consider not polling in the first slot of the epoch. That slot is known to be a weak point for us. Validators are activated at least 1 + MAX_SEED_LOOKAHEAD == 5 epochs after they first appear in the BeaconState so there's no rush to discover the validator.
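The schedule proposed above could be sketched roughly as follows. This is only an illustration: `should_poll_unknown_indices`, `earliest_activation_epoch` and the `poll_offset` parameter are hypothetical names, not the ones used in the VC, and the value `MAX_SEED_LOOKAHEAD = 4` is taken from the "== 5 epochs" figure above.

```rust
/// Slots per epoch on mainnet.
const SLOTS_PER_EPOCH: u64 = 32;
/// Implied by the `1 + MAX_SEED_LOOKAHEAD == 5` figure quoted above.
const MAX_SEED_LOOKAHEAD: u64 = 4;

/// Hypothetical helper: should the VC poll for still-unknown validator
/// indices at `slot`?
///
/// - On startup, always poll.
/// - Otherwise poll once per epoch, at `poll_offset` slots into the epoch.
///   The offset is clamped to 1..=SLOTS_PER_EPOCH - 1 so that the first
///   slot of the epoch is never used.
fn should_poll_unknown_indices(slot: u64, is_startup: bool, poll_offset: u64) -> bool {
    if is_startup {
        return true;
    }
    let offset = poll_offset.clamp(1, SLOTS_PER_EPOCH - 1);
    slot % SLOTS_PER_EPOCH == offset
}

/// Earliest epoch at which a validator first seen in the state during
/// `epoch_first_seen` could become active, per the bound quoted above.
fn earliest_activation_epoch(epoch_first_seen: u64) -> u64 {
    epoch_first_seen + 1 + MAX_SEED_LOOKAHEAD
}
```

With `poll_offset = 16`, for example, the VC would poll at slots 16, 48, 80 and so on, and a validator first seen in epoch 10 could not be active before epoch 15, which leaves several polls before the index is actually needed.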

There's an edge case here where a syncing BN might return None, then import several epochs of the chain and then return Some, by which point the validator is already active. With per-slot polling we would resolve this within a slot, but with per-epoch polling it takes up to an epoch to resolve. I'm not sure how I feel about this edge case; it's hard to weigh the cost of optimising for weird sync cases against optimising for big operators running lots of undeposited validators.
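For a sense of scale, the worst-case delay in this edge case can be worked out from the mainnet timing constants. The helper below is a hypothetical sketch, not code from the VC.

```rust
/// Mainnet timing constants.
const SECONDS_PER_SLOT: u64 = 12;
const SLOTS_PER_EPOCH: u64 = 32;

/// Upper bound, in seconds, on how long an already-active validator could
/// go undiscovered when the VC polls every `poll_interval_slots` slots.
fn worst_case_discovery_delay_secs(poll_interval_slots: u64) -> u64 {
    poll_interval_slots * SECONDS_PER_SLOT
}
```

Per-slot polling gives `worst_case_discovery_delay_secs(1)` = 12 seconds, while per-epoch polling gives `worst_case_discovery_delay_secs(SLOTS_PER_EPOCH)` = 384 seconds, or about 6.4 minutes of missed duties in the worst case.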

paulhauner avatar Apr 15 '24 00:04 paulhauner

Thanks Paul! As discussed, the above edge case is quite unlikely, and it's a much more common scenario for validators to get their infrastructure running before even making deposits. So the proposal above sounds like a good way to go 👍

jimmygchen avatar Apr 15 '24 05:04 jimmygchen

What's the UX benefit of discovering the deposit inclusion immediately on the next slot? You won't be activated until your deposit is finalized anyway, so there's nothing useful the VC can do besides logging.

dapplion avatar Apr 15 '24 05:04 dapplion

What's the UX benefit of discovering the deposit inclusion immediately on the next slot?

That's my argument here: https://github.com/sigp/lighthouse/issues/4388#issuecomment-2054233516

I think per-slot discovery has some benefits in very rare edge cases (e.g., syncing several epochs of the canonical chain and then switching to it). However, that's very low impact for the operator and insignificant for the chain as a whole, so I'd say we let the edge case suffer.

paulhauner avatar Apr 15 '24 05:04 paulhauner

Closed by #5628

chong-he avatar May 26 '24 23:05 chong-he