lighthouse icon indicating copy to clipboard operation
lighthouse copied to clipboard

Reduce frequency of polling unknown validators to avoid overwhelming the Beacon Node

Open jimmygchen opened this issue 1 year ago • 4 comments

Issue Addressed

Addresses #4388.

A large number of unknown validators on a VC is known to overwhelm the beacon node because the VC triggers a retrieval for each validator per slot. This PR reduces the frequency to one query per unknown validator each epoch.

Proposed Changes

  1. Poll for all validator indices on startup (unchanged)
  2. If any validator is unknown, register to poll again in the 32 slots (1 epoch) instead of the next slot.
  3. Avoid polling on the first 1st slot of epoch.

For more details and rationale, see comment here.

jimmygchen avatar Apr 23 '24 06:04 jimmygchen

I'll do a bit of manual testing next week before we merge this!

jimmygchen avatar Apr 26 '24 07:04 jimmygchen

@chong-he found an issue during testing and it looks like the VC re-queries the BN during the first slot of the epoch, I'll look into this.

jimmygchen avatar Apr 30 '24 23:04 jimmygchen

Upon investigating the logs, I think it is working as intended, and the poll in the first slot of epoch is skipped. Will wait for CK to confirm.

However, I think there's likelihood that this could happen due to the async nature of this function, in the scenario where BN returns a response really late - basically if iterating through all the validators takes close to 12 seconds, then we could be querying in a new slot, which could potentially be first slot of a new epoch.

This is because of the calculation outside the for loop: https://github.com/sigp/lighthouse/blob/ce914d191d804a765c573843c3c7448facd04ccb/validator_client/src/duties_service.rs#L483-L502

For more accuracy, I'll move the calculation into the for loop given we await on each validator query and the slot status could be inaccurate if we have a large list or slow BN.

jimmygchen avatar May 06 '24 05:05 jimmygchen

@chong-he found an issue during testing and it looks like the VC re-queries the BN during the first slot of the epoch, I'll look into this.

Apologies! I was looking at the wrong slot time.

The VC does not re-query on the first slot in the next epoch (even though we start the VC in the first slot). It waits until the 12s has lapsed and only re-query the BN about the status (i.e., query at the second slot of an epoch). The VC also only query once per epoch, which is a great improvement.

All good upon testing, this PR is good to go

chong-he avatar May 06 '24 23:05 chong-he

sorry looks like cargo fmt is failing because I messed up the indentation in my suggestion

michaelsproul avatar May 21 '24 04:05 michaelsproul

My bad, should've checked myself :P thanks!

jimmygchen avatar May 21 '24 06:05 jimmygchen

@mergifyio queue

michaelsproul avatar May 22 '24 00:05 michaelsproul

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 52e31121df5ba74442d37ed9ec1812ee0bdad44a

mergify[bot] avatar May 22 '24 00:05 mergify[bot]