consensus-specs
Withdrawals into queue in single function
Rework of @rolfyone's PR (#3032)
There were some minor bugs introduced in the last one. I think I cleaned it up but have not reworked the tests; I'd prefer to do so only once we have consensus on the general feature.
Original comment:
Partially to initiate a discussion.
Basically use a single pipeline for withdrawal processing to simplify the flow and reduce the need for full validator scans during epoch processing.
Among other benefits, this should make the feature much more testable, and it makes withdrawal of partial or full amounts an equal priority based on your position in the validator list.
The timing remains approximately the same as before, except that the ~4 exits per epoch that previously had to be added as a priority are now simply handled during the sweep.
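For illustration, here is a minimal sketch of what such a single sweep could look like, using simplified stand-in types rather than the actual spec SSZ containers; the constants, the `State`/`Validator` fields, and the omission of withdrawal-credential checks are assumptions for this example, not the PR's exact code.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_WITHDRAWALS_PER_EPOCH = 256        # assumed bound, mirroring the draft constant
MAX_EFFECTIVE_BALANCE = 32 * 10**9     # Gwei

@dataclass
class Validator:
    withdrawable_epoch: int
    effective_balance: int

@dataclass
class State:
    validators: List[Validator]
    balances: List[int]                      # Gwei
    next_withdrawal_validator_index: int
    withdrawal_queue: List[Tuple[int, int]]  # (validator_index, amount)

def process_withdrawals_into_queue(state: State, epoch: int) -> None:
    """One bounded sweep per epoch: full and partial withdrawals are queued
    strictly in validator-index order, so neither kind jumps the queue.
    (Withdrawal-credential checks are omitted for brevity.)"""
    num_validators = len(state.validators)
    index = state.next_withdrawal_validator_index
    for _ in range(min(MAX_WITHDRAWALS_PER_EPOCH, num_validators)):
        validator = state.validators[index]
        balance = state.balances[index]
        if validator.withdrawable_epoch <= epoch and balance > 0:
            # Full withdrawal: the whole remaining balance leaves the state.
            state.withdrawal_queue.append((index, balance))
            state.balances[index] = 0
        elif balance > MAX_EFFECTIVE_BALANCE:
            # Partial withdrawal: skim only the excess over 32 ETH.
            excess = balance - MAX_EFFECTIVE_BALANCE
            state.withdrawal_queue.append((index, excess))
            state.balances[index] = MAX_EFFECTIVE_BALANCE
        index = (index + 1) % num_validators
    state.next_withdrawal_validator_index = index
```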
Danny's comments:
Note that this does have some shifts in "fairness" wrt when you become withdrawable and when you actually are withdrawn. It used to be immediate, but now it could be quite a long time between when you become withdrawable and when you are withdrawn (on the order of a week at the current vset size). Additionally, some people might become withdrawable sooner than others (thus exited sooner in some cases) but might then be later in the sweep.
I like the previous iteration a bit better for a few reasons:
1. The two mechanisms were isolated, so they were easier to test on the spec side. That said, reworking the tests shouldn't be too hard.
2. The timing and fairness thing. In the prior mechanism, once you become withdrawable you get all of your ETH out immediately rather than waiting for what might be a very long partial withdrawals sweep. I understand the complexity of doing a full validator sweep to find the withdrawable folks, but note that the set of withdrawable validators can't change quickly, so you can easily have the ones to check precomputed.
Re (2): if this is a major engineering complexity/hurdle, then I'm okay with the changes. But it does seem a bit weird to prioritize partial withdrawals (which most of the set is usually eligible for) over the much smaller (and more important) set of full withdrawals.
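A rough illustration of that precomputation idea, assuming a client-side cache keyed by `withdrawable_epoch`; the class and method names below are invented for the example and are not spec or client code.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set

class WithdrawableCache:
    """Illustrative cache: withdrawable_epoch is known when the exit is
    processed, so a client can maintain an epoch -> indices map and avoid a
    full validator scan at each epoch transition."""

    def __init__(self) -> None:
        self._by_epoch: DefaultDict[int, Set[int]] = defaultdict(set)

    def on_exit_initiated(self, validator_index: int, withdrawable_epoch: int) -> None:
        # Called when a validator's withdrawable_epoch is set during exit processing.
        self._by_epoch[withdrawable_epoch].add(validator_index)

    def newly_withdrawable(self, epoch: int) -> List[int]:
        # Validators becoming withdrawable exactly at this epoch, in index
        # order; the caller still re-checks that balance > 0.
        return sorted(self._by_epoch.pop(epoch, set()))
```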
Like the simplicity of this. The downside is that full withdrawals won't necessarily be processed as soon as the validator becomes withdrawable, but I don't see that being a major issue.
I do want to go back to `MAX_WITHDRAWALS_PER_EPOCH`, however. The current value of 256 will work through the entire current active validator set in around 7 days. Looking at current rewards, an average validator would expect to earn around 0.025 ETH in that 7-day period. That seems like quite a small number as the "unit of value" to be going to the effort of transferring across chains.
The other related, but somewhat more nebulous, concern I have is that I can envisage smart contracts wanting to obtain proof on-chain that a withdrawal has come from a certain validator. Providing and verifying a proof for a 0.025 ETH amount could end up eating a fair chunk of the value of the withdrawal.
If `MAX_WITHDRAWALS_PER_EPOCH` was reduced to 64 it would still get through the current active set in a month, and would be skimming values of around 0.1 ETH for active validators. It would reduce the number of withdrawal operations on-chain, and would result in proofs taking up significantly less ETH as a percentage of the transfer. Is there a reason outside of responsiveness (which I do understand) to keep this at 256?
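For reference, a back-of-the-envelope check of those numbers; the active-set size and daily reward figure below are rough assumptions (roughly the values at the time), not spec constants.

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
ACTIVE_VALIDATORS = 420_000     # assumed active set size
REWARD_ETH_PER_DAY = 0.0036     # assumed average per-validator reward

for max_withdrawals_per_epoch in (256, 64):
    epochs_to_sweep = ACTIVE_VALIDATORS / max_withdrawals_per_epoch
    days_to_sweep = epochs_to_sweep * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 86_400
    skim_eth = days_to_sweep * REWARD_ETH_PER_DAY
    # 256/epoch -> roughly a week and ~0.025 ETH; 64/epoch -> roughly a month and ~0.1 ETH
    print(f"{max_withdrawals_per_epoch} per epoch: ~{days_to_sweep:.1f} days per sweep, "
          f"~{skim_eth:.3f} ETH skimmed per visit")
```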
I'll leave the same comment here as in Paul's PR: an option is to process full withdrawals in the same loop as effective balance updates. That may be equivalent to the original approach without the extra loop.
At least for Lodestar, (on a preliminary look, not implemented yet) the previous approach won't require a full sweep of the validator set. We do all sweeps in epoch processing in one go, essentially merging multiple spec functions into one.
In order to do this you need to prove that doing it that way is equivalent to the specified version, which explicitly adds a loop after the balance changes. It needs to be checked that these processes are independent.
Right. Then you can collect an initial list of fully withdrawable validators at the beginning of epoch processing, and then re-check that balance > 0. The diff between those lists will be empty in most cases; I can only see cases like a validator being slashed for all its balance causing a diff there.
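A sketch of that collect-then-re-check pattern, reusing the simplified `State` stand-in from the earlier sketch; where exactly the second pass sits inside epoch processing is illustrative only.

```python
def process_epoch_with_merged_sweep(state: "State", epoch: int) -> None:
    # Pass 1, folded into the client's existing single epoch-processing sweep:
    # note who is fully withdrawable before any balances are touched.
    candidates = [
        i for i, v in enumerate(state.validators)
        if v.withdrawable_epoch <= epoch and state.balances[i] > 0
    ]
    # ... rewards/penalties, slashings, effective-balance updates run here ...
    # Pass 2: re-check balances; they only diverge in rare cases, e.g. a
    # validator slashed down to zero during this same transition.
    for i in candidates:
        if state.balances[i] > 0:
            state.withdrawal_queue.append((i, state.balances[i]))
            state.balances[i] = 0
```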
I don't see why full withdrawals should be prioritised, although there are definitely periods where you may exit second and still get processed before an earlier full exit.
Full withdrawals getting bumped to the top of the queue immediately does basically unfairly advantage them compared to active, contributing members of the network, and I'm not sure why that prioritisation call was made in the first instance... Processing them during the effective balance updates is probably OK, but the spec calling out that it happens right at the end while the code is completely at odds with that does seem a little odd (if the balance calculation also exits validators)...
The 'fairness' is not the reason I made the change, though; it was purely about reducing processing for the epoch transition.
I fixed the tests. Some notes:
- Merged `run_process_full_withdrawals` and `run_process_partial_withdrawals` into `run_process_withdrawals_into_queue`.
- Since full and partial withdrawals are now in the same function, I removed the `randomize_state` call from `run_random_partial_withdrawals_test`, because `randomize_state` might also trigger the full withdrawal cases and lead to an unexpected number of partial withdrawals.
IMO this PR does not add any tangible benefits in performance, while slowing full withdrawals for no reason. Partial withdrawals are processed "slowly" because we have to. Full withdrawals are processed "faster" because we can. It's purely technical; no fairness argument influences those facts.
I think we should benchmark this. How long does it realistically take to loop through the validator set? Isn't this premature optimization?
We know in teku that looping through the full set of validators is expensive and it makes any algorithm O(n) in the number of validators. So yes, we would definitely have to use an optimised version of this.
The argument for having a separate loop for full vs partial withdrawals was to make testing simpler, but that isn't the case once we need to add a cache for which validators have become withdrawable, since tests would then need to cover that caching. In other words, a separate loop actually expands the scope of testing required rather than reducing it, since we now need to cover every way a validator may become withdrawable.
Besides which, we aim for simplicity in the way consensus works and having a single approach to withdrawal is simpler than having two different approaches depending on why the validator has balance that could be withdrawn.
I tried to make a summary table:
Solution | Pros | Cons |
---|---|---|
Separate partial func and full func | Prioritizes full withdrawals? (but most devs don't think it's a benefit) | [@ajsutton] We're all saying we'll optimise it in clients, but then the tests for it will be written based on the simple implementation in the spec and won't provide coverage for the optimisations actual clients would need to do. Given Danny's argument for the separation of withdrawals was to make it simpler to test, the fact it makes the tests less effective seems to be shooting ourselves in the foot. |
#3042 | Simplest | [@dapplion] I don't find it justifiable to add delays on withdrawals for the rationale of the PR. There are optimizations possible that achieve the same end without worsening the validator UX. [@etan-status] Downside is that it adds this arbitrary delay of 3-4 days on average (but more in future) for trying to get out, during which no rewards are earned anymore. This means that exits need to be scheduled as late as possible into a withdrawal interval for a specific validator to not miss out on those rewards. This can be quite tricky because of the exit queue. Tracking the exited-but-not-yet-withdrawn validators separately is straightforward as an optimization, but please provide tests for edge cases such as depositing into a validator that has already exited or fully withdrawn, to re-enter the priority queue accordingly. |
#3042 + loop in effective balance updates (@potuz's suggestion) | Most efficient; probably closer to clients' optimizations? | [hww] More difficult to debug and test (but only in specs?) |
My impression:
1. Clients will optimize the loop anyway, so the spec is less likely to be the same as the production implementation.
2. If (1) is true, the epoch-processing tests (`process_withdrawals_into_queue`) may not apply on the client side anyway.
3. Most devs like the simplicity of #3042 on the spec side.
4. #3042 may introduce some side effects, i.e. the delay. It needs to be tested properly if we go with #3042.
My read is that it’s slightly toward 👍…? 😬
There are good arguments on both sides here and I'll chime in with another view against this change :)
I think we should prioritize spec simplicity above all else, weighted against speed to ship the feature to mainnet.
There will always be things we need to do to optimize implementations in clients, so I don't think this is a big ask for client teams. Caches etc. have been a large source of bugs in the past, and to mitigate that concern we develop/maintain a large corpus of consensus test cases. The more the spec itself is optimized for the production/maintenance of these test cases, the better the mitigation it provides against bugs due to implementation complexity.
And alongside this, I want to get to a place where we stop changing the capella spec as soon as possible so there is something stable for client teams to target by end of the year. And the best way to get stability is to stop changing the spec, i.e. do not make this change.
One big difference that this PR introduces is that the withdrawal queue is much smaller. This is the first time we have a queue in the specs, and I am worried about memory fragmentation. I started implementing this in Go, and all implementations we are evaluating either leak memory or suffer from continuous reallocations. Either way, if this PR passes I think it would be good to set `WITHDRAWAL_QUEUE_LIMIT = MAX_WITHDRAWALS_PER_EPOCH` or some aggressive limit like this, effectively capping the queue size to a constant bound, since anyway we add to this queue on the epoch transition and cannot empty it during the coming epoch.
Having the queue of constant size allows us to optimize this further by keeping a rolling index and adding in place. This would be application specific, but I imagine all languages will benefit from having a constant size rather than potentially a big chunk of the validator set.
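A minimal sketch of that constant-size, rolling-index idea as a ring buffer; the class name and the cap are illustrative, and the real bound would be whatever limit the spec settles on.

```python
from typing import List, Optional, Tuple

MAX_WITHDRAWALS_PER_EPOCH = 256  # assumed cap for the example

class RollingWithdrawalQueue:
    """Fixed array of slots reused as a ring buffer: no per-epoch allocation,
    entries are overwritten in place as the rolling indices advance."""

    def __init__(self) -> None:
        self._slots: List[Optional[Tuple[int, int]]] = [None] * MAX_WITHDRAWALS_PER_EPOCH
        self._start = 0   # index of the oldest queued withdrawal
        self._len = 0

    def push(self, validator_index: int, amount: int) -> None:
        assert self._len < MAX_WITHDRAWALS_PER_EPOCH, "queue is capped per epoch"
        slot = (self._start + self._len) % MAX_WITHDRAWALS_PER_EPOCH
        self._slots[slot] = (validator_index, amount)
        self._len += 1

    def pop(self) -> Tuple[int, int]:
        assert self._len > 0, "queue is empty"
        item = self._slots[self._start]
        assert item is not None
        self._start = (self._start + 1) % MAX_WITHDRAWALS_PER_EPOCH
        self._len -= 1
        return item
```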
Come to think of it: if we go with rolling sweeps as in this PR, there's no need for the queue at all. Knowing the next withdrawal indices uniquely determines the next withdrawals that a block can include, so we can just check against that; it seems there is no need to keep the queue in the state at all.
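A sketch of what that queueless check could look like, recomputing the expected withdrawals directly from the state; it reuses the simplified stand-ins from the first sketch, and `MAX_WITHDRAWALS_PER_PAYLOAD` plus the exact stopping rule are assumptions, not the spec.

```python
MAX_WITHDRAWALS_PER_PAYLOAD = 16   # assumed per-block bound

def get_expected_withdrawals(state: "State", epoch: int) -> List[Tuple[int, int]]:
    """Recompute the next withdrawals deterministically from the sweep
    position, instead of storing a queue in the state."""
    withdrawals: List[Tuple[int, int]] = []
    index = state.next_withdrawal_validator_index
    for _ in range(len(state.validators)):
        validator = state.validators[index]
        balance = state.balances[index]
        if validator.withdrawable_epoch <= epoch and balance > 0:
            withdrawals.append((index, balance))
        elif balance > MAX_EFFECTIVE_BALANCE:
            withdrawals.append((index, balance - MAX_EFFECTIVE_BALANCE))
        if len(withdrawals) == MAX_WITHDRAWALS_PER_PAYLOAD:
            break
        index = (index + 1) % len(state.validators)
    return withdrawals

# Block validation would then just check that the block's withdrawals list
# equals get_expected_withdrawals(pre_state, current_epoch).
```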
@potuz One of the nice things about the queue is that it fixes the balance of the withdrawal at the point where it was processed in `process_epoch`. One thing that concerns me about the queueless approach (and https://github.com/ethereum/consensus-specs/pull/3068) is that we have to deal with the complexity of balances changing after every block, meaning that whatever cache we use for withdrawals needs to be updated on every slot and needs to handle more frequent re-orgs, etc. It makes withdrawals feel a lot more like attestations, which we know are more complex to pack than e.g. voluntary exits, which are epoch-based.
Balances can already change on each slot due to slashing.
@etan-status That doesn't affect the queued approach though, because the balance decrement for the withdrawal happens immediately on the epoch transition. If a validator that is getting partially withdrawn gets slashed before their withdrawal is processed, then the slashing just applies to the remaining balance (which I think is fine).
Closing due to #3068.