Deal with non-finalization that spans more than one weak subjectivity period
The weak subjectivity (WS) spec does not currently define this behaviour, so client implementations may end up inconsistent and dangerous. For example, my Teku client simply went offline during an extended non-finality period: https://github.com/PegaSysEng/teku/issues/3005
As far as I can see, there are two possible behaviours we can specify:
- Fully observe WS during non-finality. This would mean that during non-finality, clients should store the epoch `current_epoch - compute_weak_subjectivity_period(state)` as their WS checkpoint every epoch, and never revert beyond it, even if the fork choice rule gives a different result. This should probably also be noted in the fork choice rule.
We would then also need to clarify in the WS spec that, more generally, a WS checkpoint is preferably but not necessarily a finalized epoch (the spec does not currently say so, but I think many assume it is).
- Do not observe WS during non-finalization. Clarify that a WS checkpoint is always a finalized epoch, and after the WS checkpoint, the fork choice should prevail, even if it means a reversion longer than the WS period.
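To make the difference between the two options concrete, here is a minimal sketch, not spec code. `NodeView`, `ws_floor_option_1`, `ws_floor_option_2` and `reorg_allowed` are hypothetical names, `ws_period` stands in for the result of `compute_weak_subjectivity_period(state)`, and the epoch numbers are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Hypothetical summary of what a client knows at a given moment."""
    current_epoch: int
    finalized_epoch: int
    ws_period: int  # stand-in for compute_weak_subjectivity_period(state), in epochs


def ws_floor_option_1(view: NodeView) -> int:
    # Option 1: a rolling checkpoint at current_epoch - ws_period,
    # never earlier than the last finalized epoch.
    return max(view.finalized_epoch, view.current_epoch - view.ws_period)


def ws_floor_option_2(view: NodeView) -> int:
    # Option 2: the checkpoint is always the last finalized epoch.
    return view.finalized_epoch


def reorg_allowed(common_ancestor_epoch: int, floor_epoch: int) -> bool:
    # A reorg "reverts beyond" the checkpoint when the common ancestor of the
    # old and new heads lies before the checkpoint epoch.
    return common_ancestor_epoch >= floor_epoch


# Long non-finality: last finalized at epoch 100, now at epoch 1000.
view = NodeView(current_epoch=1000, finalized_epoch=100, ws_period=256)
print(reorg_allowed(500, ws_floor_option_1(view)))
print(reorg_allowed(500, ws_floor_option_2(view)))
```

Under these assumptions, a reorg whose common ancestor is at epoch 500 is rejected under option 1 (the floor is epoch 744) and left to the fork choice under option 2 (the floor is epoch 100).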
I would prefer option 2, because:
- Weak subjectivity periods are less meaningful when the chain is not finalizing -- in particular, no new validators that haven't already been committed to can be added
- It allows better automatic resolution of chain splits (e.g. geographical) spanning longer periods.
From what I can see, there is only one case where this leads to undesirable behaviour: if 51%+ of validators go offline for a long time, they may later decide they do not like the resulting chain, built by the remaining 49%- of validators with the offline validators' deposits highly diluted, and attack it. I consider this situation much less likely than the case of two chains being built during a geographic split.
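As a rough illustration of how much dilution the 51%/49% scenario involves, the sketch below computes how far the offline validators' balances must shrink before the online side alone holds the 2/3 of total stake needed to finalize. It is a toy model under stated assumptions: all validators start with equal balances, online balances stay constant, and ejections and the actual penalty schedule are ignored. The function name is hypothetical.

```python
def max_offline_retention_for_finality(online_fraction: float) -> float:
    """Largest fraction x of their original balance that offline validators
    can retain while the online side still holds >= 2/3 of total stake.

    With online share o and offline share f = 1 - o:
        o / (o + f * x) >= 2/3   <=>   x <= o / (2 * f)
    """
    if not 0.0 < online_fraction <= 1.0:
        raise ValueError("online_fraction must be in (0, 1]")
    offline_fraction = 1.0 - online_fraction
    if offline_fraction == 0.0:
        return 1.0  # nobody is offline, so no dilution is needed
    # Capped at 1.0: if the online side already has 2/3, no dilution is needed.
    return min(1.0, online_fraction / (2.0 * offline_fraction))


print(round(max_offline_retention_for_finality(0.49), 4))
```

In this model, with 49% of stake online, the offline 51% must fall to roughly 48% of their original balance before the online chain can finalize again.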
> Do not observe WS during non-finalization. Clarify that a WS checkpoint is always a finalized epoch, and after the WS checkpoint, the fork choice should prevail, even if it means a reversion longer than the WS period.
I agree with this option. This is the expected behavior as per the current WS spec.
Client teams should note that the current WS spec only concerns itself with WS sync when the client is started. Unless explicitly mentioned in the spec, do not implement any additional WS behavior, as this may lead to fork choice deadlocks, client sync failures, or other miscellaneous issues.
Advanced WS behavior is a topic of discussion and will be included in the WS spec when finalized.
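For reference, the startup-only check described above can be sketched roughly as follows. This is a simplification using plain integer epochs, and the function and parameter names are illustrative rather than the spec's exact helpers; `ws_period` again stands in for `compute_weak_subjectivity_period(state)`.

```python
def is_within_ws_period(current_epoch: int, ws_checkpoint_epoch: int, ws_period: int) -> bool:
    # The checkpoint is usable only if the wall-clock epoch has not moved
    # more than ws_period epochs past it.
    return current_epoch <= ws_checkpoint_epoch + ws_period


def check_ws_at_startup(current_epoch: int, ws_checkpoint_epoch: int, ws_period: int) -> None:
    # Run once when the client starts; nothing here is re-evaluated afterwards.
    if not is_within_ws_period(current_epoch, ws_checkpoint_epoch, ws_period):
        raise RuntimeError(
            "WS checkpoint is too old; obtain a more recent checkpoint before syncing"
        )


check_ws_at_startup(current_epoch=1200, ws_checkpoint_epoch=1000, ws_period=256)
```

The check runs once and does not touch the fork choice afterwards, which is the behaviour the comment above asks client teams to stick to.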
I think this would need some clarification. I'm happy to create a PR of what's needed in my opinion.
> Clarify that a WS checkpoint is always a finalized epoch, and after the WS checkpoint, the fork choice should prevail, even if it means a reversion longer than the WS period.
This is unsafe for the same reasons it is unsafe to sync from outside of the WS period. If non-finality lasts longer than the WS period, a minority attacker can construct an alternate chain on which they have become the majority and begin to finalize. If you can reorg deeper than the WS period during non-finality, you could reorg to such a chain, which the attacker constructed for "free".
> This is unsafe due to the same reasons it's unsafe to sync from outside of the WS period.
Well, I argue it's not unsafe because there are no safety guarantees while chains aren't finalized.
But if you want to make what you're suggesting the "safe" behaviour, then to be fully consistent you should be in favour of option 1, which means that even clients that are online for the whole period would make WS fallback checkpoints beyond which they would not revert?
> Clarify that a WS checkpoint is always a finalized epoch
One difficulty with this approach is that it basically invalidates the finalized_checkpoint field in the state object. If I'm looking at a head, I can no longer trust that the state contains canonical information about what the finalized checkpoint is, and I need to go out-of-band to fetch it. Where the user supplied one, this is feasible, but if we start automatically checkpointing weak subjectivity, it will create communication difficulties between clients, explorers, etc. The way to move forward with option 1 would be to modify the state transition function to update the finalized checkpoint, rather than relying on an "implementer's recommendation".
I am closing this issue because it seems stale. Please, do not hesitate to reopen it if this is a mistake