consensus-specs add justification test

Add justification test that ensures that balances as oppose to bits are utilized when updating justified checkpoints.

In the event that bits are used for this, checkpoints can diverge when using balances only in some degenerate pre_states. That said, we do not strictly define "valid" pre-states and thus in the event that an attacker can craft a bad state, they can then craft a mismatch if this logic is not adhered to

Aug 08 '22 15:08 djrtwo

Prysm will probably fail this, and it probably gets into implementation territory to have the spec mandate how clients should behave under malformed status that should not happen at runtime. cc @terencechain

Aug 08 '22 15:08 potuz

This is a consensus issue that is regularly found in fuzzing in which Prysm disagrees on the output of a state transition. That call does require what we might considered a "malformed state", but we don't actually have any formal definition of "malformed" other than we don't tihnk it can be reached, but we don't have any formal proofs about reachability and beacon state field relationships.

In general, I think premising the output of a spec function on some not-defined-notion of welformedness makes it much harder to reason about correctness of code, makes it easier to introduce bugs, and provides an amplification vector in the event that an attacker can get the blockchain to what your code considered "malformed".

On the correctness side, fuzzers will regularly yell about this so we'll have to just ignore (which could lead to bugs due to not realizing the intricacies between two fuzzing errors). Additionally, anyone reviewing such code now has to understand your well-formed state assumptions (which no one can just look up because such assumptions are not part of the spec).

As for introduction of bugs, in this case, if in a future upgrade, there is some additional value being set within the if balance statements, then the if bit logic can easily and quickly become invalidated in a very real edge case where the denominator of the balance check can change between epochs (oh and the numerator due to slashings). I would hope that our spec tests cover this very exotic edge case in which balances shift enough between N and N+1, but no one can guarantee that the coverage would be there (it's not currently). Additionally, we would already be trained to ignore such issues in the fuzzer.

As for the amplification vector. This is a logic difference between Prysm and the spec and between Prysm and other clients. The best exploits are often when an attacker strings together a series of minor exploits to do something very unexpected. By premising your logic here upon "a wellformed state" in which logic can diverge with a "degenerate state", then you are opening yourself up to a much worse outcome if an attacker can craft a degenerate state, than if the logic was just "correct" in isolation wrt any state. That is, an attacker would not only bork the finality justifications but could also then cause a chain split which would be much harder to recover from.

Aug 09 '22 03:08 djrtwo

I believe we have a formal definition of a well defined state as one that can be reached from the genesis state by a sequence state transition. And we can formally prove that coherence of balances and bits is an invariant of state transition. Moreover, this test not only would make Prysm behavior wrong, but also would not allow clients to simply fail when they recognize a state that cannot be reached from Genesis by state transition because this invariant was violated.

Most importantly, a test like this will set a precedent that we do not want to have: that implementers need to be wary about these kinds of details when reviewing the reference specification and pull requests to this repo. Researchers or reference implementation writers often times make decisions based on readability or other criteria that makes sense for a reference implementation and that does not necessarily reflect the reality of production code. In this particular case the spec writer could have easily chosen to use bits instead of balances. Someone evaluating the PR to the repo should not be approving or rejecting based on how this Python implementation may or not later reflect in his own team's implementation. Moving towards testing this level of implementation detail necessarily leaks design decisions from the Python implementation into production clients.

Aug 09 '22 05:08 potuz

To be clear, bits vs balances are not the same functionality

    if previous_epoch_target_balance * 3 >= total_active_balance * 2:
        state.current_justified_checkpoint = Checkpoint(epoch=previous_epoch,
                                                        root=get_block_root(state, previous_epoch))
        state.justification_bits[1] = 0b1
    if current_epoch_target_balance * 3 >= total_active_balance * 2:
        state.current_justified_checkpoint = Checkpoint(epoch=current_epoch,
                                                        root=get_block_root(state, current_epoch))
        state.justification_bits[0] = 0b1

In epoch N, current_epoch_target_balance * 3 >= total_active_balance * 2 can evaluate to True. Whereas in epoch N+1, previous_epoch_target_balance * 3 >= total_active_balance * 2 can then evaluate to False, due to either slashings or active balance changes.

Thus bit and balance logic is not equivalent.

In this particular case the spec writer could have easily chosen to use bits instead of balances

Although it is true, the choice is not equivalent logic on that conditional.

Aug 09 '22 13:08 djrtwo

In epoch N, current_epoch_target_balance * 3 >= total_active_balance * 2 can evaluate to True. Whereas in epoch N+1, previous_epoch_target_balance * 3 >= total_active_balance * 2 can then evaluate to False, due to either slashings or active balance changes.

I believe this is flawed: in this particular case, then the bit at index 1 will still be set because it would have been the bit at index 0 in the previous transition, and it is currently the bit at index 1 by the bits rotation at the beginning of the function. There are no operations that "unset a bit" except the bit rotation.

In any case, here's as close to a formal proof that these are equivalent as I can get while traveling, showing that bits are equivalent to balances for the purpose of justification https://hackmd.io/IHC9F7bsQ6GTYz_JXNNA9Q?view

Aug 10 '22 11:08 potuz

believe this is flawed: in this particular case, then the bit at index 1 will still be set because it would have been the bit at index 0 in the previous transition, and it is currently the bit at index 1 by the bits rotation at the beginning of the function.

This it not my point. My point is that the conditional logic as written with the spec balance conditionals can change epoch to epoch and, thus, it is not equivalent to the bit persistence logic (as in prysm). The code segment as a whole is only equivalent in practice because the contents of the logic within the if statement result in a no-op (that is, without a degenerate state, the additional running of the logic when utilizing bits will be a no-op).

This makes assumptions about the state that are not part of the spec (and require an adhoc proof in this case), make it harder to audit the code due to the additional assumptions beyond what is written in spec, provide an amplification vector in the event that an attacker can craft such a bad state (chain split instead of just bad state), and create "false positives" that will regularly show up in fuzzing.

I don't see any reason not to squash this logic mismatch with a simple code modification.

Given the false positives that will continue to show up in fuzzing (which I consider very bad and should be eliminated), I see two paths:

merging this test vector and prysm doing a minor refactor to have this logic correct in all possible states instead of just well formed states
altering the spec to account for the assumption prysm makes about the state, adding tests for it, and changing the logic in the other 4 clients

Aug 10 '22 15:08 djrtwo

The only assumption that prysm does from the state is that it was obtained from Genesis by state transition, an ad-hoc proof shows that the bits assumption is a consequence of this. So I don't think this is an unreasonable assumption. Also I don't see why would we require anything from other clients and test this coherence between balances and bits. We can simply test this on the fuzzer, it is a single conditional to add as a valid input check to the fuzzer.

Aug 10 '22 16:08 potuz

I'd like input here from other client teams but I am happy to put this on ice until after Merge mainnet releases are complete.

Adding DO NOT MERGE tag

Aug 12 '22 13:08 djrtwo

consensus-specs consensus-specs copied to clipboard

add justification test

consensus-specs
consensus-specs copied to clipboard