Deleted validator can leave residual OutstandingDowntime=true state on consumer leading to fault if new validator added

Open andrey-kuprianov opened this issue 3 years ago • 1 comments

Surfaced from @informalsystems security feedback on Interchain Security, at commit dc19c57f

Problem

It is possible for an ICS consumer to reach the state when a validator won't be slashed for arbitrary number of liveness infractions. The problem arises because of a particular way of handling slashing acknowledgements from provider to consumer chains, and may happen when a validator is tombstoned, but then created anew with the same consensus address.

Closing criteria

The logic around handling of slashing acknowledgements is reworked, and all combinations of slashing events are thoroughly examined and tested.

Problem details

The following scenario is possible:

Consumer chains A and B are connected to the Provider; Validator V is in the active set
Evidence of Validator V committing double sign on A is submitted; this gets propagated via Slash packet to Provider.
Slash packet is processed on consumer in HandleSlashPacket(), and V gets tombstoned on Provider. After some delay, this is propagated to Consumers.
Validator V commits liveness infraction on chain B. Slash packet is sent to Provider in SendSlashPacket(). OutstandingDowntime flag is set for V's address
- From that point on, OutstandingDowntime(V) is never cleared on chain B (the flag is saved by consensus address)
When Provider handles SlashPacket, vaildator is already tombstoned, the function return early, and thus no SlashAck(V) is saved
After deleting validator V, a new validator is created with the same consensus address
From now on, V can commit any liveness infractions on chain B without being punished, as SendSlashPacket() returns early because OutstandingDowntime(V) is true.

ICS Consumer liveness bug

Some notes:

Having two consumer chains (in particular chain A) in the above scenario is not necessary; it only may make this scenario easier to achieve;
The crucial element is the existence of OutstandingDowntime flag on consumer, which, when set, prevents any slashing events from being sent.
Another crucial element is the existence of several early returns in function HandleSlashPacket(), which prevent proper finalization of all tasks, in particular submission of Slashing acknowledgements (e.g., initially we've suspected the scenario with another early return if a validator is unbonded).

We recommend the following:

rework the logic around slashing for downtime. While the original motivation of not slashing multiple times for downtime might be a valid one, the implementation of that logic via the simple OutstandingDowntime flag on consumer chains doesn't seem to be the optimal solution.
avoid early returns in functions; for clean processing try to employ Golang's defer()

P.S. We thank Daniel Tisdall (@danwt) for helping us to navigate through the ICS code, for clarifying that our original suspicion of a bug because of unbonded validator is infeasible, and for pointing out that the bug is still feasible with tombstoned validator.

TODOs

[x] Labels have been added for issue
[x] Issue has been added to the ICS project

Nov 18 '22 21:11 andrey-kuprianov

I think this might be technically valid but is very far fetched because validators are rarely, if ever actually deleted on the provider, which is necessary for this to occur.

Nevertheless this should be checked. Moreover I think a quick fix is for consumers to delete OutstandingDowntime for a validator when it receives a 0 power update for the validator. More generally I think we could simplify overall by deleting OutstandingDowntime alltogether and achieving similar or equivalent semantics via some entirely provider based mechanism.

PS. There are some technicalities in the order of steps listed that aren't accurate. For example step 2 cannot preceed step 3 because if the tombstone is propogated to consumers then the validator is not active on consumer B and cannot fail to be live. However, the general idea of the execution is correct.

Nov 18 '22 22:11 danwt