elixir-omg Ensure and clarify Watcher security behavior during byzantine events

Ensure and clarify Watcher security behavior during byzantine events

Open kasima opened this issue 5 years ago • 13 comments

As a person with value in the child chain, I can have a clear understanding the current state of the watcher during byzantine events, So that I can perform the necessary actions to exit and keep my value safe

This may include:

Extra messaging returned from status.get about the byzantine events in question, e.g. "Watcher has stopped syncing on Ethereum block 4559790 due to an unchallanged_exit. You may continue to query the watcher for exit data to safely exit the chain before the finalization period ends in 4 days."
Documentation around the behavior of the Watcher when various byzantine conditions happen and what to do about them
Ensure that the watcher can continue to perform its security-critical functions in all detected byzantine conditions

Jun 26 '19 10:06 kasima

Ensure that the watcher can continue to perform its security-critical functions in all detected byzantine conditions

AFAIK, this is true, as long as the user is abiding by the protocol. In case there's a scenario when any security-critical functions become unavailable, let's open specific bug tickets instead. I propose to drop the item from TODO list

OTOH I propose to add:

analyze and improve the UX of various byzantine events scenarios - figure out what info would be useful
as part of the above I suggest doing points 1. and 2. from https://github.com/omisego/elixir-omg/issues/785#issuecomment-504947954:
- serve just a single instance of unchallenged_exit, instead of many. It would then behave like withholding and invalid_block events, which are always single too.
- strip unchallenged_exit of the metadata invalid_exit has
whenever stopping BlockGetter because of unchallenged_exit, log details of the exit being the culprit for debugging cases like the INC20190617-01
items gathered in this older issue https://github.com/omisego/elixir-omg/issues/494 (shall we conflate those into this issue here and close the old one?)

Jun 27 '19 13:06 pdobacz

Your proposal LGTM Piotr,

It is definitely unclear to the user that the BlockGetter is being blocked by the unchallenged_exit, (At least I did remember from Hoard that this was unclear). if this piece of info is explicit then it should be enough.

RE: your first point around UX of various byzantine events scenarios, I think we will need to sit down and start with listing down the proper response that the maximally security aware actors should do under each byzantine conditions and the repercussions of not following the protocol. Then from that point on we can narrow down what info will be useful to reduce the friction to follow the protocol. Do you think this is a good approach ? @pdobacz

Jul 04 '19 01:07 Pongch

if this piece of info is explicit then it should be enough.

how would you see it made explicit (except for the docs as Kasima outlined, which I suppose are currently missing)? Currently there's the log message + byzantine_events entry (until challenged). I propose to expand the log message with exit details to be able to track. Do you think this is enough?

Do you think this is a good approach ?

sounds good! So let's kick this off by writing down the first draft of the "following the protocol" guide

Jul 04 '19 08:07 pdobacz

Ok, just to consolidate all these comments and tickets. Something for @pdobacz and @Pongch. Otherwise I don't understand what needs to be done :))))

[ ] analyze and improve the UX of various byzantine events scenarios - figure out what info would be useful how do we do that, @Pongch ?
[ ] One could potentially serve just a single instance of unchallenged_exit perhaps, instead of many, to make it even more aligned. It would then behave like withholding and invalid_block events, which are always single too. So we serve them on the first occurring basis.
[ ] Also the unchallenged_exit could be stripped of the metadata invalid_exit has, to make the distinction cleaner (unchallened_exit just references the (any or oldest?) invalid_exit - {name: unchallenged_exit, because_of: {name: invalid_exit, utxo_pos: 345678}} or similar)
[ ] whenever stopping BlockGetter because of unchallenged_exit, log details of the exit being the culprit for debugging cases like the INC20190617-01 Log them where and how are they retrievable?
[ ] When calling status.get available piggybacks are reported as byzantine events How are they supposed to show up?
[ ] IFE's that are never piggybacked, don't have a notion of being finalized. Will always show up in in_flight_exits and available_piggybacks What are the consequences of this work?

If last two are ok, let's close https://github.com/omisego/elixir-omg/issues/494

Aug 19 '19 12:08 InoMurko

So we serve them on the first occurring basis.

yeah, so instead of returning late_invalid_exits_events as done now, you should just tap into this: https://github.com/omisego/elixir-omg/blob/master/apps/omg_watcher/lib/omg_watcher/exit_processor/core.ex#L408 and append a single entry of the unchallenged_exit event to the big list.

This should be rather straightforward

Log them where and how are they retrievable?

The event is already logged, the issue is that there's no details attached making the log too obscure. Log is produced in various spots when the stopping occurs, for example here: https://github.com/omisego/elixir-omg/blob/master/apps/omg_watcher/lib/omg_watcher/block_getter.ex#L208.

In this case managing the details to be logged can be a tad tricky

When calling status.get available piggybacks are reported as byzantine events

How are they supposed to show up?

Byzantine events list is our de facto "action item list", so in a way, they fit in there. It's just the name is a little bit confusing and can be also annoying, since "someone else's" available piggybacks show up, and it's just noise (I cannot piggyback someone else's stuff).

OTOH, our assumption was that it would be the client that would filter out those particular events, and only pick the one it owns.

I lack UX intuition to maneuver out of this, any ideas?

IFE's that are never piggybacked, don't have a notion of being finalized. Will always show up in in_flight_exits and available_piggybacks

What are the consequences of this work?

Not sure I understand the question. The problem is exactly as stated. An IFE that is started but never piggybacked will timeout tacitly and hence remain in the report forever (I think same happens if the piggyback is done but challenged). We need to either: 1/ do some awkward magic on the contract's end and emit an event whenever this happens 2/ track the moment of the timeout on elixir-omg's end and compare it to some notion of now and if timeout is past then don't show those IFEs anymore by tagging is_active: false.

to consolidate all these comments

I think the two first bullets from Kasima's original post went missing (the third bullet point I suggested to treat as bugs and put in separate issues, if spotted). Unless we treat those as parts of the UX item, makes sense. Anyway let's not lose sight of those requests.

Aug 19 '19 15:08 pdobacz

The event is already logged, the issue is that there's no details attached making the log too obscure. Log is produced in various spots when the stopping occurs, for example here: https://github.com/omisego/elixir-omg/blob/master/apps/omg_watcher/lib/omg_watcher/block_getter.ex#L208.

What kind of log would be sufficient? Are users aware, that the only way you can resume syncing after this log occurs is that you need to restart watcher, even if you clear actionable events from status.get?

Maybe having a raise alarm of these sorts would suffice? Then we split documentation saying, all business relevant actionable events are coming from status.get but infrastructure relevant information are retrievable from alarm.get?

Byzantine events list is our de facto "action item list", so in a way, they fit in there. It's just the name is a little bit confusing and can be also annoying, since "someone else's" available piggybacks show up, and it's just noise (I cannot piggyback someone else's stuff).

Afaik, we don't have actionable accounts - but if we had them, we could segregate events between those that are actionable or not. But we might just leave that to the API users, depends what product wants... @Pongch ?

Not sure I understand the question. The problem is exactly as stated. An IFE that is started but never piggybacked will timeout tacitly and hence remain in the report forever (I think same happens if the piggyback is done but challenged). We need to either:
1/ do some awkward magic on the contract's end and emit an event whenever this happens
2/ track the moment of the timeout on elixir-omg's end and compare it to some notion of now and if timeout is past then don't show those IFEs anymore by tagging is_active: false.

You explained what I was asking. I was having troubles figuring out what work needs to be done. But before we decide on 1 or 2. Does product have any preferences? @Pongch

I will try to hash the issues out, have exact work requirements and then create tasks here.

Aug 20 '19 11:08 InoMurko

The event is already logged, the issue is that there's no details attached ...

What kind of log would be sufficient?

Hm, I'd say anything sufficient to track the root cause (invalid exit of some sort) down. Right now, only old invalid standard exits (invalid_exit) cause unchallenged_exit, in this case the utxo_pos is used as something like an identifier (e.g. you get the challenge by that). In case of other root causes (non-canonical IFE, invalid piggyback), different pieces of information will be relevant.

To cut short: for now logging that it was an invalid_exit event for a particular utxo_pos sounds most reasonable.

Are users aware, that the only way you can resume syncing after this log occurs is that you need to restart watcher, even if you clear actionable events from status.get?

I don't think so, but this is a touchy subject, because what the users should do to stick to the protocol is mass exit, not restart. But if there's the right spot and wording to put this hint into, then maybe it's ok? Maybe this should become the part of the byzantine events playbook, as a "pro-tip"?

Aug 20 '19 11:08 pdobacz

Agree with @pdobacz, I think what is needed before this story could be tackled is to produce a byzantine events guideline, a strict Plasma MoreVP protocol that has details like the consumer should do X under byzantine conditions Y, working backwards, this will then inform the proper Byzantine UX and hence the API design for /status.get

I think we have already started a similar document before and have some byzantine event descriptions in documents here and there. We should consolidate all this.

Aug 22 '19 07:08 Pongch

A requirement (blocker) to all tasks below:

[ ] As per @Pongch, we first need to clarify all events here: https://omisego.atlassian.net/wiki/spaces/NET/pages/67928099/WIP+Following+the+Plasma+Protocol+Guide That will allow us to reason about the changes needed for API responses.
[ ] Extra messaging returned from status.get about the byzantine events in question, e.g. "Watcher has stopped syncing on Ethereum block 4559790 due to an unchallanged_exit. You may continue to query the watcher for exit data to safely exit the chain before the finalization period ends in 4 days." Requirement: Protocol Guide
[ ] Analyze and improve the UX of various byzantine events scenarios - figure out what info would be useful. A general task for findings that don't fit in any below tasks. Requirement: Protocol Guide
[ ] One could potentially serve just a single instance of unchallenged_exit perhaps, instead of many, to make it even more aligned. It would then behave like withholding and invalid_block events, which are always single too. We can serve them on the first occurring basis. Requirement: Protocol Guide
[ ] Also the unchallenged_exit could be stripped of the metadata invalid_exit has, to make the distinction cleaner (unchallened_exit just references the (any or oldest?) invalid_exit - {name: unchallenged_exit, because_of: {name: invalid_exit, utxo_pos: 345678}} or similar) Requirement: Protocol Guide
[ ] If BlockGetter stops syncing, that needs to come out on the surface of the app so that we're know WHY and HOW to fix. Requirement: Protocol Guide
[ ] When calling status.get available piggybacks are reported as byzantine events. We need to decide, based on the Protocol Guide, if we show piggybacks we're not able to ride. Requirement: Protocol Guide
[ ] tracked in https://github.com/omisego/elixir-omg/issues/1279 An IFE that is started but never piggybacked will timeout tacitly and hence remain in the report forever (I think same happens if the piggyback is done but challenged). We need to either: 1/ do some awkward magic on the contract's end and emit an event whenever this happens 2/ track the moment of the timeout on elixir-omg's end and compare it to some notion of now and if timeout is past then don't show those IFEs anymore by tagging is_active: false. Requirement: Protocol Guide

OK, I think that's it. @Pongch please advise who should tackle documenting the Protocol Guide, after that, we're able to continue on other items.

Aug 22 '19 15:08 InoMurko

@InoMurko I have updated the Protocol Guide, one thing left to do is to add some mock data from /status.get and showing additional data during the different respond steps

IMO, some of these events don't fit into the criterion of byzantine per se (for example, as you have stated with piggyback). To break the distinction even further, there seems to be 3 distinct criteria of events from the watcher API:

Unsafe chain: which is the worst case scenario. ie, unchallenged exits and block withholding
Byzantine event: bad actors trying to steal funds that people should challenge. ie, invalid exit, noncanonical_ife
Events requiring some user's actions: ie. piggyback_available
Report of watcher failure: ie. stalled_syncing

Do note that we have an issue opened for stalled syncing here: https://github.com/omisego/elixir-omg/issues/814

Aug 28 '19 12:08 Pongch

this https://github.com/omisego/plasma-contracts/pull/475 seems to help address the issue:

An IFE that is started but never piggybacked will timeout tacitly and hence remain in the report forever

Nov 22 '19 08:11 pdobacz

Took the liberty to edit the task list and mention that one item is now tracked in https://github.com/omisego/elixir-omg/issues/1279

Jan 27 '20 07:01 pdobacz

We will address Extra messaging returned from status.get about the byzantine events and Analyze and improve the UX of various byzantine events scenarios as part of the feature team (https://github.com/orgs/omisego/projects/34). The issue feels like a place for general discussion and we will not attempt to close it.

Mar 23 '20 08:03 thec00n

elixir-omg elixir-omg copied to clipboard

Ensure and clarify Watcher security behavior during byzantine events

elixir-omg
elixir-omg copied to clipboard