Improve handling of task panics

Open • 0x009922 opened this issue 1 year ago • 1 comment

Currently Iroha spawns many Tokio tasks in a detached way, without checking whether they panic during execution.

Here are a few examples (only cases where the handle is completely ignored; there are many others where the handle is used, but not to check whether it resolves to an error):

https://github.com/hyperledger/iroha/blob/5085ff2435bf2e412dec76649b19d3a7bf091239/p2p/src/network.rs#L111

https://github.com/hyperledger/iroha/blob/f5e3c493a6f2f336da09d75c8d00562aac8168a5/cli/src/lib.rs#L387

https://github.com/hyperledger/iroha/blob/f5e3c493a6f2f336da09d75c8d00562aac8168a5/cli/src/lib.rs#L137

https://github.com/hyperledger/iroha/blob/f5e3c493a6f2f336da09d75c8d00562aac8168a5/cli/src/lib.rs#L494

This has a few issues:

  • When some vital task panics, we don't even see that it happened, and don't see an error message.
  • It results in a non-graceful "crumbling" of the system due to some tasks relying on failed ones.

For example, when the NetworkBase task panics, the only thing we see is "NetworkBase must accept messages until there is at least one handle to it: SendError { .. }", which doesn't tell us the cause.

Instead, it would be good:

  • To see panic messages of most tasks (ideally, of all tasks)
  • When a vital task panics, to gracefully shut down all other tasks
  • Ideally (overkill for now), to adopt the supervision tree design principle: monitor task execution in a centralised way with restart/shutdown strategies. This should contribute significantly to the overall fault tolerance of Iroha. For more info, see Erlang/OTP or Supervisor in Elixir.

Tools to help: TaskTracker, JoinSet, CancellationToken.

0x009922 • Jun 06 '24 08:06

There is also the issue that panics SHOULD be detected by the panic hook (set_hook), which should print the panic and shut down Iroha, but there are problems with it.
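
For reference, a minimal sketch of the hook mechanism using only `std`. It is not Iroha's actual hook, and it does not try to reproduce the problems mentioned above. The shutdown flag and the function names are hypothetical; a real implementation might cancel a token or exit the process instead.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Installs a process-wide hook that prints the panic and raises a flag.
fn install_panic_hook(shutdown: Arc<AtomicBool>) {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Keep the default behaviour: message and location on stderr.
        default_hook(info);
        // Hypothetical shutdown signal for the rest of the system.
        shutdown.store(true, Ordering::SeqCst);
    }));
}

/// Panics on a separate thread and reports whether the hook saw it.
fn hook_sets_flag() -> bool {
    let shutdown = Arc::new(AtomicBool::new(false));
    install_panic_hook(Arc::clone(&shutdown));

    // The hook is global, so it also runs for panics on other threads.
    let _ = std::thread::spawn(|| {
        panic!("worker failed");
    })
    .join();

    let flag = shutdown.load(Ordering::SeqCst);
    // Restore the default hook so the sketch leaves no global state behind.
    let _ = panic::take_hook();
    flag
}
```

The hook runs at the point of the panic, before unwinding, so it fires even when the panic is later caught and turned into a `JoinError` by whoever owns the task.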

Erigara • Jun 06 '24 08:06