refactor: supervise spawned tasks
Description
This PR introduces a lightweight Supervisor for Tokio tasks.
What it does:
- Monitors multiple children
- Provides a single shutdown signal for everything
- Supports graceful shutdown timeout before aborting a child
- If a child panics, initiates shutdown and exits with an error
- If a child exits before the shutdown signal, also initiates shutdown and exits with an error. Note: this might not always be the desired behaviour, but currently there are no other cases in Iroha. It could easily be extended to support refined per-child strategies.
- Logs children's lifecycle and panics
What it doesn't do:
- Doesn't support restarting children. To implement that, we need a formal actor system.
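To make the rules above concrete, here is a rough, std-only model of them. It is not the code in this PR: it swaps Tokio tasks for OS threads so it runs without dependencies, and every name in it (`Event`, `Outcome`, `Child`, `supervise`, `demo`) is hypothetical. Since threads cannot be aborted, the timeout branch simply stops waiting, whereas the real Supervisor aborts the child. It also assumes that only the first failure decides the result.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::{Duration, Instant};

/// Messages the supervisor reacts to (hypothetical names).
enum Event {
    ShutdownRequested,
    ChildDone { name: &'static str, panicked: bool },
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Clean,
    ChildPanicked(&'static str),
    ChildExitedEarly(&'static str),
}

/// A child is a plain function that receives the shared shutdown flag.
type Child = fn(Arc<AtomicBool>);

fn supervise(
    children: Vec<(&'static str, Child)>,
    tx: mpsc::Sender<Event>,
    rx: mpsc::Receiver<Event>,
    timeout: Duration,
) -> Outcome {
    let shutdown = Arc::new(AtomicBool::new(false));
    let mut running = children.len();
    for (name, child) in children {
        let (flag, tx) = (Arc::clone(&shutdown), tx.clone());
        let worker = thread::spawn(move || child(flag));
        // Reporter thread: joins the child and reports how it finished.
        thread::spawn(move || {
            let panicked = worker.join().is_err();
            let _ = tx.send(Event::ChildDone { name, panicked });
        });
    }
    drop(tx);

    // Phase 1: wait for a shutdown request or for the first child to finish.
    // Any child finishing before the signal counts as a failure.
    let outcome = match rx.recv() {
        // Err means every sender is gone and nothing is left to supervise.
        Ok(Event::ShutdownRequested) | Err(_) => Outcome::Clean,
        Ok(Event::ChildDone { name, panicked }) => {
            running -= 1;
            if panicked {
                Outcome::ChildPanicked(name)
            } else {
                Outcome::ChildExitedEarly(name)
            }
        }
    };

    // Phase 2: signal everyone, then wait for the rest until the deadline.
    shutdown.store(true, Ordering::SeqCst);
    let deadline = Instant::now() + timeout;
    while running > 0 {
        let left = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(left) {
            Ok(Event::ChildDone { .. }) => running -= 1,
            Ok(Event::ShutdownRequested) => {}
            // Timed out: stop waiting (the real Supervisor would abort here).
            Err(_) => break,
        }
    }
    outcome
}

fn run_until_shutdown(shutdown: Arc<AtomicBool>) {
    while !shutdown.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(5));
    }
}

fn exit_early(_shutdown: Arc<AtomicBool>) {}

fn panic_early(_shutdown: Arc<AtomicBool>) {
    panic!("boom");
}

/// Runs one well-behaved child next to `child`, optionally with a shutdown
/// request already queued before anything starts.
fn demo(name: &'static str, child: Child, request_shutdown: bool) -> Outcome {
    let (tx, rx) = mpsc::channel();
    if request_shutdown {
        tx.send(Event::ShutdownRequested).unwrap();
    }
    let children = vec![("well-behaved", run_until_shutdown as Child), (name, child)];
    supervise(children, tx, rx, Duration::from_secs(1))
}
```

Calling `demo("short-lived", exit_early, false)` models a child that returns before any shutdown signal, `demo("faulty", panic_early, false)` models a panic, and `demo("second", run_until_shutdown, true)` models a requested shutdown in which every child stops in time.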
Linked issue
Closes #4698
Related to #4516 (helps to identify failures)
Benefits
- Clearer shutdown traces
- Clearer Iroha lifecycle
- Fewer unnoticed panics
Like the interface.
However, I find it a bit hard to track the logic of Supervisor itself. Maybe we can simplify its implementation?
My idea is to handle all tasks inside SupervisorTask itself and reduce the amount of task spawning. Here is a rough sketch of my idea.
What do you think?
> However, I find it a bit hard to track the logic of Supervisor itself. Maybe we can simplify its implementation? My idea is to handle all tasks inside SupervisorTask itself and reduce the amount of task spawning. Here is a rough sketch of my idea. What do you think?
I agree that the current implementation of Supervisor is somewhat clunky and not easy to follow.
I tried to rewrite it with fewer spawns, using FuturesUnordered as in your sketch, but it also turned out to be very confusing once shutdown timeouts came into play, so I abandoned that approach. If you can implement it so that all current tests pass and the implementation is also easy to understand, you're welcome to =)
Anyway, I found a middle ground and made the implementation simpler. Now Supervisor relies on JoinSet, and for each monitored child it spawns 2 tasks: one to report the child's result, and one to implement the child's shutdown logic. Supervisor itself waits until this JoinSet finishes all tasks, and also waits for messages with the children's results so that it can initiate shutdown if something went wrong.
Please re-review!
> Anyway, I found a middle ground and made the implementation simpler.
Indeed, the implementation became much simpler and easier to grasp. I like it.
Rebased and ran all tests locally: everything passes except 3 extra functional tests.