go-spacemesh icon indicating copy to clipboard operation
go-spacemesh copied to clipboard

Manage running goroutines and graceful shutdown

Open noamnelke opened this issue 4 years ago • 7 comments

Motivation

Graceful shutdown is proving to be a challenge when multiple goroutines are running in the background. This has two parts:

  1. Signaling a shutdown and having each service respond by terminating gracefully (this part is working, but each service has its own, slightly different, implementation).
  2. being able to know when all services have completed.

This is important in production, where we want to ensure no data is lost due to a dirty shutdown, and in tests where we want to terminate quickly and cleanly without having to add unnecessary waiting periods.

Method

We want to integrate Tomb, a package for handling clean goroutine tracking and termination.

Tomb provides a Go() method that's intended to replace calling the go keyword directly to start goroutines. Internally, it uses a waitgroup to track how many goroutines have been started and how many have completed. It can also provide the first error that triggered a shutdown, if it was due to an error.

Tomb should be integrated in a single module first and then additional PRs can integrate it into more modules. Eventually:

  • No goroutine should be started using a bare go keyword.
  • All method calls that can accept a Context should receive one, provided by Tomb.
  • All select statements should have an early termination clause using tomb.Dying() (a method returning a channel that's closed when the Tomb is killed). If this clause is invoked, an ErrDying (a Tomb constant) should be returned by the goroutine and Tomb knows to ignore it as a kill reason.
  • All received context objects (specifically in api handlers) should be wrapped using tomb.Context(ctx) if used (I think they're never used as of writing this).
  • New contexts should never be generated from scratch and we should always use the Tomb for context (this happens mostly in the P2P module, but also in the PoET client).

WIP

  • [ ] Add specific tasks (where to initialize the Tomb, how to do shutdown)
  • [ ] List some candidate modules for the first integration

noamnelke avatar Jul 30 '20 16:07 noamnelke

I'd love to work on this task, as it would be a good opportunity to better familiarize with a high-level code and services' organization.

No goroutine should be started using a bare go keyword.

This could be checked/enforced by a simple linter that analyses code and looks for bare go keywords, perhaps? There are clearly will be exemptions, but if the rules are clear, this can be neatly codified.

divan avatar Aug 03 '20 11:08 divan

Hi @divan, sounds good. Feel free to give this a shot. It would probably be best to propose a plan before beginning implementation work - I'm happy to review it. We'll try to fill in some more details in this issue as well.

lrettig avatar Aug 10 '20 17:08 lrettig

Related: #514

lrettig avatar Aug 27 '20 15:08 lrettig

Related: https://github.com/spacemeshos/go-spacemesh/issues/1639

moshababo avatar Jun 26 '22 12:06 moshababo

@noamnelke still relevant?

moshababo avatar Jun 26 '22 12:06 moshababo

This isn't a task for Noam, I think @dshulyak @countvonzero should weigh in here from a node architecture perspective

lrettig avatar Jun 27 '22 21:06 lrettig

i still agree with the general approach described in the issue. there is just one caveat.

currently in the node we are using golang.org/x/sync/errgroup to manage goroutine. the aforementioned Tomb was attempted and discarded in favor of errgroup.

what remains to be done:

  • update all modules that spawn goroutines to use errgroup
  • use context to manage app shutdown
  • audit the code for panic stmts and see whether they are compatible with graceful shutdown

countvonzero avatar Jun 28 '22 19:06 countvonzero