snet-daemon icon indicating copy to clipboard operation
snet-daemon copied to clipboard

Make daemon handle signals (SIGINT, SIGTERM) and correctly cleanup the state

Open vsbogd opened this issue 6 years ago • 9 comments

At the moment if service provider kills daemon while it processes client request, then channel remains locked after daemon process is finished. The reason is that Golang doesn't call deferred functions automatically when process is finished due to signal received.

To handle signals correctly in Golang it is required to:

  • add signal channel and listen for the external signals
  • make cleanup calls, close gRPC server
  • gracefully shutdown all goroutines

The sketch of closing gRPC server on signal can be found at https://github.com/grpc/grpc-go/issues/1372

vsbogd avatar Jan 31 '19 06:01 vsbogd

@vsbogd , just wanted to post the group that raam and I discussed this at length today , will send you a documentation on our approach in a day or two

anandrgitnirman avatar Apr 08 '19 12:04 anandrgitnirman

@vsbogd @raamb

For a graceful Shutdown

  • Add Signal channel and listen for external Signal , This is already done on the Daemon code `` https://github.com/singnet/snet-daemon/blob/v0.1.10/snetd/cmd/serve.go#L66-L70

  • To Stop new requests // GracefulStop stops the gRPC server gracefully. It stops the server from // accepting new connections and RPCs and blocks until all the pending RPCs are // finished. So we just call the GracefulStop() and not stop

  • Existing requests ->Ideal way would be to wait for the existing requests to complete and handle the payment accordingly ?.Will need help here on how best to do this ( with support for streaming), and for later worry about asynch calls when we support that in Daemon

We need to the below to gracefully shutdown go routines

  1. Signal goroutine to shutdown.
  2. Wait for goroutine to shutdown.

->Shutdown all goroutines (we don't care if the request will be processed or not ?) , roll back the payment on any channels locked ( in progress)

Check if any channels are locked because a request is currently being processed , then release the lock on these channels ALONE ? We don't want to unlock channels locked for reasons other than the requests being processed , ?

Other way is to use wg := &sync.WaitGroup{} call wg.Wait() on Shutdown
func cleanUp...(shutdown <-chan int, wg *sync.WaitGroup) {

wg.Add(1) will be called after a channel has been locked wg.Done() will be called on defer function that //calls paymentHandler.Complete(payment) or //paymentHandler.CompleteAfterError(payment, e) //In case of error while unlocking the channel , just log the error ?

If a goroutine needs to launch other goroutines it can simply call wg.Add(1) and pass the *WaitGroup and shutdown channel to it as well. In this way, whenever the shutdown channel is closed, it shuts down all the goroutines. this way the main goroutine doesn't even have to be aware of where each goroutine is started from for this to work.

anandrgitnirman avatar Apr 10 '19 13:04 anandrgitnirman

@anandrgitnirman , thank you for analysis.

As far as I see most of the work we have is done in gRPC requests. We also have HTTP handler and it forwards calls to grpcServer. So the fast fix is probably to replace

d.grpcServer.Stop()

by

d.grpcServer.GracefulStop()

And wait for gRPC server to be stopped after it.

vsbogd avatar Apr 10 '19 16:04 vsbogd

Thanks @vsbogd , apologies not too sure if I understood on wait for gRPC server to be stopped after it

anandrgitnirman avatar Apr 10 '19 17:04 anandrgitnirman

Just was not sure if GracefulStop blocks until gRPC requests in progress are finished. Documentation says that it is the case. So it can be enough.

vsbogd avatar Apr 11 '19 01:04 vsbogd

Guys, I think we definitely need a test for this use case as it is critical one. Please either reopen it or open new issue for the test.

vsbogd avatar Apr 12 '19 19:04 vsbogd

re opened this , hi @vsbogd , makes sense ,

anandrgitnirman avatar Apr 14 '19 03:04 anandrgitnirman

So before submitting the PR did the following Scenario 1

  1. Start Daemon
  2. Fire a client request ( long running one from the service) ,
  3. Make sure the channel is locked
  4. Now issue a SIGINT/SIGTERM
  5. Check if the channel locks are released on the defer function that does this.

We will get this validated by a few other developers and see if it works well for them , once confirmed we can then close this then maybe ?

anandrgitnirman avatar Apr 15 '19 05:04 anandrgitnirman

I suggest adding a unit test to check this on building stage.

vsbogd avatar Apr 16 '19 07:04 vsbogd