snet-daemon
snet-daemon copied to clipboard
Make daemon handle signals (SIGINT, SIGTERM) and correctly cleanup the state
At the moment if service provider kills daemon while it processes client request, then channel remains locked after daemon process is finished. The reason is that Golang
doesn't call deferred functions automatically when process is finished due to signal received.
To handle signals correctly in Golang
it is required to:
- add signal channel and listen for the external signals
- make cleanup calls, close
gRPC
server - gracefully shutdown all goroutines
The sketch of closing gRPC
server on signal can be found at https://github.com/grpc/grpc-go/issues/1372
@vsbogd , just wanted to post the group that raam and I discussed this at length today , will send you a documentation on our approach in a day or two
@vsbogd @raamb
For a graceful Shutdown
-
Add Signal channel and listen for external Signal , This is already done on the Daemon code `` https://github.com/singnet/snet-daemon/blob/v0.1.10/snetd/cmd/serve.go#L66-L70
-
To Stop new requests // GracefulStop stops the gRPC server gracefully. It stops the server from // accepting new connections and RPCs and blocks until all the pending RPCs are // finished. So we just call the GracefulStop() and not stop
-
Existing requests ->Ideal way would be to wait for the existing requests to complete and handle the payment accordingly ?.Will need help here on how best to do this ( with support for streaming), and for later worry about asynch calls when we support that in Daemon
We need to the below to gracefully shutdown go routines
- Signal goroutine to shutdown.
- Wait for goroutine to shutdown.
->Shutdown all goroutines (we don't care if the request will be processed or not ?) , roll back the payment on any channels locked ( in progress)
Check if any channels are locked because a request is currently being processed , then release the lock on these channels ALONE ? We don't want to unlock channels locked for reasons other than the requests being processed , ?
Other way is to use wg := &sync.WaitGroup{}
call wg.Wait() on Shutdown
func cleanUp...(shutdown <-chan int, wg *sync.WaitGroup) {
wg.Add(1) will be called after a channel has been locked wg.Done() will be called on defer function that //calls paymentHandler.Complete(payment) or //paymentHandler.CompleteAfterError(payment, e) //In case of error while unlocking the channel , just log the error ?
If a goroutine needs to launch other goroutines it can simply call wg.Add(1) and pass the *WaitGroup and shutdown channel to it as well. In this way, whenever the shutdown channel is closed, it shuts down all the goroutines. this way the main goroutine doesn't even have to be aware of where each goroutine is started from for this to work.
@anandrgitnirman , thank you for analysis.
As far as I see most of the work we have is done in gRPC requests. We also have HTTP handler and it forwards calls to grpcServer
. So the fast fix is probably to replace
d.grpcServer.Stop()
by
d.grpcServer.GracefulStop()
And wait for gRPC server to be stopped after it.
Thanks @vsbogd , apologies not too sure if I understood on wait for gRPC server to be stopped after it
Just was not sure if GracefulStop blocks until gRPC requests in progress are finished. Documentation says that it is the case. So it can be enough.
Guys, I think we definitely need a test for this use case as it is critical one. Please either reopen it or open new issue for the test.
re opened this , hi @vsbogd , makes sense ,
So before submitting the PR did the following Scenario 1
- Start Daemon
- Fire a client request ( long running one from the service) ,
- Make sure the channel is locked
- Now issue a SIGINT/SIGTERM
- Check if the channel locks are released on the defer function that does this.
We will get this validated by a few other developers and see if it works well for them , once confirmed we can then close this then maybe ?
I suggest adding a unit test to check this on building stage.