nri icon indicating copy to clipboard operation
nri copied to clipboard

fix: Prevent calling the main process exit without WithOnClose()

Open attlee-wang opened this issue 1 year ago • 4 comments

My service is a multi-goroutine daemon service, and nri-plugin is one of the modules of my service. In nri-plugin I did not use WithOnClose() to set the exit callback function, I would rebuild a new stub when exiting. I found that when the stup exception occurred, the daemon service main was actually exited. I felt that it was very unfriendly to exit the deamon service due to a goroutine exception. After checking, it is found that there is a problem with the following code:

// Handle a lost connection.
func (stub *stub) connClosed() {
	stub.Lock()
	stub.close()
	stub.Unlock()
	if stub.onClose != nil {
		stub.onClose()
		return
	}

	os.Exit(0) // exits the process
}

os.Exit(0) exits the process,for the nri module package, using runtime.Goexit() is more elegant. runtime.Goexit() will only exit the current goroutine, not the daemon process.

attlee-wang avatar Sep 10 '24 07:09 attlee-wang

@klihub @mikebrow @fuweid can anyone help to review this ? thanks.

attlee-wang avatar Sep 10 '24 08:09 attlee-wang

My service is a multi-goroutine daemon service, and nri-plugin is one of the modules of my service. In nri-plugin I did not use WithOnClose() to set the exit callback function, I would rebuild a new stub when exiting.

@attlee-wang Not directly related to this PR, but it might be useful for you: ever since we have merged PR #91, it became possible to restart a stub by using a Stop() followed by a Start().

I found that when the stup exception occurred, the daemon service main was actually exited. I felt that it was very unfriendly to exit the deamon service due to a goroutine exception. After checking, it is found that there is a problem with the following code: ...

This behavior has been discussed and criticized earlier a few times, and it is fair to say that not without a reason. I have a few question/comments related to this.

  1. This original, and admittedly a bit harshly/hastily implemented behavior, was partly in place to make pre-connected/runtime-launched plugins exit once they lose their pre-connected transport (a socketpair() which cannot be externally re-established). It would be nice and also fairly easily possible to retain this behavior. We could either save the fact in the original connection setup/adoption code that we took a socketpair and then use that in the connection lost handler, or we could simply re-check the same env. var and use its existence for the same.
  2. Cannot you work around this immediate problem in your current code by simply creating your stub with a WithOnClose() handler to prevent the stub internally doing a os.Exit(0) ? Your handler could either do nothing or, for instance using a channel, signal your upper layer logic to recreate (or Stop()/Start()) the stub, if you are neither running it with stub.Run() nor waiting for its termination with stub.Wait() ?

klihub avatar Sep 10 '24 08:09 klihub

ping @attlee-wang

klihub avatar Sep 11 '24 20:09 klihub

  1. Cannot you work around this immediate problem in your current code by simply creating your stub with a WithOnClose() handler to prevent the stub internally doing a os.Exit(0) ? Your handler could either do nothing or, for instance using a channel, signal your upper layer logic to recreate (or Stop()/Start()) the stub, if you are neither running it with stub.Run() nor waiting for its termination with stub.Wait() ?

@klihub It's not that I don't use WithOnClose(). I looked at the stub.go package and exist stub.Stop(). I subjectively thought that stub.Stop() could be used. After all, WithOnClose() is just an optional function.

attlee-wang avatar Sep 24 '24 12:09 attlee-wang

I think https://github.com/containerd/nri/pull/173 resolves the same problem, but does so in a way that exposes the error to the plugin better. Thanks for your contribution @attlee-wang.

samuelkarp avatar May 23 '25 18:05 samuelkarp