netlink icon indicating copy to clipboard operation
netlink copied to clipboard

*Subscribe* family of functions leaks goroutines

Open jncornett opened this issue 5 years ago • 5 comments

Contingent on how long it takes to actually start reading from the socket fd, the following code will never return:

package main

import (
	"time"

	"github.com/vishvananda/netlink"
)

func main() {
	updates := make(chan netlink.AddrUpdate)
	done := make(chan struct{})
	netlink.AddrSubscribe(updates, done)
	time.Sleep(time.Second)
	close(done)
	<-updates
}

This is apparently because:

  1. there is no cancellation mechanism in the call to s.Receive()1.
  2. the initialization code does not set a read timeout for the socket (via, e.g. setsockopt).

This isn't a big deal for short-lived programs (as the goroutines will get cleaned up on exit), but I was writing a long lived daemon that creates and removes subscriptions based on configuration file updates. With the current behavior, it is not possible to write such a program without leaking goroutines via the above mechanism.

jncornett avatar May 25 '20 16:05 jncornett

I don't have the context to know if this makes sense for all use cases, but a simple fix would be to duplicate nl.NetlinkSocket.Receive to a new method nl.NetlinkSocket.ReceiveWithTimeout. Inside of this new method set a userspace timeout via unix.Select(...) immediately prior to the call to unix.Recvfrom(...). The *Subscribe* family of functions could use this new ReceiveWithTimeout method in place of Receive. If the return value of the error is Timeout (I know, checking error return values is anti-pattern), then the loop would just continue. As far as I know, this seems like the least invasive change to "cancel" a recvfrom()/recvmsg() call.

jncornett avatar May 25 '20 17:05 jncornett

unix.RecvFrom() can be unblocked by calling unix.Shutdown() with unix.SHUT_RD (or unix.SHUT_RDWR if you also want to unblock unix.SendTo()).

vincentbernat avatar Jun 26 '22 20:06 vincentbernat

I didn't test, but I think this patch would help:

diff --git a/nl/nl_linux.go b/nl/nl_linux.go
index 600b942b1785..72104b93ef2b 100644
--- a/nl/nl_linux.go
+++ b/nl/nl_linux.go
@@ -734,6 +734,7 @@ func SubscribeAt(newNs, curNs netns.NsHandle, protocol int, groups ...uint) (*Ne

 func (s *NetlinkSocket) Close() {
        fd := int(atomic.SwapInt32(&s.fd, -1))
+       unix.Shutdown(fd, unix.SHUT_RDWR)
        unix.Close(fd)
 }

In fact, it seems this is not even needed on recent kernel anymore. Just close(done) seems to do the trick.

vincentbernat avatar Jun 26 '22 20:06 vincentbernat

I noticed the same issue on Fedora 35 with kernel 5.18.6-100.fc35.x86_64. Shouldn't that be recent enough?

Hendrik-H avatar Jul 09 '22 17:07 Hendrik-H

SHUT_RDWR netlink return operation not supported

method of setting recv timeout is also ok

image image

before edit, seeing that some goroutines did not exit

image

after edit, look good

image

maxiloEmmmm avatar Nov 15 '23 09:11 maxiloEmmmm