go-libp2p-swarm
send notifications synchronously
notifyAll is creating a lot of allocations (and therefore pressure on the GC) by notifying async. This shows up in scenarios where we receive a lot of connections / streams at the same time.
-gcflags=-m confirms why notifyAll is pretty allocation-heavy:
./swarm.go:508:7: leaking param: s
./swarm.go:508:27: leaking param: notify
./swarm.go:514:11: leaking param: f
./swarm.go:509:6: moved to heap: wg
./swarm.go:514:6: func literal escapes to heap
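In other words: the func literal captures notify and wg, so each go func(...) allocates a closure on the heap and wg itself has to move to the heap, once per notifiee per event. The output above can be reproduced from a checkout with something like the following (the package path is illustrative):

go build -gcflags=-m ./... 2>&1 | grep swarm.go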
Note that there was some discussion about this a long time ago: https://github.com/libp2p/go-libp2p-swarm/pull/104#issuecomment-465095630. I'd argue though that if there's one misbehaving consumer, we have a problem no matter if we send the notifications sync or async, as notifyAll still waits for all notifications to be processed.
The real question, therefore, is whether the cost of spawning the goroutines is less than what we gain from handling these notifications concurrently.
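To make the trade-off concrete, here's a minimal sketch of the synchronous version (assuming the existing fields shown in the profile below; not a tested patch):

func (s *Swarm) notifyAll(notify func(network.Notifiee)) {
    s.notifs.RLock()
    defer s.notifs.RUnlock()
    for f := range s.notifs.m {
        // no goroutine, no func literal, no WaitGroup: nothing escapes here
        notify(f)
    }
}

The cost is that a slow notifiee now delays all the others, which is exactly the concern raised in the linked discussion.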
Thanks to @mvdan for his help debugging this.
I'd argue though that if there's one misbehaving consumer, we have a problem no matter if we send the notifications sync or async, as notifyAll still waits for all notifications to be processed.
I agree with that reasoning.
If we are worried about misbehaving consumers, perhaps we should add a context.Context and use a timeout per notification callback. It would still be possible for a callback to ignore the context and hang forever, but hopefully it would be a harder mistake to make.
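A rough sketch of that idea (notifyOne is a hypothetical helper, not an existing API): the context bounds how long we wait, though it can't stop a callback that never returns:

func notifyOne(ctx context.Context, n network.Notifiee, notify func(network.Notifiee)) {
    done := make(chan struct{})
    go func() {
        notify(n)
        close(done)
    }()
    select {
    case <-done:
        // callback finished in time
    case <-ctx.Done():
        // deadline hit: stop waiting; a hung callback's goroutine still leaks
    }
}

Note that this keeps a goroutine per callback, so it addresses misbehaving consumers rather than the allocation pressure this issue is about.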
For some numerical context on why this is important: when a libp2p node deals with large numbers of new streams, this particular bit of code accounts for about a third of all allocations, which puts significant pressure on the runtime:
. . 509:func (s *Swarm) notifyAll(notify func(network.Notifiee)) {
98305 98305 510: var wg sync.WaitGroup
. . 511:
. 10923 512: s.notifs.RLock()
. . 513: wg.Add(len(s.notifs.m))
. . 514: for f := range s.notifs.m {
5051867 5051867 515: go func(f network.Notifiee) {
. . 516: defer wg.Done()
. . 517: notify(f)
. . 518: }(f)
. . 519: }
. . 520:
. 27309 521: wg.Wait()
. . 522: s.notifs.RUnlock()
The above is from a go tool pprof -alloc_objects profile spanning 10s. That goroutine allocates ~5M objects, making it the number one cause of allocations by a wide margin. The next entries in the top 10 allocate on the order of tens to hundreds of thousands of objects, orders of magnitude less.
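For anyone wanting to reproduce this: assuming the node exposes net/http/pprof (the localhost:6060 address below is illustrative), a 10s delta profile of allocation counts plus the per-line listing above can be captured with:

go tool pprof -alloc_objects 'http://localhost:6060/debug/pprof/allocs?seconds=10'
(pprof) list notifyAll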
These arguments are quite convincing, and I pretty much agree.
Still, this is something Steven might feel very strongly about, so let's wait for him.
For sure, happy to wait for the original authors of the code :) Just wanted to add a bit more context.