go-libp2p-swarm

send notifications synchronously

Open · marten-seemann opened this issue on Dec 09 '21 · 3 comments

notifyAll creates a lot of allocations (and therefore pressure on the GC) by notifying asynchronously. This shows up in scenarios where we receive a lot of connections / streams at the same time.

Building with -gcflags=-m confirms why notifyAll is so allocation-heavy:

./swarm.go:508:7: leaking param: s
./swarm.go:508:27: leaking param: notify
./swarm.go:514:11: leaking param: f
./swarm.go:509:6: moved to heap: wg
./swarm.go:514:6: func literal escapes to heap
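
The escape pattern is easy to reproduce in isolation. Below is a hypothetical, minimal standalone repro (not from the swarm codebase); building it with go build -gcflags=-m produces the same kind of diagnostics, because the closure and the WaitGroup outlive the stack frame once they are handed to new goroutines:

// escape.go — minimal repro of the escape pattern (hypothetical file,
// not part of go-libp2p-swarm). Build with: go build -gcflags=-m escape.go
package main

import "sync"

func notifyAll(notifiees []func(), notify func(func())) {
    var wg sync.WaitGroup // moved to heap: shared with the goroutines below
    wg.Add(len(notifiees))
    for _, f := range notifiees {
        go func(f func()) { // each iteration allocates a goroutine + closure
            defer wg.Done()
            notify(f)
        }(f)
    }
    wg.Wait()
}

func main() {
    notifyAll([]func(){func() {}}, func(f func()) { f() })
}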

Note that there was some discussion about this a long time ago: https://github.com/libp2p/go-libp2p-swarm/pull/104#issuecomment-465095630. I'd argue though that if there's one misbehaving consumer, we have a problem no matter whether we send the notifications synchronously or asynchronously, as notifyAll still waits for all notifications to be processed. The real question therefore is whether the cost of spawning the goroutines outweighs what we gain from handling these notifications concurrently.
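
For concreteness, the synchronous variant would amount to something like the following sketch, assuming the s.notifs layout visible in the profile further down (this is not an actual patch):

// Sketch of a synchronous notifyAll: no goroutines, no WaitGroup, so
// zero per-notifiee allocations, at the cost of running the callbacks
// one after another while holding the read lock.
func (s *Swarm) notifyAll(notify func(network.Notifiee)) {
    s.notifs.RLock()
    defer s.notifs.RUnlock()
    for f := range s.notifs.m {
        notify(f)
    }
}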

Thanks to @mvdan for his help debugging this.

marten-seemann — Dec 09 '21 04:12

I'd argue though that if there's one misbehaving consumer, we have a problem no matter whether we send the notifications synchronously or asynchronously, as notifyAll still waits for all notifications to be processed.

I agree with that reasoning.

If we are worried about misbehaving consumers, perhaps we should add a context.Context and use a timeout per notification callback. It would still be possible for a callback to ignore the context and hang forever, but hopefully it would be a harder mistake to make.
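
As a rough illustration of that idea (hypothetical: network.Notifiee callbacks don't currently take a context, so the signature below is invented):

// Hypothetical sketch: bound each callback with a per-notification
// deadline. Assumes callbacks were changed to accept a context; as
// noted above, a callback that ignores ctx can still hang forever.
// Requires the "context" and "time" imports.
func (s *Swarm) notifyAll(notify func(context.Context, network.Notifiee)) {
    const perCallbackTimeout = time.Second // arbitrary illustrative value
    s.notifs.RLock()
    defer s.notifs.RUnlock()
    for f := range s.notifs.m {
        ctx, cancel := context.WithTimeout(context.Background(), perCallbackTimeout)
        notify(ctx, f)
        cancel()
    }
}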


For some numerical context on why this is very important: when a libp2p node deals with a large number of new streams, this particular bit of code accounts for about a third of all allocations, which puts a significant amount of pressure on the runtime:

         .          .    509:func (s *Swarm) notifyAll(notify func(network.Notifiee)) {
     98305      98305    510:   var wg sync.WaitGroup
         .          .    511:
         .      10923    512:   s.notifs.RLock()
         .          .    513:   wg.Add(len(s.notifs.m))
         .          .    514:   for f := range s.notifs.m {
   5051867    5051867    515:           go func(f network.Notifiee) {
         .          .    516:                   defer wg.Done()
         .          .    517:                   notify(f)
         .          .    518:           }(f)
         .          .    519:   }
         .          .    520:
         .      27309    521:   wg.Wait()
         .          .    522:   s.notifs.RUnlock()

The above is from a 10-second go tool pprof -alloc_objects run. The goroutine launch alone allocates ~5M objects, making it the number one cause of allocations by a wide margin; the next entries in the top 10 allocate on the order of tens to hundreds of thousands of objects, orders of magnitude less.
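
For reference, a profile like this can be captured from a running node via net/http/pprof. A generic sketch, not necessarily the exact setup used to gather the numbers above:

// Generic sketch of exposing allocation profiles on a running node.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
    // ... start the libp2p node as usual ...

    // Expose the profiling endpoints; then, from a shell:
    //   go tool pprof -alloc_objects 'http://localhost:6060/debug/pprof/allocs?seconds=10'
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}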

mvdan — Dec 09 '21 10:12

These arguments are quite convincing, and I pretty much agree.

Still, this is something Steven might feel very strongly about, so let's wait for him.

vyzo — Dec 09 '21 10:12

For sure, happy to wait for the original authors of the code :) Just wanted to add a bit more context.

mvdan — Dec 09 '21 10:12