errgroup Deadlock when calling Wait and Go from different goroutines

Hi!

I'm using 0.1.5 with Go 1.16.3.

First, thank you for making the package! It's a great idea!

As to the issue: I've encountered what I believe is a deadlock due to my use of this package.

My program (a daemon) does this, among other things:

Creates an errgroup
Calls Go() on it
This goroutine (goroutine A) periodically receives some work to do and then starts worker goroutines, also with Go() on the same errgroup

Periodically my program shuts down. As part of that, it calls Wait() on the errgroup.

The situation I have observed is this:

The program needs to shut down
We acquire the lock on the mutex in Wait() (this is in the main goroutine) (https://github.com/neilotoole/errgroup/blob/master/errgroup.go#L89-L99)
We Wait on the WaitGroup
Concurrently, the first new goroutine (goroutine A) goes to start a worker. It ends up in Go() trying to lock the mutex too (https://github.com/neilotoole/errgroup/blob/master/errgroup.go#L122)
The program never exits because we're waiting for goroutine A to exit, but it can't proceed

Stack traces:

goroutine 1 [semacquire, 68 minutes]:                                                                                                                                   
sync.runtime_Semacquire(0xc00055ed30)                                                                                                                                   
        /usr/local/go/src/runtime/sema.go:56 +0x45                                                                                                                      
sync.(*WaitGroup).Wait(0xc00055ed28)                                                                                                                                    
        /usr/local/go/src/sync/waitgroup.go:130 +0x65                                                                                                                   
github.com/neilotoole/errgroup.(*Group).Wait(0xc00055ed20, 0xc000136008, 0xc000609d18)                                                                                  
        /home/wstorey_maxmind_com/go/pkg/mod/github.com/neilotoole/[email protected]/errgroup.go:99 +0x65                                                                 
github.maxmind.com/maxmind/mm_website/go/shared/pubsubworker.Run(0x129a780, 0xc000543e40, 0xc0005977e0, 0x129d0f8, 0xc000560320, 0x1, 0x12869c0, 0xc000136670, 0x0, 0x0)
        /home/wstorey_maxmind_com/mm_website/go/shared/pubsubworker/worker.go:121 +0x465                                                                                
github.maxmind.com/maxmind/mm_website/go/process-job-pubsub/worker.DaemonFromCLI(0x129a780, 0xc00011a000)                                                               
        /home/wstorey_maxmind_com/mm_website/go/process-job-pubsub/worker/worker.go:158 +0x7a5                                                                          
main.main()                                                                                                                                                             
        /home/wstorey_maxmind_com/mm_website/go/process-job-pubsub/main.go:10 +0x39                                                                                     
                                                                                                                                                                        
                                                                                                                                                                        
goroutine 41 [semacquire, 68 minutes]:                                                                                                                                  
sync.runtime_SemacquireMutex(0xc00055ed6c, 0x0, 0x1)                                                                                                                    
        /usr/local/go/src/runtime/sema.go:71 +0x47                                                                                                                      
sync.(*Mutex).lockSlow(0xc00055ed68)                                                                                                                                    
        /usr/local/go/src/sync/mutex.go:138 +0x105                                                                                                                      
sync.(*Mutex).Lock(...)                                                                                                                                                 
        /usr/local/go/src/sync/mutex.go:81                                                                                                                              
github.com/neilotoole/errgroup.(*Group).Go(0xc00055ed20, 0xc000769880)                                                                                                  
        /home/wstorey_maxmind_com/go/pkg/mod/github.com/neilotoole/[email protected]/errgroup.go:122 +0x190                                                               
github.maxmind.com/maxmind/mm_website/go/shared/pubsubworker.(*Worker).process(0xc000547b30, 0x129a748, 0xc000543e40, 0xc00055ed20)                                     
        /home/wstorey_maxmind_com/mm_website/go/shared/pubsubworker/worker.go:190 +0x155

What do you think?

Apr 20 '21 18:04 horgh

@horgh Thanks for the kind words, much appreciated. It's great to see code in action.

I'll freely admit I had not considered (or tested for) this scenario. It's very interesting.

Just running through your example in my head, yes, I see the deadlock.

It seems like it should be possible to distill this situation down to a test case, and easily arrive at that deadlock (I haven't done that yet, basically due to lack of time).

I suppose the question is: if we had that test case, would sync/errgroup fail at the same deadlock as neilotoole/errgroup? I think this package should still emulate the behavior of sync/errgroup (even in the case of deadlock), as the goal is to be a "drop-in" replacement for sync/errgroup. On the other hand, if sync/errgroup does not experience deadlock, but this package does, then this package has failed as a "drop-in".

Ideally, could you provide a test case that demonstrates the deadlock? Either which way, I want to get to the bottom of this scenario. It's a real-world situation that at a minimum needs to be documented, but hopefully solved.

Apr 22 '21 06:04 neilotoole

Thank you for looking at it!

This reproduces the issue:

package main

import (
	"context"
	"log"
	"time"

	"github.com/neilotoole/errgroup"
	// "golang.org/x/sync/errgroup"
)

func main() {
	ctx := context.Background()

	eg, _ := errgroup.WithContext(ctx)
	eg.Go(func() error { return g1(eg) })

	if err := eg.Wait(); err != nil {
		log.Fatal(err)
	}
}

func g1(eg *errgroup.Group) error {
	time.Sleep(100 * time.Millisecond)
	eg.Go(func() error { return nil })
	return nil
}

When run this shows: fatal error: all goroutines are asleep - deadlock!

It does not happen with golang.org/x/sync/errgroup. It looks like that implementation doesn't use a mutex directly so it is avoiding the situation.

Apr 22 '21 19:04 horgh

I added a test here in the test package too - https://github.com/neilotoole/errgroup/compare/master...horgh:horgh/deadlock-test?expand=1

Apr 22 '21 19:04 horgh

I'm kinda wondering now if golang.org/x/sync/errgroup is not designed for this either. It won't deadlock, but I don't know that it is valid to call Go() and Wait() on a Group from it concurrently. I think we could end up with a race where we still start a goroutine despite Wait() returning. It won't deadlock so its failure mode is better, but it's not correct either. Perhaps this should simply be documented or something. This comment made me think about this.

May 06 '21 22:05 horgh

errgroup errgroup copied to clipboard

Deadlock when calling Wait and Go from different goroutines

errgroup
errgroup copied to clipboard