errgroup
errgroup copied to clipboard
Deadlock when calling Wait and Go from different goroutines
Hi!
I'm using 0.1.5 with Go 1.16.3.
First, thank you for making the package! It's a great idea!
As to the issue: I've encountered what I believe is a deadlock due to my use of this package.
My program (a daemon) does this, among other things:
- Creates an errgroup
- Calls Go() on it
- This goroutine (goroutine A) periodically receives some work to do and then starts worker goroutines, also with Go() on the same errgroup
Periodically my program shuts down. As part of that, it calls Wait() on the errgroup.
The situation I have observed is this:
- The program needs to shut down
- We acquire the lock on the mutex in Wait() (this is in the main goroutine) (https://github.com/neilotoole/errgroup/blob/master/errgroup.go#L89-L99)
- We Wait on the WaitGroup
- Concurrently, the first new goroutine (goroutine A) goes to start a worker. It ends up in Go() trying to lock the mutex too (https://github.com/neilotoole/errgroup/blob/master/errgroup.go#L122)
- The program never exits because we're waiting for goroutine A to exit, but it can't proceed
Stack traces:
goroutine 1 [semacquire, 68 minutes]:
sync.runtime_Semacquire(0xc00055ed30)
/usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00055ed28)
/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/neilotoole/errgroup.(*Group).Wait(0xc00055ed20, 0xc000136008, 0xc000609d18)
/home/wstorey_maxmind_com/go/pkg/mod/github.com/neilotoole/[email protected]/errgroup.go:99 +0x65
github.maxmind.com/maxmind/mm_website/go/shared/pubsubworker.Run(0x129a780, 0xc000543e40, 0xc0005977e0, 0x129d0f8, 0xc000560320, 0x1, 0x12869c0, 0xc000136670, 0x0, 0x0)
/home/wstorey_maxmind_com/mm_website/go/shared/pubsubworker/worker.go:121 +0x465
github.maxmind.com/maxmind/mm_website/go/process-job-pubsub/worker.DaemonFromCLI(0x129a780, 0xc00011a000)
/home/wstorey_maxmind_com/mm_website/go/process-job-pubsub/worker/worker.go:158 +0x7a5
main.main()
/home/wstorey_maxmind_com/mm_website/go/process-job-pubsub/main.go:10 +0x39
goroutine 41 [semacquire, 68 minutes]:
sync.runtime_SemacquireMutex(0xc00055ed6c, 0x0, 0x1)
/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00055ed68)
/usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/neilotoole/errgroup.(*Group).Go(0xc00055ed20, 0xc000769880)
/home/wstorey_maxmind_com/go/pkg/mod/github.com/neilotoole/[email protected]/errgroup.go:122 +0x190
github.maxmind.com/maxmind/mm_website/go/shared/pubsubworker.(*Worker).process(0xc000547b30, 0x129a748, 0xc000543e40, 0xc00055ed20)
/home/wstorey_maxmind_com/mm_website/go/shared/pubsubworker/worker.go:190 +0x155
What do you think?
@horgh Thanks for the kind words, much appreciated. It's great to see code in action.
I'll freely admit I had not considered (or tested for) this scenario. It's very interesting.
Just running through your example in my head, yes, I see the deadlock.
It seems like it should be possible to distill this situation down to a test case, and easily arrive at that deadlock (I haven't done that yet, basically due to lack of time).
I suppose the question is: if we had that test case, would sync/errgroup
fail at the same deadlock as neilotoole/errgroup
? I think this package should still emulate the behavior of sync/errgroup
(even in the case of deadlock), as the goal is to be a "drop-in" replacement for sync/errgroup
. On the other hand, if sync/errgroup
does not experience deadlock, but this package does, then this package has failed as a "drop-in".
Ideally, could you provide a test case that demonstrates the deadlock? Either which way, I want to get to the bottom of this scenario. It's a real-world situation that at a minimum needs to be documented, but hopefully solved.
Thank you for looking at it!
This reproduces the issue:
package main
import (
"context"
"log"
"time"
"github.com/neilotoole/errgroup"
// "golang.org/x/sync/errgroup"
)
func main() {
ctx := context.Background()
eg, _ := errgroup.WithContext(ctx)
eg.Go(func() error { return g1(eg) })
if err := eg.Wait(); err != nil {
log.Fatal(err)
}
}
func g1(eg *errgroup.Group) error {
time.Sleep(100 * time.Millisecond)
eg.Go(func() error { return nil })
return nil
}
When run this shows: fatal error: all goroutines are asleep - deadlock!
It does not happen with golang.org/x/sync/errgroup
. It looks like that implementation doesn't use a mutex directly so it is avoiding the situation.
I added a test here in the test package too - https://github.com/neilotoole/errgroup/compare/master...horgh:horgh/deadlock-test?expand=1
I'm kinda wondering now if golang.org/x/sync/errgroup
is not designed for this either. It won't deadlock, but I don't know that it is valid to call Go()
and Wait()
on a Group
from it concurrently. I think we could end up with a race where we still start a goroutine despite Wait()
returning. It won't deadlock so its failure mode is better, but it's not correct either. Perhaps this should simply be documented or something. This comment made me think about this.