go proposal: x/sync/errgroup: propagate panics and Goexits through Wait

Background

The handling of panics and calls to runtime.Goexit in x/sync/errgroup has come up several times in its history:

#40484 asked why we don't recover panics in goroutines.
#49802 proposed a separate panicgroup API to propagate or handle panics
CL 131815 pointed out that calls to t.Fatal and/or t.Skip within a Group in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.
- (Compare #15758, #3800.)
In my GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and runtime.Goexit calls) back to the caller's goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)

Proposal

I propose that:

The (*Group).Wait method should continue to wait for all goroutines in the group to exit, However, once that condition is met, if any of the goroutines in the group terminated with an unrecovered panic, Wait should panic with a value wrapping the first panic-value recovered from a goroutine in the group. Otherwise, if any of the goroutines exited via runtime.Goexit Wait should invoke runtime.Goexit on its own goroutine.
- Because the runtime does not support saving and restoring the stack trace of a recovered panic, the value passed to panic by Wait should include a best-effort stack dump for the goroutine that initiated the panic.
- Because some packages may use recover for error-handling (despite our advice to the contrary), if the recovered value implements the error interface, the value passed to panic by Wait should also implement the error interface, and should wrap the recovered error (so that it can be retrieved by errors.Unwrap).
The Context value returned by errgroup.WithContext should be canceled as soon as any function call in the group returns a non-nil error, panics, or exits via runtime.Goexit.
- All of these conditions indicate that Wait has an abnormal status to report, and thus should shut down all work associated with the Group so that the abnormal status can be reported quickly.

Specifically, if Wait panics, the panic-value would have either type PanicValue or type PanicError, defined as follows:

// A PanicError wraps an error recovered from an unhandled panic
// when calling a function passed to Go or TryGo.
type PanicError struct {
	Recovered error
	Stack     []byte
}

func (p PanicError) Error() string {
	// A Go Error method conventionally does not include a stack dump, so omit it
	// here. (Callers who care can extract it from the Stack field.)
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

func (p PanicError) Unwrap() error { return p.Recovered }

// A PanicValue wraps a value that does not implement the error interface,
// recovered from an unhandled panic when calling a function passed to Go or
// TryGo.
type PanicValue struct {
	Recovered interface{}
	Stack     []byte
}

func (p PanicValue) String() string {
	if len(p.Stack) > 0 {
		return fmt.Sprintf("recovered from errgroup.Group: %v\n%s", p.Recovered, p.Stack)
	}
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

Compatibility

Any program that today initiates an unrecovered panic within a Go or TryGo callback terminates due to that unrecovered panic, So recovering and propagating such a panic can only change broken programs into non-broken ones; it cannot break any program that was not already broken.

A valid program could in theory call runtime.Goexit from within a Go callback today. However, the vast majority of calls to runtime.Goexit are via testing.T methods, and according to the documentation for those methods today they “must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test.” Moreover, it would be possible to implement the documented errgroup.Group API today in a way that would cause Wait to always deadlock if runtime.Goexit were called, so any caller relying on the existing runtime.Goexit behavior is assuming an implementation detail that is not guaranteed.

In light of the above, I believe that the proposed changes are backward-compatible.

Jul 08 '22 18:07 bcmills

Change https://go.dev/cl/416555 mentions this issue: errgroup: propagate panics and goexits from goroutines in the group

Jul 08 '22 18:07 gopherbot

(CC @neild @rogpeppe @kevinburke from CL 134395.)

Jul 08 '22 18:07 bcmills

Want to try writing out the doc update for Group, Wait, and Go?

Jul 08 '22 18:07 ianlancetaylor

See https://go.dev/cl/416555 for a draft. 😁

Jul 08 '22 18:07 bcmills

I'm not sure about the compatibility claim. If the program was exiting due to unrecovered panic before, now it's not, and it might be in a more broken state as it keeps running.

I'd feel better if this was only about panic and not Goexit. Nothing should be using Goexit. Is the problem that errgroup is being used and the functions call t.Fatal or t.FailNow?

Jul 13 '22 17:07 rsc

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

Jul 13 '22 18:07 rsc

If the program was exiting due to unrecovered panic before, now it's not, and it might be in a more broken state as it keeps running.

That's true, although in general the program won't continue running for long anyway: if the group has an associated Context it will be canceled when the panic occurs, and most errgroup uses should either perform bounded work (in which case the work done after the panic likely could have been plausibly scheduled before the panic anyway) or be responsive to cancellation.

I suppose that one alternative would be to propagate the panic to Wait as soon as it occurs instead of waiting for the other goroutines in the group to complete, but then if that panic is itself recovered that would leave the remaining goroutines running in an orphaned state. That seems even more dangerous to me.

Jul 13 '22 20:07 bcmills

Nothing should be using Goexit. Is the problem that errgroup is being used and the functions call t.Fatal or t.FailNow?

Not that it is, but that it isn't — errgroup could be a really useful way to coordinate, say, goroutines running test-helpers (such as mock servers) in a way that cleanly isolates them between independent tests.

But today errgroup is only of limited use in tests, because the callbacks passed to errgroup cannot safely invoke any test-helper that might call Skip or Fatal.

Jul 13 '22 20:07 bcmills

It seems plausible, although it also seems plausible that it might break things. It would be nice not to incur the expense of recording the stack separately if that's possible - commented on the CL. Because it is in x/sync it is easier to roll back if we do find that it is a breaking change, so it may be OK to move forward.

Does anyone object to this change?

Jul 20 '22 17:07 rsc

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

Jul 27 '22 18:07 rsc

There were a couple objections on the original CL 134395, mostly relating to the complexity of the "double defer sandwich" technique it used, which appears to still be used in the new CL. Have those objections been withdrawn?

It is not clear to me how this new proposal differs from the old draft CL.

Jul 27 '22 19:07 magical

So recovering and propagating such a panic can only change broken programs into non-broken ones; it cannot break any program that was not already broken.

This depends on the notion of "broken" we use. It is certainly correct if we use "broken" to mean "crashing". But changing a crashing program into a non-crashing (but still broken) one is often worse. It essentially changes a crash-failure model into a byzantine-failure model, which is harder to reason about.

I don't know how often this will be relevant in practice. But as staunch a defender of "panic should crash the program" I feel a bit queasy. I would not be happy if I ever happened on code which I intended to crash which instead continues to run in a broken state.

That's true, although in general the program won't continue running for long anyway:

I disagree with this. If the errgroup is used in an http.Handler, that handler will recover the panic, if you propagate it from Wait. Similar for gRPC, I think. Given how frequent this kind of program is, I would argue that probably in most cases, the panic won't terminate the program after this transformation.

Jul 27 '22 21:07 Merovius

@magical, the comments on the original CL are code-review comments. This is a proposal, and part of the point of the proposal process is to resolve the sorts of design questions raised on that CL.

Jul 28 '22 20:07 bcmills

@Merovius, I can't reasonably consider a Go program that exits by unhandled-panic to be anything other than “broken”.

That said, I agree that it's important for Go users to be able to diagnose broken programs, because we all can and do make mistakes. And you're correct that propagating panics in this way would convert some (unknown) fraction of programs that currently fail by unhandled-panic to instead fail by deadlock, which I agree is often harder to diagnose.

Given how frequent this kind of program is, I would argue that probably in most cases, the panic won't terminate the program after this transformation.

Under this proposal, the panic would potentially be delayed in time but (unlike in the case of fmt and net/http) not permanently swallowed unless something upstream of the errgroup swallows it. So the behavior of the panic would still be about as reliable as panic ever is in a Go program: it would still terminate the program provided that the goroutines in the group eventually exit, and today those other goroutines can already continue running while the panic is happening (they just tend not to do so for long).

Some subset of programs that would have terminated by panic would now recover from that panic (due to a recover in some upstream caller) and continue to run with the failure successfully contained. Some subset would now terminate by a different path (for example, a recover() and log.Fatal at the root of the handler's call graph instead of an unhandled panic in a leaf call). And some subset would deadlock or leak resources for longer than they previously would have.

However, it's worth pointing out that all of those subsets are contained within the broader set of “programs affected by unexpected-panic bugs”, and — especially now that we have native fuzzing as of Go 1.18! — those unexpected-panic bugs are themselves tractable to uncover and fix through testing.

It's also worth noting that the only thing that could hold up the panic from propagating is some goroutine in the group that is either blocked or still actively running, and in both of those cases that goroutine will continue to show up in a goroutine dump. That is markedly different from, say, the sort of deadlock that you get when a panic occurs unexpectedly before a call to (*sync.Mutex).Unlock, where the goroutine that was holding the lock disappears entirely.

@rsc and I discussed the implementation a bit last week, and we think it's possible to structure the panic-recovery such that the panicking goroutine remains alive (but blocked) until the panic has propagated to Wait. That would make deadlocks even more straightforward to diagnose, because a goroutine dump of a deadlocked program would show the panicking goroutine's stack. Would that help to address your concerns?

Jul 28 '22 20:07 bcmills

While I agree that this change is quite desirable, I'm also not convinced by the backward compatibility of such a change.

Nowhere in the errgroup docs we say that Wait() must be called. Under the current proposal, AFAICT, the panic would be silently swallowed if Wait() was never called.

One could argue that not calling Wait() is a weird use of errgroup, and I would agree. But even if we overlook that specific problem, there is nothing that says that Wait() must be called "soon", so for example consider:

func someWorker(ch <-chan Something) error {
   eg, ctx := errgroup.WithContext(context.Background())
   for _, s := range ch {
     eg.Go(s.Fn(ctx))
   }
   return eg.Wait()
}

This usage is perfectly legitimate but would delay the panic by a potentially unbounded amount of time (when the channel is closed, that could happen e.g. when the worker is shutting down gracefully at process exit). We could say that not just Wait but also Go and TryGo should panic in this case... but it's still a bit icky because there's no guarantee that new items will arrive on the channel (and that therefore Go/TryGo are called).

As I wrote, I am in almost complete agreement this would be desirable - maybe in a v2 of the library? or as an opt-in behavior in the current version? - but in v1 I would argue changing the default behavior may silently break valid assumptions made by users in potentially very dangerous ways.

Aug 03 '22 01:08 CAFxX

@bcmills To me, while both "concurrently running goroutines will continue to execute while a panic unwinds the stack" and "the program will continue to run when something upstack from Wait recovers the panic" both create a situation where a program runs in an undefined state for some amount of time, the quantitative difference between these two is large enough to be a qualitative one.

I'm not vehemently opposed to this change. I think the cases where it would lead to real, practical problems are very rare. But I don't think the change in behavior can just be swiped away as "eh, it's broken anyways".

Moreover, I'm personally not convinced that this is a good change in and off itself.

CL 131815 pointed out that calls to t.Fatal and/or t.Skip within a Group in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.

I would argue that this is simply observing that calling t.Fatal from a goroutine is a bug and that this bug is hard to diagnose in some cases. The fact that it is a bug, is called out in the docs.

This change would make it not-a-bug in some cases, but we certainly can't make it not-a-bug in all cases. So the restriction would have to stay in the docs and people will continue to have to be aware of it. Arguably, to someone who is aware of it, t.Skip working when done in an errgroup is making the program behave less as expected, not more.

In my GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and runtime.Goexit calls) back to the caller's goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)

I don't understand the scenario of concern.

Is the scenario that a library author takes a callback (e.g. as an interface) from the user and moves to calling that (multiple times) from an errgroup? Because in that case, this change isn't an "internal detail" anyways, concurrent calls can introduce races, so if a callback can be called multiple times concurrently, that has to be part of the API.
Is the scenario that a library author calls library code concurrently and, by doing that change, changes from a recovering program to a crashing program? In this case, yes, the behavior would change, but it would only do so because of a bug in the library. I feel that "maintaining backwards compatibility in the presence of bugs" is a bit of an unrealistic stretch goal.
Is the scenario that a library user moves from calling the library sequentially to calling it concurrently? In that case, yes, the behavior of the program would change if the library panics, but that seems hardly the concern of the library author and whether concurrency is internal to the library.
So, the only scenario left I can imagine is that the library author takes a callback and moves it into an errgroup, calling it exactly once. That seems a niche scenario. It certainly happens, but ISTM that if the library author is concerned about the behavior of their library in the presence of panics and wants to make the propagating, they can do so by recovering manually, right?

In general, ISTM that there are two different mindsets regarding panic/recover in the community: One of "a panic should ~always crash the program" and one of "a panic should ~always be recovered on a best-effort basis". I think the core problem is there is no consistent application of either of these mindsets in the community or the standard library and it leads to the worst situation for both kinds of people. As some libraries work from one mindset and some work from the other, an operator (or the author of an end-user program) can neither rely on panics crashing the program, nor rely on panics being recovered. Neither can a library author use panic as a reliable signal that something is wrong.

I find that frustrating. And to me, this change muddies the waters even further. But, as I said, I'm not vehemently opposed. If it happens, it happens. Without a consistent decision about which mindset to apply, the waters will stay muddy either way, this change on its own won't really move the situation much in either direction.

Aug 03 '22 05:08 Merovius

Under the current proposal, AFAICT, the panic would be silently swallowed if Wait() was never called.

This is an interesting point. I will leave this as 'likely accept' for another week to allow the discussion to see if we can come to an agreement about how much this matters. Thanks.

Aug 03 '22 17:08 rsc

I support the addition of the "recover" behavior to the errgroup as an option. if it was set, we can replace f() with recoveredFn(f)()

func recoveredFn(f func() error) func() error {
	return func() (err error) {
		defer func() {
			if r := recover(); r != nil {
				buf := make([]byte, 64<<10)
				buf = buf[:runtime.Stack(buf, false)]
				err = fmt.Errorf("errgroup: panic recovered: %s\n%s", r, buf)
			}
		}()
		return f()
	}
}

Aug 08 '22 11:08 chengyayu

I support the addition of the "recover" behavior to the errgroup as an option.

It seems to me that if the behavior is optional, there won't be many users who enable it.

How would the option look like? Field on errgroup? Context value? Global environment variable? Something else?

Field on the errgroup requires the user to add code at the errgroup call site. If you need to do that, you can also directly wrap the worker function to recover any panics. This option also leaves any third-party code from libraries with the behavior disabled.

You could put a value into context to control the behavior. This would work across library boundaries. However, using context feels like a hack to me as it changes the behavior of the program. Also there are errgroups created without passing context.

Global environment variable would control the behavior for a single run of the program and would work across library boundaries. However, if you want to have an option to decide the behavior per-program, why not make it compile time - fork x/sync and replace the Go module with a custom version of the errgroup?

Aug 08 '22 17:08 martin-sucha

Moving back to active. I think the Wait issue needs more discussion and an explicit decision.

Aug 10 '22 17:08 rsc

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

Aug 12 '22 03:08 rsc

@CAFxX

Nowhere in the errgroup docs we say that Wait() must be called. Under the current proposal, AFAICT, the panic would be silently swallowed if Wait() was never called.

That's true, but until very recently (https://go.dev/cl/405174 / #27837) the only point to using an errgroup without calling Wait would be in order to cancel an associated Context on error. (After that, the other point to using an errgroup without calling Wait might be as a simple semaphore.)

even if we overlook that specific problem, there is nothing that says that Wait() must be called "soon"

That's true, although I think the range ch loop is also fairly contrived. A more realistic example involving errgroup.WithContext should probably stop adding work when the associated Context is cancelled, which would occur as soon as the panic is detected.

I'd be more worried about the cases that use the zero errgroup.Group, but that suggests another alternative: since errgroup.WithContext has a way to report a pending failure to the other workers in the group (by cancelling the Context), perhaps it should propagate panics — but the zero errgroup.Group, lacking such a mechanism, should not?

Aug 17 '22 21:08 bcmills

@Merovius

I would argue that this is simply observing that calling t.Fatal from a goroutine is a bug and that this bug is hard to diagnose in some cases. The fact that it is a bug, is called out in the docs.

If this proposal is accepted, then I intend to file a separate issue to clarify the testing documentation. There is no fundamental reason why t.FailNow and t.SkipNow and the methods that wrap them should dictate users' goroutine structure: instead, they should describe exactly what they do (mark the test as failed / skipped and then terminate the current goroutine by calling runtime.Goexit), and perhaps suggest a simple default practice (call Fatal from the main goroutine), but leave the detailed implications up to the user (that is: make simple uses simple, and complex uses possible).

I don't understand the scenario of concern.

The main motivating scenario is: a library accepts a callback and executes it one or more times sequentially. The library is changed to also do some other concurrent work. The concurrency properties of the callback are otherwise unchanged: all of the happens-before relationships that used to exist continue to do so. Why should the callback need to dictate whether it is called on the “main” goroutine or some other goroutine?

In general, ISTM that there are two different mindsets regarding panic/recover in the community: One of "a panic should ~always crash the program" and one of "a panic should ~always be recovered on a best-effort basis". I think the core problem is there is no consistent application of either of these mindsets in the community or the standard library and it leads to the worst situation for both kinds of people.

I agree. However, I would argue that that conflict leads to a clear conclusion for library authors: a library cannot presume either behavior. A library must not panic in normal operation (because the panic might crash the program), but also must not assume that a panic from (or through) the library will not be recovered (so, for example, it must defer any unlock operations before executing any user-provided callbacks).

The purpose of this proposal is to make it easier to use errgroup to avoid making unwarranted assumptions. Today, a library that uses errgroup implicitly takes the position that “a panic should always crash the program”, and the user has to write a fair bit of boilerplate to obtain any other behavior. However, if this proposal is accepted, then errgroup will be agnostic to recovery behavior: the panic will neither immediately crash the program nor be permanently recovered and buried, but will instead adopt the recovery behavior of the caller of Wait, whatever that may be. (The user would still be free to take a different position by calling recover in the Go callbacks, but the default behavior would no longer bias toward “crash unrecoverably”.)

Aug 17 '22 21:08 bcmills

Keying off the context doesn't seem like the right way to define semantics here.

Given that x/sync/errgroup is in x and can be rolled back easily, it seems like maybe we should try the change, with the "breakage", and see if anyone runs into problems. If they do, we'll know that we can't move forward with this change. And otherwise, we'll have done it.

Thoughts?

Aug 31 '22 17:08 rsc

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

Sep 07 '22 19:09 rsc

Under the current proposal, AFAICT, the panic would be silently swallowed if Wait() was never called.

This is an interesting point. I will leave this as 'likely accept' for another week to allow the discussion to see if we can come to an agreement about how much this matters. Thanks.

To at least partially address the "not calling Wait silently swallows the panic" we could take a page out of the runtime.Pinner playbook: if Go/TryGo recovers a panic, and Wait is not called before the errgroup is GCed, panic in the finalizer. (Not sure how much of a good idea this would be, just bringing it up for discussion).

This won't help in case the GC is disabled, or in some of the other issues raised (my worker loop example, or its extension to the zero errgroup) but it would at least help in more ordinary cases.

Alternatively we could document that Wait must be called, or the panic will be silently swallowed. And also that the panic will be delayed until the Wait returns, so new work submitted may start executing even if a work item has already paniced.

I'd be more worried about the cases that use the zero errgroup.Group, but that suggests another alternative: since errgroup.WithContext has a way to report a pending failure to the other workers in the group (by cancelling the Context), perhaps it should propagate panics — but the zero errgroup.Group, lacking such a mechanism, should not?

For the zero errgroup especially I can't say what would be the least surprising behavior, but I would argue that not propagating the panic would be quite surprising.

Sep 08 '22 05:09 CAFxX

Let's not drag finalizers in. If this turns out to be a real problem, we should probably think about undoing the change.

@aclements points out that we could also do a vet check for errgroups that are declared but Wait is never called.

Sep 21 '22 17:09 rsc

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

Sep 21 '22 18:09 rsc

What's the status of this?

Aug 06 '23 17:08 emil14

The proposal has been accepted and there is a work-in-progress change at https://go.dev/cl/416555.

Aug 07 '23 05:08 ianlancetaylor