go
go copied to clipboard
proposal: x/sync/errgroup: propagate panics and Goexits through Wait
Background
The handling of panics and calls to runtime.Goexit
in x/sync/errgroup
has come up several times in its history:
- #40484 asked why we don't recover panics in goroutines.
- #49802 proposed a separate
panicgroup
API to propagate or handle panics -
CL 131815 pointed out that calls to
t.Fatal
and/ort.Skip
within aGroup
in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.- (Compare #15758, #3800.)
- In my GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and
runtime.Goexit
calls) back to the caller's goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)
Proposal
I propose that:
-
The
(*Group).Wait
method should continue to wait for all goroutines in the group to exit, However, once that condition is met, if any of the goroutines in the group terminated with an unrecoveredpanic
,Wait
should panic with a value wrapping the first panic-value recovered from a goroutine in the group. Otherwise, if any of the goroutines exited viaruntime.Goexit
Wait
should invokeruntime.Goexit
on its own goroutine.- Because the runtime does not support saving and restoring the stack trace of a recovered panic, the value passed to
panic
byWait
should include a best-effort stack dump for the goroutine that initiated the panic. - Because some packages may use
recover
for error-handling (despite our advice to the contrary), if the recovered value implements theerror
interface, the value passed topanic
byWait
should also implement theerror
interface, and should wrap the recovered error (so that it can be retrieved byerrors.Unwrap
).
- Because the runtime does not support saving and restoring the stack trace of a recovered panic, the value passed to
-
The
Context
value returned byerrgroup.WithContext
should be canceled as soon as any function call in the group returns a non-nil error, panics, or exits viaruntime.Goexit
.- All of these conditions indicate that
Wait
has an abnormal status to report, and thus should shut down all work associated with theGroup
so that the abnormal status can be reported quickly.
- All of these conditions indicate that
Specifically, if Wait
panics, the panic-value would have either type PanicValue
or type PanicError
, defined as follows:
// A PanicError wraps an error recovered from an unhandled panic
// when calling a function passed to Go or TryGo.
type PanicError struct {
Recovered error
Stack []byte
}
func (p PanicError) Error() string {
// A Go Error method conventionally does not include a stack dump, so omit it
// here. (Callers who care can extract it from the Stack field.)
return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}
func (p PanicError) Unwrap() error { return p.Recovered }
// A PanicValue wraps a value that does not implement the error interface,
// recovered from an unhandled panic when calling a function passed to Go or
// TryGo.
type PanicValue struct {
Recovered interface{}
Stack []byte
}
func (p PanicValue) String() string {
if len(p.Stack) > 0 {
return fmt.Sprintf("recovered from errgroup.Group: %v\n%s", p.Recovered, p.Stack)
}
return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}
Compatibility
Any program that today initiates an unrecovered panic
within a Go
or TryGo
callback terminates due to that unrecovered panic, So recovering and propagating such a panic
can only change broken programs into non-broken ones; it cannot break any program that was not already broken.
A valid program could in theory call runtime.Goexit
from within a Go
callback today. However, the vast majority of calls to runtime.Goexit
are via testing.T
methods, and according to the documentation for those methods today they “must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test.” Moreover, it would be possible to implement the documented errgroup.Group
API today in a way that would cause Wait
to always deadlock if runtime.Goexit
were called, so any caller relying on the existing runtime.Goexit
behavior is assuming an implementation detail that is not guaranteed.
In light of the above, I believe that the proposed changes are backward-compatible.
Change https://go.dev/cl/416555 mentions this issue: errgroup: propagate panics and goexits from goroutines in the group
(CC @neild @rogpeppe @kevinburke from CL 134395.)
Want to try writing out the doc update for Group
, Wait
, and Go
?
See https://go.dev/cl/416555 for a draft. 😁
I'm not sure about the compatibility claim. If the program was exiting due to unrecovered panic before, now it's not, and it might be in a more broken state as it keeps running.
I'd feel better if this was only about panic and not Goexit. Nothing should be using Goexit. Is the problem that errgroup is being used and the functions call t.Fatal or t.FailNow?
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
If the program was exiting due to unrecovered panic before, now it's not, and it might be in a more broken state as it keeps running.
That's true, although in general the program won't continue running for long anyway: if the group has an associated Context
it will be canceled when the panic occurs, and most errgroup
uses should either perform bounded work (in which case the work done after the panic likely could have been plausibly scheduled before the panic anyway) or be responsive to cancellation.
I suppose that one alternative would be to propagate the panic to Wait
as soon as it occurs instead of waiting for the other goroutines in the group to complete, but then if that panic is itself recovered that would leave the remaining goroutines running in an orphaned state. That seems even more dangerous to me.
Nothing should be using Goexit. Is the problem that errgroup is being used and the functions call t.Fatal or t.FailNow?
Not that it is, but that it isn't — errgroup
could be a really useful way to coordinate, say, goroutines running test-helpers (such as mock servers) in a way that cleanly isolates them between independent tests.
But today errgroup
is only of limited use in tests, because the callbacks passed to errgroup
cannot safely invoke any test-helper that might call Skip
or Fatal
.
It seems plausible, although it also seems plausible that it might break things. It would be nice not to incur the expense of recording the stack separately if that's possible - commented on the CL. Because it is in x/sync it is easier to roll back if we do find that it is a breaking change, so it may be OK to move forward.
Does anyone object to this change?
Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group
There were a couple objections on the original CL 134395, mostly relating to the complexity of the "double defer sandwich" technique it used, which appears to still be used in the new CL. Have those objections been withdrawn?
It is not clear to me how this new proposal differs from the old draft CL.
So recovering and propagating such a panic can only change broken programs into non-broken ones; it cannot break any program that was not already broken.
This depends on the notion of "broken" we use. It is certainly correct if we use "broken" to mean "crashing". But changing a crashing program into a non-crashing (but still broken) one is often worse. It essentially changes a crash-failure model into a byzantine-failure model, which is harder to reason about.
I don't know how often this will be relevant in practice. But as staunch a defender of "panic should crash the program" I feel a bit queasy. I would not be happy if I ever happened on code which I intended to crash which instead continues to run in a broken state.
That's true, although in general the program won't continue running for long anyway:
I disagree with this. If the errgroup is used in an http.Handler
, that handler will recover the panic, if you propagate it from Wait
. Similar for gRPC, I think. Given how frequent this kind of program is, I would argue that probably in most cases, the panic won't terminate the program after this transformation.
@magical, the comments on the original CL are code-review comments. This is a proposal, and part of the point of the proposal process is to resolve the sorts of design questions raised on that CL.
@Merovius, I can't reasonably consider a Go program that exits by unhandled-panic to be anything other than “broken”.
That said, I agree that it's important for Go users to be able to diagnose broken programs, because we all can and do make mistakes. And you're correct that propagating panics in this way would convert some (unknown) fraction of programs that currently fail by unhandled-panic to instead fail by deadlock, which I agree is often harder to diagnose.
Given how frequent this kind of program is, I would argue that probably in most cases, the panic won't terminate the program after this transformation.
Under this proposal, the panic
would potentially be delayed in time but (unlike in the case of fmt
and net/http
) not permanently swallowed unless something upstream of the errgroup
swallows it. So the behavior of the panic
would still be about as reliable as panic
ever is in a Go program: it would still terminate the program provided that the goroutines in the group eventually exit, and today those other goroutines can already continue running while the panic is happening (they just tend not to do so for long).
Some subset of programs that would have terminated by panic would now recover from that panic (due to a recover
in some upstream caller) and continue to run with the failure successfully contained. Some subset would now terminate by a different path (for example, a recover()
and log.Fatal
at the root of the handler's call graph instead of an unhandled panic in a leaf call). And some subset would deadlock or leak resources for longer than they previously would have.
However, it's worth pointing out that all of those subsets are contained within the broader set of “programs affected by unexpected-panic bugs”, and — especially now that we have native fuzzing as of Go 1.18! — those unexpected-panic bugs are themselves tractable to uncover and fix through testing.
It's also worth noting that the only thing that could hold up the panic
from propagating is some goroutine in the group that is either blocked or still actively running, and in both of those cases that goroutine will continue to show up in a goroutine dump. That is markedly different from, say, the sort of deadlock that you get when a panic
occurs unexpectedly before a call to (*sync.Mutex).Unlock
, where the goroutine that was holding the lock disappears entirely.
@rsc and I discussed the implementation a bit last week, and we think it's possible to structure the panic-recovery such that the panicking goroutine remains alive (but blocked) until the panic has propagated to Wait
. That would make deadlocks even more straightforward to diagnose, because a goroutine dump of a deadlocked program would show the panicking goroutine's stack. Would that help to address your concerns?
While I agree that this change is quite desirable, I'm also not convinced by the backward compatibility of such a change.
Nowhere in the errgroup docs we say that Wait()
must be called. Under the current proposal, AFAICT, the panic would be silently swallowed if Wait()
was never called.
One could argue that not calling Wait()
is a weird use of errgroup, and I would agree. But even if we overlook that specific problem, there is nothing that says that Wait()
must be called "soon", so for example consider:
func someWorker(ch <-chan Something) error {
eg, ctx := errgroup.WithContext(context.Background())
for _, s := range ch {
eg.Go(s.Fn(ctx))
}
return eg.Wait()
}
This usage is perfectly legitimate but would delay the panic by a potentially unbounded amount of time (when the channel is closed, that could happen e.g. when the worker is shutting down gracefully at process exit). We could say that not just Wait
but also Go
and TryGo
should panic in this case... but it's still a bit icky because there's no guarantee that new items will arrive on the channel (and that therefore Go
/TryGo
are called).
As I wrote, I am in almost complete agreement this would be desirable - maybe in a v2 of the library? or as an opt-in behavior in the current version? - but in v1 I would argue changing the default behavior may silently break valid assumptions made by users in potentially very dangerous ways.
@bcmills To me, while both "concurrently running goroutines will continue to execute while a panic unwinds the stack" and "the program will continue to run when something upstack from Wait
recovers the panic" both create a situation where a program runs in an undefined state for some amount of time, the quantitative difference between these two is large enough to be a qualitative one.
I'm not vehemently opposed to this change. I think the cases where it would lead to real, practical problems are very rare. But I don't think the change in behavior can just be swiped away as "eh, it's broken anyways".
Moreover, I'm personally not convinced that this is a good change in and off itself.
CL 131815 pointed out that calls to
t.Fatal
and/ort.Skip
within aGroup
in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.
I would argue that this is simply observing that calling t.Fatal
from a goroutine is a bug and that this bug is hard to diagnose in some cases. The fact that it is a bug, is called out in the docs.
This change would make it not-a-bug in some cases, but we certainly can't make it not-a-bug in all cases. So the restriction would have to stay in the docs and people will continue to have to be aware of it. Arguably, to someone who is aware of it, t.Skip
working when done in an errgroup
is making the program behave less as expected, not more.
In my GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and
runtime.Goexit
calls) back to the caller's goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)
I don't understand the scenario of concern.
- Is the scenario that a library author takes a callback (e.g. as an interface) from the user and moves to calling that (multiple times) from an
errgroup
? Because in that case, this change isn't an "internal detail" anyways, concurrent calls can introduce races, so if a callback can be called multiple times concurrently, that has to be part of the API. - Is the scenario that a library author calls library code concurrently and, by doing that change, changes from a recovering program to a crashing program? In this case, yes, the behavior would change, but it would only do so because of a bug in the library. I feel that "maintaining backwards compatibility in the presence of bugs" is a bit of an unrealistic stretch goal.
- Is the scenario that a library user moves from calling the library sequentially to calling it concurrently? In that case, yes, the behavior of the program would change if the library panics, but that seems hardly the concern of the library author and whether concurrency is internal to the library.
- So, the only scenario left I can imagine is that the library author takes a callback and moves it into an
errgroup
, calling it exactly once. That seems a niche scenario. It certainly happens, but ISTM that if the library author is concerned about the behavior of their library in the presence of panics and wants to make the propagating, they can do so byrecover
ing manually, right?
In general, ISTM that there are two different mindsets regarding panic
/recover
in the community: One of "a panic should ~always crash the program" and one of "a panic should ~always be recovered on a best-effort basis". I think the core problem is there is no consistent application of either of these mindsets in the community or the standard library and it leads to the worst situation for both kinds of people. As some libraries work from one mindset and some work from the other, an operator (or the author of an end-user program) can neither rely on panics crashing the program, nor rely on panics being recovered. Neither can a library author use panic
as a reliable signal that something is wrong.
I find that frustrating. And to me, this change muddies the waters even further. But, as I said, I'm not vehemently opposed. If it happens, it happens. Without a consistent decision about which mindset to apply, the waters will stay muddy either way, this change on its own won't really move the situation much in either direction.
Under the current proposal, AFAICT, the panic would be silently swallowed if Wait() was never called.
This is an interesting point. I will leave this as 'likely accept' for another week to allow the discussion to see if we can come to an agreement about how much this matters. Thanks.
I support the addition of the "recover" behavior to the errgroup as an option.
if it was set, we can replace f()
with recoveredFn(f)()
func recoveredFn(f func() error) func() error {
return func() (err error) {
defer func() {
if r := recover(); r != nil {
buf := make([]byte, 64<<10)
buf = buf[:runtime.Stack(buf, false)]
err = fmt.Errorf("errgroup: panic recovered: %s\n%s", r, buf)
}
}()
return f()
}
}
I support the addition of the "recover" behavior to the errgroup as an option.
It seems to me that if the behavior is optional, there won't be many users who enable it.
How would the option look like? Field on errgroup? Context value? Global environment variable? Something else?
Field on the errgroup requires the user to add code at the errgroup call site. If you need to do that, you can also directly wrap the worker function to recover any panics. This option also leaves any third-party code from libraries with the behavior disabled.
You could put a value into context to control the behavior. This would work across library boundaries. However, using context feels like a hack to me as it changes the behavior of the program. Also there are errgroups created without passing context.
Global environment variable would control the behavior for a single run of the program and would work across library boundaries. However, if you want to have an option to decide the behavior per-program, why not make it compile time - fork x/sync and replace the Go module with a custom version of the errgroup?
Moving back to active. I think the Wait issue needs more discussion and an explicit decision.
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
@CAFxX
Nowhere in the errgroup docs we say that
Wait()
must be called. Under the current proposal, AFAICT, the panic would be silently swallowed ifWait()
was never called.
That's true, but until very recently (https://go.dev/cl/405174 / #27837) the only point to using an errgroup
without calling Wait
would be in order to cancel an associated Context
on error. (After that, the other point to using an errgroup
without calling Wait
might be as a simple semaphore.)
even if we overlook that specific problem, there is nothing that says that
Wait()
must be called "soon"
That's true, although I think the range ch
loop is also fairly contrived. A more realistic example involving errgroup.WithContext
should probably stop adding work when the associated Context
is cancelled, which would occur as soon as the panic is detected.
I'd be more worried about the cases that use the zero errgroup.Group
, but that suggests another alternative: since errgroup.WithContext
has a way to report a pending failure to the other workers in the group (by cancelling the Context
), perhaps it should propagate panics — but the zero errgroup.Group
, lacking such a mechanism, should not?
@Merovius
I would argue that this is simply observing that calling
t.Fatal
from a goroutine is a bug and that this bug is hard to diagnose in some cases. The fact that it is a bug, is called out in the docs.
If this proposal is accepted, then I intend to file a separate issue to clarify the testing
documentation. There is no fundamental reason why t.FailNow
and t.SkipNow
and the methods that wrap them should dictate users' goroutine structure: instead, they should describe exactly what they do (mark the test as failed / skipped and then terminate the current goroutine by calling runtime.Goexit
), and perhaps suggest a simple default practice (call Fatal
from the main goroutine), but leave the detailed implications up to the user (that is: make simple uses simple, and complex uses possible).
I don't understand the scenario of concern.
The main motivating scenario is: a library accepts a callback and executes it one or more times sequentially. The library is changed to also do some other concurrent work. The concurrency properties of the callback are otherwise unchanged: all of the happens-before relationships that used to exist continue to do so. Why should the callback need to dictate whether it is called on the “main” goroutine or some other goroutine?
In general, ISTM that there are two different mindsets regarding
panic
/recover
in the community: One of "a panic should ~always crash the program" and one of "a panic should ~always be recovered on a best-effort basis". I think the core problem is there is no consistent application of either of these mindsets in the community or the standard library and it leads to the worst situation for both kinds of people.
I agree. However, I would argue that that conflict leads to a clear conclusion for library authors: a library cannot presume either behavior. A library must not panic in normal operation (because the panic might crash the program), but also must not assume that a panic from (or through) the library will not be recovered (so, for example, it must defer
any unlock operations before executing any user-provided callbacks).
The purpose of this proposal is to make it easier to use errgroup
to avoid making unwarranted assumptions. Today, a library that uses errgroup
implicitly takes the position that “a panic should always crash the program”, and the user has to write a fair bit of boilerplate to obtain any other behavior. However, if this proposal is accepted, then errgroup
will be agnostic to recovery behavior: the panic will neither immediately crash the program nor be permanently recovered and buried, but will instead adopt the recovery behavior of the caller of Wait
, whatever that may be. (The user would still be free to take a different position by calling recover
in the Go
callbacks, but the default behavior would no longer bias toward “crash unrecoverably”.)
Keying off the context doesn't seem like the right way to define semantics here.
Given that x/sync/errgroup is in x and can be rolled back easily, it seems like maybe we should try the change, with the "breakage", and see if anyone runs into problems. If they do, we'll know that we can't move forward with this change. And otherwise, we'll have done it.
Thoughts?
Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group
Under the current proposal, AFAICT, the panic would be silently swallowed if Wait() was never called.
This is an interesting point. I will leave this as 'likely accept' for another week to allow the discussion to see if we can come to an agreement about how much this matters. Thanks.
To at least partially address the "not calling Wait
silently swallows the panic" we could take a page out of the runtime.Pinner
playbook: if Go
/TryGo
recovers a panic, and Wait
is not called before the errgroup is GCed, panic in the finalizer. (Not sure how much of a good idea this would be, just bringing it up for discussion).
This won't help in case the GC is disabled, or in some of the other issues raised (my worker loop example, or its extension to the zero errgroup) but it would at least help in more ordinary cases.
Alternatively we could document that Wait must be called, or the panic will be silently swallowed. And also that the panic will be delayed until the Wait returns, so new work submitted may start executing even if a work item has already paniced.
I'd be more worried about the cases that use the zero errgroup.Group, but that suggests another alternative: since errgroup.WithContext has a way to report a pending failure to the other workers in the group (by cancelling the Context), perhaps it should propagate panics — but the zero errgroup.Group, lacking such a mechanism, should not?
For the zero errgroup especially I can't say what would be the least surprising behavior, but I would argue that not propagating the panic would be quite surprising.
Let's not drag finalizers in. If this turns out to be a real problem, we should probably think about undoing the change.
@aclements points out that we could also do a vet check for errgroups that are declared but Wait is never called.
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group
What's the status of this?
The proposal has been accepted and there is a work-in-progress change at https://go.dev/cl/416555.