x/sys/unix: panic when using solaris Event Ports
CAVEAT
This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is. It was introduced while attempting to fix a different problem in https://github.com/golang/sys/commit/594fa53f0001ef3fa9e016f148230ddd59727649 as worked on in https://go-review.googlesource.com/c/sys/+/380034
What version of Go are you using (go version)?
$ go version
go version go1.19 solaris/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
warning: GOPATH set to GOROOT (/home/nshalman/go/) has no effect
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/nshalman/.cache/go-build"
GOENV="/home/nshalman/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="solaris"
GOINSECURE=""
GOMODCACHE="/home/nshalman/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="solaris"
GOPATH="/home/nshalman/go/"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/nshalman/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/nshalman/go/pkg/tool/solaris_amd64"
GOVCS=""
GOVERSION="go1.19"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/nshalman/sys/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3559396907=/tmp/go-build -gno-record-gcc-switches"
What did you do?
See https://github.com/fsnotify/fsnotify/pull/371#issuecomment-1204736359
git clone https://github.com/nshalman/fsnotify -b fen-v2
cd fsnotify
mkdir deleteme
go run ./cmd/fsnotify deleteme/ &
for i in {1..500}; do touch deleteme/$i; done;
What did you expect to see?
No panics
What did you see instead?
panic: mismanaged memory
goroutine 33 [running]:
golang.org/x/sys/unix.(*EventPort).peIntToExt(0xc000200150, 0xc0004f52c0, 0xc000105da8)
/home/nshalman/go/pkg/mod/golang.org/x/[email protected]/unix/syscall_solaris.go:959 +0x265
golang.org/x/sys/unix.(*EventPort).Get(0xc000200150, {0xc000105da8, 0x8, 0x0?}, 0x0?, 0x0?)
/home/nshalman/go/pkg/mod/golang.org/x/[email protected]/unix/syscall_solaris.go:1002 +0x2e9
github.com/fsnotify/fsnotify.(*Watcher).readEvents(0xc000200060)
/home/nshalman/deleteme/fsnotify/fen.go:148 +0x10a
created by github.com/fsnotify/fsnotify.NewWatcher
/home/nshalman/deleteme/fsnotify/fen.go:47 +0x165
exit status 2
Reproduction is hard and inconsistent in how long it takes. Here's a test case that can eventually trigger it at least sometimes:
diff --git a/unix/syscall_solaris_test.go b/unix/syscall_solaris_test.go
index c2b28be..d597355 100644
--- a/unix/syscall_solaris_test.go
+++ b/unix/syscall_solaris_test.go
@@ -12,7 +12,9 @@ import (
"io/ioutil"
"os"
"os/exec"
+ "path/filepath"
"runtime"
+ "sync"
"testing"
"golang.org/x/sys/unix"
Then add this test:
func TestEventPortMemoryStress(t *testing.T) {
path, err := os.MkdirTemp("", "eventport")
if err != nil {
t.Fatalf("unable to create a tempdir: %v", err)
}
defer os.RemoveAll(path)
stat, err := os.Stat(path)
if err != nil {
t.Fatalf("Failed to stat %s: %v", path, err)
}
port, err := unix.NewEventPort()
if err != nil {
t.Fatalf("NewEventPort failed: %v", err)
}
defer port.Close()
cookie := stat.Mode()
err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
if err != nil {
t.Errorf("AssociatePath failed: %v", err)
}
if !port.PathIsWatched(path) {
t.Errorf("PathIsWatched unexpectedly returned false")
}
c := make(chan int)
done := make(chan bool)
var mu sync.Mutex
go func(c chan int, done chan bool) {
for {
_, err = port.GetOne(nil)
if err != nil {
t.Errorf("GetOne failed: %v", err)
}
mu.Lock()
err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
mu.Unlock()
select {
case <-done:
return
default:
if err != nil {
t.Errorf("AssociatePath failed: %v", err)
}
}
c <- 1
}
}(c, done)
iterations := 500000
for i := 0; i < iterations; i++ {
mu.Lock()
file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
if err != nil {
t.Fatalf("unable to create files in %s: %v", path, err)
}
file.Close()
mu.Unlock()
}
var sum int
for i := 0; i < iterations; i++ {
sum += <-c
}
done <- true
if sum != iterations {
t.Errorf("didn't get all %d events", iterations)
}
}
It usually takes many seconds before it triggers, e.g.:
$ go test -run TestEventPortMemoryStress
panic: mismanaged memory
goroutine 19 [running]:
golang.org/x/sys/unix.(*EventPort).peIntToExt(0xc00010a870, 0xc000068ed0, 0xc000114800)
/home/nshalman/sys/unix/syscall_solaris.go:959 +0x265
golang.org/x/sys/unix.(*EventPort).GetOne(0xc00010a870, 0xc000112018?)
/home/nshalman/sys/unix/syscall_solaris.go:932 +0xeb
golang.org/x/sys/unix_test.TestEventPortMemoryStress.func1(0xc00010e820?, 0x5ad410?)
/home/nshalman/sys/unix/syscall_solaris_test.go:368 +0xb4
created by golang.org/x/sys/unix_test.TestEventPortMemoryStress
/home/nshalman/sys/unix/syscall_solaris_test.go:366 +0x5fb
exit status 2
FAIL golang.org/x/sys/unix 27.582s
CC @golang/runtime, @ianlancetaylor.
I'm not sure what the issue is, but in your test the c channel is unbuffered. As a result, the AssociatePath goroutine's loop only runs once concurrently with the os.Create loop (because c <- 1 will block until sum += <-c, which happens after the os.Create loop).
Is that intentional? It looks like you want those goroutines to race as much as possible. If so, eliminating c in favor of a sync.WaitGroup may make this reproduce more readily.
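A minimal sketch of that sync.WaitGroup suggestion, reusing the setup from the test above (the test name and structure here are my own, not code from this thread): wg.Done never blocks, so the event-draining goroutine keeps racing the os.Create loop instead of parking on an unbuffered channel send after its first iteration.

// Sketch of the sync.WaitGroup variant: wg.Done never blocks, so the
// event-draining goroutine is not serialized behind the os.Create loop.
func TestEventPortMemoryStressWaitGroup(t *testing.T) {
	path, err := os.MkdirTemp("", "eventport")
	if err != nil {
		t.Fatalf("unable to create a tempdir: %v", err)
	}
	defer os.RemoveAll(path)
	stat, err := os.Stat(path)
	if err != nil {
		t.Fatalf("Failed to stat %s: %v", path, err)
	}
	port, err := unix.NewEventPort()
	if err != nil {
		t.Fatalf("NewEventPort failed: %v", err)
	}
	defer port.Close()
	cookie := stat.Mode()
	if err := port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie); err != nil {
		t.Errorf("AssociatePath failed: %v", err)
	}
	iterations := 500000
	var wg sync.WaitGroup
	wg.Add(iterations)
	var mu sync.Mutex
	go func() {
		for i := 0; i < iterations; i++ {
			if _, err := port.GetOne(nil); err != nil {
				t.Errorf("GetOne failed: %v", err)
			}
			mu.Lock()
			if err := port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie); err != nil {
				t.Errorf("AssociatePath failed: %v", err)
			}
			mu.Unlock()
			wg.Done() // never blocks, unlike c <- 1
		}
	}()
	for i := 0; i < iterations; i++ {
		mu.Lock()
		file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
		if err != nil {
			t.Fatalf("unable to create files in %s: %v", path, err)
		}
		file.Close()
		mu.Unlock()
	}
	wg.Wait()
}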
FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954, uintptr(unsafe.Pointer(fCookie.fobj)) == uintptr(peInt.Object) is a violation of the unsafe.Pointer requirements. It is not safe to store a Go pointer as a uintptr or uint64.
That said, I don't immediately see how that would cause problems, as the EventPort.cookies map is keeping fobj alive, and we don't move objects.
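To illustrate the rule being cited (an illustration under my own assumptions, not code from x/sys): the garbage collector traces unsafe.Pointer values like any other pointer, but a uintptr (or uint64) is just an integer, so it keeps nothing alive and cannot be safely converted back later.

package main

import "unsafe"

type fileObj struct{ name string }

// Fine: the GC sees this reference and keeps the object reachable.
var safeRef unsafe.Pointer

// Not fine: once only this integer remains, the GC may collect the object,
// and converting the integer back to a pointer later is invalid.
var unsafeRef uintptr

func main() {
	f := &fileObj{name: "watched"}
	safeRef = unsafe.Pointer(f)
	unsafeRef = uintptr(unsafe.Pointer(f)) // violates the unsafe.Pointer rules if retained
}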
For posterity I'm updating this comment with the version of the test that I ended up using to further debug this issue and to test the fix.
func TestEventPortMemoryStress(t *testing.T) {
path, err := os.MkdirTemp("", "eventport")
if err != nil {
t.Fatalf("unable to create a tempdir: %v", err)
}
defer os.RemoveAll(path)
stat, err := os.Stat(path)
if err != nil {
t.Fatalf("Failed to stat %s: %v", path, err)
}
port, err := unix.NewEventPort()
if err != nil {
t.Fatalf("NewEventPort failed: %v", err)
}
defer port.Close()
iterations := 100000
for i := 0; i < iterations; i++ {
cookie := fmt.Sprintf("cookie %d", i)
err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
if err != nil {
t.Errorf("AssociatePath failed: %v", err)
}
if !port.PathIsWatched(path) {
t.Errorf("PathIsWatched unexpectedly returned false")
}
file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
if err != nil {
t.Fatalf("unable to create files in %s: %v", path, err)
}
file.Close()
err = os.Remove(filepath.Join(path, fmt.Sprintf("%d", i)))
if err != nil {
t.Errorf("os.Remove failed: %v", err)
}
_, err = port.GetOne(nil)
if err != nil {
t.Errorf("GetOne failed: %v", err)
}
}
}
Yeah... If I run that test with GOGC=off it never panics. Either I'm not creating the references the garbage collector needs so that it understands what I'm trying to do (more likely), or maybe there's a garbage collector bug (less likely).
GOGC=off go test -v -count 30 -run TestEventPortMemoryStress ran to completion for me (n=1). Without GOGC=off I have yet to get through 30 iterations (n > 10).
FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954, uintptr(unsafe.Pointer(fCookie.fobj)) == uintptr(peInt.Object) is a violation of the unsafe.Pointer requirements. It is not safe to store a Go pointer as a uintptr or uint64. That said, I don't immediately see how that would cause problems, as the EventPort.cookies map is keeping fobj alive, and we don't move objects.
Based on the discovery that the panics are pretty clearly related to garbage collection, I think you're onto something.
Is there a way I could change the code to not violate the unsafe.Pointer requirements?
I thought all of these maps were safe (the key in fds is an actual file descriptor, not a pointer...):
fds map[uintptr]*fileObjCookie
paths map[string]*fileObjCookie
cookies map[*interface{}]*fileObjCookie
One thing that I appear to be doing that could maybe be changed is that I'm working pretty hard to pass the pointer to the exact thing that the user gave me to the call to port_associate, but maybe I'm trying too hard. As long as I pass in some pointer that is not going to get garbage collected, and as long as I can figure out what it points to when I get it back, I can then return to the user whatever they asked me to associate...
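Roughly, that idea looks like the following sketch (hypothetical names loosely modeled on the ones in syscall_solaris.go; this is not the actual implementation): hand port_associate a pointer to a wrapper the package allocates, keep that wrapper reachable from a map on the port, and translate back to the caller's value when the pointer comes out of port_get.

package eventport

import "sync"

// fileObjCookie wraps whatever the caller asked to associate. The port
// only ever hands the kernel a pointer to one of these wrappers.
type fileObjCookie struct {
	cookie interface{}
}

// EventPort keeps every outstanding wrapper reachable so the GC cannot
// collect it while the kernel still holds its address.
type EventPort struct {
	mu      sync.Mutex
	cookies map[*fileObjCookie]struct{}
}

func newEventPort() *EventPort {
	return &EventPort{cookies: make(map[*fileObjCookie]struct{})}
}

// register records a new wrapper before its address is passed (as an
// unsafe.Pointer) to port_associate.
func (p *EventPort) register(userCookie interface{}) *fileObjCookie {
	fc := &fileObjCookie{cookie: userCookie}
	p.mu.Lock()
	p.cookies[fc] = struct{}{}
	p.mu.Unlock()
	return fc
}

// lookup translates the wrapper pointer that came back out of port_get into
// the caller's original value and forgets the wrapper so it can be collected.
func (p *EventPort) lookup(fc *fileObjCookie) (interface{}, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.cookies[fc]; !ok {
		return nil, false
	}
	delete(p.cookies, fc)
	return fc.cookie, true
}

Because the map holds an ordinary Go pointer to each wrapper, the garbage collector keeps the wrappers alive for as long as the association is outstanding, regardless of what the kernel does with the raw address.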
Well, I tried simplifying things. While that makes all the code way easier to read, and the tests pass with GOGC=off, the tests still fail when the garbage collector is allowed to run.
For now, my code is in a GitHub branch here: https://github.com/golang/sys/compare/master...nshalman:sys:simplify-and-test
The simplification looks like a good idea to me, but it doesn't solve the problem. Maybe there really is a GC bug?! I'm happy to open a CR for the simplification if that would be helpful.
I thought all of these maps were safe (the key in fds is an actual file descriptor, not a pointer...):
fds map[uintptr]*fileObjCookie
paths map[string]*fileObjCookie
cookies map[*interface{}]*fileObjCookie
These are all fine. Storing as *fileObjCookie, *interface{} (though a bit odd), and unsafe.Pointer are all fine.
Casting to uintptr (even temporarily) is a problem because the GC loses track of the object, and if it is not otherwise kept alive the GC may free the object while it is still in use.
This code is full of unsafe uintptr casts. E.g., https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=833-834 casts fobj to uintptr in the argument to port_associate, which is not safe [1]. Yet fCookie references fobj and is used later in the function, so it should be keeping the object alive.
Given the amount of uintptr code here, I suspect that one of these cases fails to keep an object alive and is causing the problem. If you have trouble determining the problematic spots, one debugging technique would be to add runtime.KeepAlive calls for any potentially problematic objects (see the sketch after the footnote below).
[1] Note: you've probably seen case 4 in unsafe.Pointer about casting to uintptr in the argument to syscall functions. That rule only applies to assembly functions that have uintptr arguments, or Go functions with //go:uintptrkeepalive (such as syscall.Syscall on Linux). port_associate is neither, so fobj could die during the function call.
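For concreteness, a small sketch of the runtime.KeepAlive technique mentioned above (rawCall, cookie, and associate are invented stand-ins, not the real x/sys wrappers):

package main

import (
	"runtime"
	"unsafe"
)

type cookie struct{ payload string }

// rawCall stands in for any function (like a syscall wrapper) that only
// receives the object's address as an integer the GC cannot trace.
func rawCall(addr uintptr) { _ = addr }

func associate(c *cookie) {
	rawCall(uintptr(unsafe.Pointer(c)))
	// KeepAlive marks c as reachable until this point, so it cannot be
	// collected before rawCall returns even if nothing else references it.
	runtime.KeepAlive(c)
}

func main() {
	associate(&cookie{payload: "watched"})
}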
I suspect it would also be helpful to know, in the panic("mismanaged memory") case, whether the cookie is missing from the map or the pointers are mismatched (and if so, what their values are; do they look like valid pointers?).
I've pushed up a commit with some heavy-handed debugging to answer those questions at https://github.com/nshalman/sys/commit/1bbdf4655b7477087d48f28f5cf216d4a46861de
Again, GOGC=off go test completes with no issues. A plain go test will now pretty reliably trigger the error message, but I find it particularly baffling most of the time:
$ go test
Cookie received from port_get at 0xc000492750: cookie 42866
A cookie in the jar at 0xc000492000: cookie 42886
Last cookie port_associate'd at 0xc000492000: cookie 42886
Address at port_associate 0xc000492000
--- FAIL: TestEventPortMemoryStress (0.58s)
panic: mismanaged memory [recovered]
panic: mismanaged memory
<snip>
$ go test
Cookie received from port_get at 0xc00046acc0: cookie 34760
A cookie in the jar at 0xc00046a000: cookie 34775
Last cookie port_associate'd at 0xc00046a000: cookie 34775
Address at port_associate 0xc00046a000
--- FAIL: TestEventPortMemoryStress (0.55s)
panic: mismanaged memory [recovered]
panic: mismanaged memory
<snip>
So it looks like port_get is getting back a different address than the one that was provided to port_associate, but again, this odd behavior only happens when the garbage collector is enabled.
However, on one run I saw the most confusing output of all:
$ go test
Cookie received from port_get at 0xc0004306f0: cookie 61295
A cookie in the jar at 0xc000430000: cookie 61295
Last cookie port_associate'd at 0xc000430000: cookie 61295
Address at port_associate 0xc000430000
--- FAIL: TestEventPortMemoryStress (0.87s)
panic: mismanaged memory [recovered]
panic: mismanaged memory
<snip>
Instead of getting back the wrong cookie, I somehow got the right cookie but at a different address.
The fact that the address differs between port_associate and port_get, and that the behavior is triggered by the garbage collector, feels like maybe it's an OS bug rather than a Go bug, but I really don't know.
Can we add the OS-illumos tag as well, since that's where I'm doing my development and testing? CC @jclulow @4ad
Had a brief discussion with @rmustacc on IRC who helped me come up with some dtrace to look a little closer at what's going on from the OS side.
fbt:portfs:port_associate_fop:entry
{
printf(" in: %p->%p", args[2], args[4]);
}
fbt::port_copy_event:entry
/args[1]->portkev_source == 7/
{
printf("out: %x->%p\n", args[1]->portkev_object, args[1]->portkev_user);
}
fbt::port_pfp_setup:entry
{
printf(" in: %x->%p", args[4], args[6]);
}
14 70111 port_associate_fop:entry in: c000130000->c000016090
14 70103 port_pfp_setup:entry in: c000130000->c000016090
14 69995 port_copy_event:entry out: c000130000->c000016090
14 70111 port_associate_fop:entry in: c000130050->c0000a0780
14 70103 port_pfp_setup:entry in: c000130050->c0000a0780
14 69995 port_copy_event:entry out: c000130050->c0000a0780
14 70111 port_associate_fop:entry in: c00051e000->c000016060
14 69995 port_copy_event:entry out: c00051e000->c000516b70
I think the lack of a port_pfp_setup:entry in that last group of output is probably significant.
Cookie received from port_get at 0xc000516b70: cookie 5796
A cookie in the jar at 0xc000016060: cookie 5814
Last cookie port_associate'd at 0xc000016060: cookie 5814
Address at port_associate 0xc000016060
--- FAIL: TestEventPortMemoryStress (0.11s)
panic: mismanaged memory [recovered]
panic: mismanaged memory
I've filed https://www.illumos.org/issues/14898 in case this is indeed a bug on the illumos end.
This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is.
Oh sweet summer child...
I'm now convinced that this is https://www.illumos.org/issues/14898 rather than an issue with the code in x/sys/unix. I have tested https://code.illumos.org/c/illumos-gate/+/2299 (as of patchset 1) and it made this problem go away for me.
Since it will take a while for that fix to land, and longer for it to make it out to all machines in production, I'm inclined to update the panic message to reference this issue. Perhaps something along the lines of:
panic("unexpected address received from event port (issue #54254)")
I think there is a nontrivial chance that once fsnotify/fsnotify#371 lands (which this bug was preventing me from feeling comfortable landing), illumos and Solaris (which might have this same issue) systems might start tripping over this panic and I want to provide users with a hint of what to do about it.
Thank you, everyone, for your help in narrowing down the issue. Your feedback on the panic message is requested so that I know what to put into the CR.
Change https://go.dev/cl/422099 mentions this issue: unix: Improve solaris event port panic message
I've updated the description at the top of this issue to include some helpful text to anyone directed here from the error message proposed in https://go.dev/cl/422099
I've tweaked the language in the description one more time now that golang/sys@c680a09 has landed. Thank you to everyone!
Sorry to drive by this issue with a tangential question, but was anything determined about the uintptr casts that @prattmic mentioned above might be unsafe? I'm digging a bit into #53289 and several apparently related failure modes (plus https://github.com/oxidecomputer/omicron/issues/1130), and I'm wondering if they could be related to this one.
@davepacheco The code was rewritten (https://go.dev/cl/422338) to avoid the unsafe conversions.
The exact rules can be found at https://pkg.go.dev/unsafe#Pointer.
Thanks -- that's helpful! Do you know what release(s) that will wind up in (or is there some way for me to figure that out)?
Sorry, I'm not sure exactly what you are asking. If you are asking about the changes to golang.org/x/sys/unix, that package doesn't currently have a release system. We expect people to update the package (go get -u) as needed to pick up new features and bug fixes.
(Systematic releases for the golang.org/x repos are tracked in #21324, but nobody is thinking about that as far as I know.)
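For example, from within a module that depends on the package, something like the following picks up the latest fix (a generic usage example, not project-specific instructions):

go get -u golang.org/x/sys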
My mistake. I thought this package was tied to the Go runtime and so needed a new Go release to get the fix. (Sorry for my ignorance here...I'm a user of software written in Go but I haven't worked with Go much myself.)
@davepacheco No worries. (For future notice, see https://go.dev/wiki/Questions for good places to ask questions about Go.)