go icon indicating copy to clipboard operation
go copied to clipboard

x/sys/unix: panic when using solaris Event Ports

Open nshalman opened this issue 1 year ago • 12 comments

CAVEAT

This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is. It was introduced while attempting to fix a different problem in https://github.com/golang/sys/commit/594fa53f0001ef3fa9e016f148230ddd59727649 as worked on in https://go-review.googlesource.com/c/sys/+/380034

What version of Go are you using (go version)?

$ go version
go version go1.19 solaris/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
warning: GOPATH set to GOROOT (/home/nshalman/go/) has no effect
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/nshalman/.cache/go-build"
GOENV="/home/nshalman/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="solaris"
GOINSECURE=""
GOMODCACHE="/home/nshalman/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="solaris"
GOPATH="/home/nshalman/go/"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/nshalman/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/nshalman/go/pkg/tool/solaris_amd64"
GOVCS=""
GOVERSION="go1.19"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/nshalman/sys/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3559396907=/tmp/go-build -gno-record-gcc-switches"

What did you do?

See https://github.com/fsnotify/fsnotify/pull/371#issuecomment-1204736359

git clone https://github.com/nshalman/fsnotify -b fen-v2
cd fsnotify
mkdir deleteme
go run ./cmd/fsnotify deleteme/ &
for i in {1..500}; do touch deleteme/$i; done;

What did you expect to see?

No panics

What did you see instead?

panic: mismanaged memory

goroutine 33 [running]:
golang.org/x/sys/unix.(*EventPort).peIntToExt(0xc000200150, 0xc0004f52c0, 0xc000105da8)
        /home/nshalman/go/pkg/mod/golang.org/x/[email protected]/unix/syscall_solaris.go:959 +0x265
golang.org/x/sys/unix.(*EventPort).Get(0xc000200150, {0xc000105da8, 0x8, 0x0?}, 0x0?, 0x0?)
        /home/nshalman/go/pkg/mod/golang.org/x/[email protected]/unix/syscall_solaris.go:1002 +0x2e9
github.com/fsnotify/fsnotify.(*Watcher).readEvents(0xc000200060)
        /home/nshalman/deleteme/fsnotify/fen.go:148 +0x10a
created by github.com/fsnotify/fsnotify.NewWatcher
        /home/nshalman/deleteme/fsnotify/fen.go:47 +0x165
exit status 2

nshalman avatar Aug 04 '22 13:08 nshalman

Reproduction is hard and inconsistent in how long it takes. Here's a test case that can eventually trigger it at least sometimes:

diff --git a/unix/syscall_solaris_test.go b/unix/syscall_solaris_test.go
index c2b28be..d597355 100644
--- a/unix/syscall_solaris_test.go
+++ b/unix/syscall_solaris_test.go
@@ -12,7 +12,9 @@ import (
        "io/ioutil"
        "os"
        "os/exec"
+       "path/filepath"
        "runtime"
+       "sync"
        "testing"

        "golang.org/x/sys/unix"

Then add this test

func TestEventPortMemoryStress(t *testing.T) {
        path, err := os.MkdirTemp("", "eventport")
        if err != nil {
                t.Fatalf("unable to create a tempdir: %v", err)
        }
        defer os.RemoveAll(path)

        stat, err := os.Stat(path)
        if err != nil {
                t.Fatalf("Failed to stat %s: %v", path, err)
        }
        port, err := unix.NewEventPort()
        if err != nil {
                t.Fatalf("NewEventPort failed: %v", err)
        }
        defer port.Close()
        cookie := stat.Mode()
        err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
        if err != nil {
                t.Errorf("AssociatePath failed: %v", err)
        }
        if !port.PathIsWatched(path) {
                t.Errorf("PathIsWatched unexpectedly returned false")
        }

        c := make(chan int)
        done := make(chan bool)
        var mu sync.Mutex
        go func (c chan int, done chan bool) {
                for {
                        _, err = port.GetOne(nil)
                        if err != nil {
                                t.Errorf("GetOne failed: %v", err)
                        }
                        mu.Lock()
                        err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
                        mu.Unlock()
                        select {
                        case _, _ = <-done:
                                return
                        default:
                                if err != nil {
                                        t.Errorf("AssociatePath failed: %v", err)
                                }
                        }
                        c <- 1
                }
        } (c, done)

        iterations := 500000
        for i := 0; i < iterations; i++ {
                mu.Lock()
                file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
                        if err != nil {
                                t.Fatalf("unable to create files in %s: %v", path, err)
                        }
                file.Close()
                mu.Unlock()
        }
        var sum int
        for i := 0; i < iterations; i++ {
                sum += <-c
        }
        done <- true
        if sum != iterations {
                t.Errorf("didn't get all %d events", iterations)
        }
}

It usually takes many seconds before it triggers, e.g.:

$ go test -run TestEventPortMemoryStress
panic: mismanaged memory

goroutine 19 [running]:
golang.org/x/sys/unix.(*EventPort).peIntToExt(0xc00010a870, 0xc000068ed0, 0xc000114800)
        /home/nshalman/sys/unix/syscall_solaris.go:959 +0x265
golang.org/x/sys/unix.(*EventPort).GetOne(0xc00010a870, 0xc000112018?)
        /home/nshalman/sys/unix/syscall_solaris.go:932 +0xeb
golang.org/x/sys/unix_test.TestEventPortMemoryStress.func1(0xc00010e820?, 0x5ad410?)
        /home/nshalman/sys/unix/syscall_solaris_test.go:368 +0xb4
created by golang.org/x/sys/unix_test.TestEventPortMemoryStress
        /home/nshalman/sys/unix/syscall_solaris_test.go:366 +0x5fb
exit status 2
FAIL    golang.org/x/sys/unix   27.582s

nshalman avatar Aug 04 '22 14:08 nshalman

CC @golang/runtime, @ianlancetaylor.

dmitshur avatar Aug 04 '22 18:08 dmitshur

I'm not sure what the issue is, but in your test the c channel is unbuffered. As a result the AssosciatePath goroutine loops only runs once concurrently with the os.Create loop (because c <- 1 will block until sum += <-c, which is after the os.Create loop).

Is that intentional? It looks like you want those goroutines to race as much as possible? If so, eliminating c in favor of a sync.WaitGroup may make this reproduce more readily.

prattmic avatar Aug 04 '22 19:08 prattmic

FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954 uintptr(unsafe.Pointer(fCookie.fobj)) == uintptr(peInt.Object) is a violation of the unsafe.Pointer requirements. It is not safe to store a Go pointer as uintptr or uint64.

That said, I don't immediately see how that would cause problems, as the EventPort.cookies map is keeping fobj alive, and we don't move objects.

prattmic avatar Aug 04 '22 19:08 prattmic

For posterity I'm updating this comment with the version of the test that I ended up using to further debug this issue and to test the fix.

func TestEventPortMemoryStress(t *testing.T) {
        path, err := os.MkdirTemp("", "eventport")
        if err != nil {
                t.Fatalf("unable to create a tempdir: %v", err)
        }
        defer os.RemoveAll(path)

        stat, err := os.Stat(path)
        if err != nil {
                t.Fatalf("Failed to stat %s: %v", path, err)
        }
        port, err := unix.NewEventPort()
        if err != nil {
                t.Fatalf("NewEventPort failed: %v", err)
        }
        defer port.Close()


        iterations := 100000
        for i := 0; i < iterations; i++ {
                cookie := fmt.Sprintf("cookie %d", i)
                err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
                if err != nil {
                        t.Errorf("AssociatePath failed: %v", err)
                }
                if !port.PathIsWatched(path) {
                        t.Errorf("PathIsWatched unexpectedly returned false")
                }
                file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
                if err != nil {
                        t.Fatalf("unable to create files in %s: %v", path, err)
                }
                file.Close()
                err = os.Remove(filepath.Join(path, fmt.Sprintf("%d", i)))
                if err != nil {
                        t.Errorf("os.Remove failed: %v", err)
                }

                _, err = port.GetOne(nil)
                if err != nil {
                        t.Errorf("GetOne failed: %v", err)
                }
        }
}

nshalman avatar Aug 05 '22 00:08 nshalman

Yeah... If I run that test with GOGC=off it never panics. Either I'm not creating the references the garbage collector needs so that it understands what I'm trying to do (more likely) or maybe there's a garbage collector bug (less likely)

GOGC=off go test -v -count 30 -run TestEventPortMemoryStress ran to completion for me (n=1)

Without GOGC=off I have yet to get through 30 iterations. (n > 10)

nshalman avatar Aug 05 '22 01:08 nshalman

FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954 uintptr(unsafe.Pointer(fCookie.fobj)) == uintptr(peInt.Object) is a violation of the unsafe.Pointer requirements. It is not safe to store a Go pointer as uintptr or uint64.

That said, I don't immediately see how that would cause problems, as the EventPort.cookies map is keeping fobj alive, and we don't move objects.

Based on the discovery that the panics are pretty clearly related to garbage collection, I think you're onto something. Is there a way I could change the code to not violate the unsafe.Pointer requirements?

I thought all of these maps were safe (the key in fds is an actual file descriptor, not a pointer...):

	fds   map[uintptr]*fileObjCookie
	paths map[string]*fileObjCookie
	cookies map[*interface{}]*fileObjCookie

One thing that I appear to be doing that could maybe be changed is that I'm working pretty hard to pass the pointer to the exact thing that the user gave me to the call to port_associate, but maybe I'm trying too hard. As long as I pass in some pointer in that is not going to get garbage collected, as long as I can figure out what it points to when I get it back, I can then return to the user whatever they asked me to associate...

nshalman avatar Aug 05 '22 01:08 nshalman

Well, I tried simplifying things and while that makes all the code way easier to read, and the tests pass with GOGC=off, the tests still fail when the garbage collector is allowed to run.

For now, my code is in a github branch here: https://github.com/golang/sys/compare/master...nshalman:sys:simplify-and-test

The simplification looks like a good idea to me, but doesn't solve the problem. Maybe there really is a GC bug?! I'm happy to open a CR for the simplification if that will be helpful.

nshalman avatar Aug 05 '22 02:08 nshalman

I thought all of these maps were safe (the key in fds is an actual file descriptor, not a pointer...):

	fds   map[uintptr]*fileObjCookie
	paths map[string]*fileObjCookie
	cookies map[*interface{}]*fileObjCookie

These are all fine. Storing as *fileObjCookie, *interface{} (though a bit odd), and unsafe.Pointer are all fine.

Casting to uintptr (even temporarily) is a problem because the GC loses track of the object and if it is not otherwise kept alive the GC may free the object while it is still in use.

This code is full of unsafe uintptr casts. e.g., https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=833-834 casts fobj to uintptr in the argument to port_associate, which is not safe [1]. Yet fCookie references fobj and is used later in the function, so it should be keeping the object alive.

Given the amount of uintptr code here, I suspect that one of these cases fails to keep an object alive and is causing the problem. If you have trouble determining the problematic spots, one debugging technique would be to add runtime.KeepAlive calls for any potentially problematic objects.

[1] Note: you've probably seen case 4 in unsafe.Pointer about casting to uintptr in the argument to syscall functions. That rule only applies to assembly functions that have uintptr argument, or Go functions with //go:uintptrkeepalive (such as syscall.Syscall on Linux). port_associate is neither, so fobj could die during the function call.

prattmic avatar Aug 05 '22 14:08 prattmic

I suspect it would also be helpful to know in the panic("mismanaged memory") case whether the cookie is missing from the map, or the pointers mismatched (and if so what their values are. do they look like valid pointers?)

prattmic avatar Aug 05 '22 14:08 prattmic

I've pushed up a commit with some heavy-handed debugging to answer those questions at https://github.com/nshalman/sys/commit/1bbdf4655b7477087d48f28f5cf216d4a46861de Again, GOGC=off go test completes with no issues. go test will now pretty reliably trigger the error message, but I find it particularly baffling most of the time:

$ go test
Cookie received from port_get at 0xc000492750: cookie 42866
          A cookie in the jar at 0xc000492000: cookie 42886
 Last cookie port_associate'd at 0xc000492000: cookie 42886
       Address at port_associate 0xc000492000
--- FAIL: TestEventPortMemoryStress (0.58s)
panic: mismanaged memory [recovered]
        panic: mismanaged memory
<snip>
$ go test
Cookie received from port_get at 0xc00046acc0: cookie 34760
          A cookie in the jar at 0xc00046a000: cookie 34775
 Last cookie port_associate'd at 0xc00046a000: cookie 34775
       Address at port_associate 0xc00046a000
--- FAIL: TestEventPortMemoryStress (0.55s)
panic: mismanaged memory [recovered]
        panic: mismanaged memory
<snip>

So it looks like port_get is getting back a different address than the one that was provided to port_associate, but again, this odd behavior only happens when the garbage collector is enabled.

However, on one run I saw the most confusing output of all:

$ go test
Cookie received from port_get at 0xc0004306f0: cookie 61295
          A cookie in the jar at 0xc000430000: cookie 61295
 Last cookie port_associate'd at 0xc000430000: cookie 61295
       Address at port_associate 0xc000430000
--- FAIL: TestEventPortMemoryStress (0.87s)
panic: mismanaged memory [recovered]
        panic: mismanaged memory
<snip>

Instead of getting back the wrong cookie, I somehow got the right cookie but at a different address.

The address seeming to be different at port_get and port_associate being triggered by the garbage collector feels like maybe it's an OS bug rather than a golang bug, but I really don't know.

Can we add the OS-illumos tag as well as that's where I'm doing my development and testing. CC @jclulow @4ad

nshalman avatar Aug 07 '22 13:08 nshalman

Had a brief discussion with @rmustacc on IRC who helped me come up with some dtrace to look a little closer at what's going on from the OS side.

fbt:portfs:port_associate_fop:entry
{
	printf(" in: %p->%p", args[2], args[4]);
}
fbt::port_copy_event:entry
/args[1]->portkev_source == 7/
{
	printf("out: %x->%p\n", args[1]->portkev_object, args[1]->portkev_user);
}
 
fbt::port_pfp_setup:entry
{
	printf(" in: %x->%p", args[4], args[6]);
}
 14  70111         port_associate_fop:entry  in: c000130000->c000016090
 14  70103             port_pfp_setup:entry  in: c000130000->c000016090
 14  69995            port_copy_event:entry out: c000130000->c000016090
 
 14  70111         port_associate_fop:entry  in: c000130050->c0000a0780
 14  70103             port_pfp_setup:entry  in: c000130050->c0000a0780
 14  69995            port_copy_event:entry out: c000130050->c0000a0780
 
 14  70111         port_associate_fop:entry  in: c00051e000->c000016060
 14  69995            port_copy_event:entry out: c00051e000->c000516b70

I think the lack of port_pfp_setup:entry is probably significant.

Cookie received from port_get at 0xc000516b70: cookie 5796
          A cookie in the jar at 0xc000016060: cookie 5814
 Last cookie port_associate'd at 0xc000016060: cookie 5814
       Address at port_associate 0xc000016060
--- FAIL: TestEventPortMemoryStress (0.11s)
panic: mismanaged memory [recovered]
	panic: mismanaged memory

I've filed https://www.illumos.org/issues/14898 in case this is indeed a bug on the illumos end.

nshalman avatar Aug 07 '22 18:08 nshalman

This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is.

Oh sweet summer child...

I'm now convinced that this is https://www.illumos.org/issues/14898 rather than an issue with the code in x/sys/unix. I have tested https://code.illumos.org/c/illumos-gate/+/2299 (as of patchset 1) and it made this problem go away for me.

Since it will take a while for that fix to land, and longer for it to make it out to all machines in production, I'm inclined to updated the panic message to reference this issue. Perhaps something along the lines of: panic("unexpected address received from event port (issue #54254)")

I think there is a nontrivial chance that once fsnotify/fsnotify#371 lands (which this bug was preventing me from feeling comfortable landing), illumos and Solaris (which might have this same issue) systems might start tripping over this panic and I want to provide users with a hint of what to do about it.

Thank you everyone for your help guiding me into narrowing down the issue. Your feedback on the panic message is requested so that I know what to put into the CR.

nshalman avatar Aug 09 '22 01:08 nshalman

Change https://go.dev/cl/422099 mentions this issue: unix: Improve solaris event port panic message

gopherbot avatar Aug 09 '22 01:08 gopherbot

I've updated the description at the top of this issue to include some helpful text to anyone directed here from the error message proposed in https://go.dev/cl/422099

nshalman avatar Aug 25 '22 01:08 nshalman

I've tweaked the language in the description one more time now that golang/sys@c680a09 has landed. Thank you to everyone!

nshalman avatar Aug 25 '22 20:08 nshalman

Sorry to drive-by this issue with a tangential one, but was anything determined about the uintptr casts that @prattmic mentioned above might be unsafe? I'm digging a bit into #53289 and several apparently-related failure modes (plus https://github.com/oxidecomputer/omicron/issues/1130) and I'm wondering if they could be related to this one.

davepacheco avatar Oct 24 '22 22:10 davepacheco

@davepacheco The code was rewritten (https://go.dev/cl/422338) to avoid the unsafe conversions.

The exact rules can be found at https://pkg.go.dev/unsafe#Pointer.

ianlancetaylor avatar Oct 24 '22 23:10 ianlancetaylor

Thanks -- that's helpful! Do you know what release(s) that will wind up in (or is there some way for me to figure that out)?

davepacheco avatar Oct 24 '22 23:10 davepacheco

Sorry, I'm not sure exactly what you are asking. If you are asking about the changes to golang.org/x/sys/unix, the golang.org/x/sys/unix package doesn't currently have a release system. We expect people to update (go get -u) the package as needed to pick up new features and bug fixes.

(Systematic releases for the golang.org/x repos is #21324, but nobody is thinking about that as far as I know.)

ianlancetaylor avatar Oct 24 '22 23:10 ianlancetaylor

My mistake. I thought this package was tied to the Go runtime and so needed a new Go release to get the fix. (Sorry for my ignorance here...I'm a user of software written in Go but I haven't worked with Go much myself.)

davepacheco avatar Oct 24 '22 23:10 davepacheco

@davepacheco No worries. (For future notice, see https://go.dev/wiki/Questions for good places to ask questions about Go.)

ianlancetaylor avatar Oct 25 '22 03:10 ianlancetaylor