go icon indicating copy to clipboard operation
go copied to clipboard

x/mobile: rendering freeze

Open changkun opened this issue 2 years ago • 8 comments

What version of Go are you using (go version)?

$ go version
go version devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/changkun/Library/Caches/go-build"
GOENV="/Users/changkun/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/changkun/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changkun/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/changkun/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/changkun/sdk/gotip/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="devel go1.18-9133245be7 Sat Sep 4 20:35:25 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/changkun/dev/changkun.de/fyne/iosrend/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g2/6fmr1qzx0ns3shq74zrp6bd40000gn/T/go-build130131505=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

This issue is observed in practice, see https://github.com/hajimehoshi/ebiten/issues/1332, https://github.com/hajimehoshi/ebiten/issues/1322, https://github.com/fyne-io/fyne/issues/950.

While working on an (almost) reproducer. The problem seems to happen more often on iOS/Android and was not observed on the desktop.

A recent debugging on the rendering freeze seems to suggest gl package might block forever when using golang.org/x/mobile/app and golang.org/x/mobile/gl packages, as the rendering loop, stops processing GL calls after a while and the workAvaliable fails to receive more events.

Furthermore, according to a discussion in CL 350212, it seems #48433 might be related.

What did you expect to see?

No rendering freeze

What did you see instead?

Rendering freeze

changkun avatar Sep 17 '21 13:09 changkun

Change https://golang.org/cl/350212 mentions this issue: gl: fix dead lock between package app and gl

gopherbot avatar Sep 17 '21 13:09 gopherbot

/cc @hyangah

cagedmantis avatar Sep 20 '21 15:09 cagedmantis

Change https://golang.org/cl/351769 mentions this issue: gl: fix rendering freeze between app and gl

gopherbot avatar Sep 23 '21 14:09 gopherbot

I simplified the DoWork function a little bit to the following (Removed cgo batching):

func (ctx *context) DoWork() {
	for {
		select {
		case w := <-ctx.work:
			ret := C.processFn((*C.struct_fnargs)(unsafe.Pointer(&w.args)), (*C.char)(w.parg))
			if w.blocking {
				ctx.retvalue <- ret
			}
		default:
			return
		}
	}
}

And also captured some information from stack traces, it looks the problem is caused by a cgo call that has not yet returned.

(dlv) gr 5
Switched from 0 to 5 (thread 2652429)
(dlv) frame 0
> runtime.systemstack_switch() ./sdk/gotip/src/runtime/asm_amd64.s:311 (PC: 0x4066560)
Warning: debugging optimized function
Frame 0: ./sdk/gotip/src/runtime/asm_amd64.s:311 (PC: 4066560)
   306: // of the G stack. We need to distinguish the routine that
   307: // lives at the bottom of the G stack from the one that lives
   308: // at the top of the system stack because the one at the top of
   309: // the system stack terminates the stack walk (see topofstack()).
   310: TEXT runtime·systemstack_switch(SB), NOSPLIT, $0-0
=> 311:         RET
   312:
   313: // func systemstack(fn func())
   314: TEXT runtime·systemstack(SB), NOSPLIT, $0-8
   315:         MOVQ    fn+0(FP), DI    // DI = fn
   316:         get_tls(CX)
(dlv) up
> runtime.systemstack_switch() ./sdk/gotip/src/runtime/asm_amd64.s:311 (PC: 0x4066560)
Warning: debugging optimized function
Frame 1: ./sdk/gotip/src/runtime/cgocall.go:167 (PC: 40054e5)
   162:         // sync preemption will succeed (though this is not a matter
   163:         // of correctness).
   164:         osPreemptExtEnter(mp)
   165:
   166:         mp.incgo = true
=> 167:         errno := asmcgocall(fn, arg)
   168:
   169:         // Update accounting before exitsyscall because exitsyscall may
   170:         // reschedule us on to a different M.
   171:         mp.incgo = false
   172:         mp.ncgo--

From the goroutine stack trace side, it looks like the goroutine is not yet returned from a Cgo call. The relevant thread backtrace (from darwin/amd64) is:

  thread #8
    frame #0: 0x00007fff204bb2ba libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fff204bb62c libsystem_kernel.dylib`mach_msg + 60
    frame #2: 0x00007fff22caa793 IOKit`io_connect_method + 384
    frame #3: 0x00007fff22caa5e9 IOKit`IOConnectCallMethod + 244
    frame #4: 0x00007fff2846ceb1 IOAccelerator`IOAccelResourceCreate + 124
    frame #5: 0x00007fff4643c077 libGPUSupportMercury.dylib`gpusGetKernelTexture + 380
    frame #6: 0x000000003a63312b AMDRadeonX4000GLDriver`glrATI_SI_LoadSysTextureStandard + 5982
    frame #7: 0x000000003a63012e AMDRadeonX4000GLDriver`glrUpdateTexture + 724
    frame #8: 0x00007fff4643bc58 libGPUSupportMercury.dylib`gldModifyTexSubImage + 180
    frame #9: 0x00007fff6be14465 GLEngine`glTexImage2D_Exec + 1873
    frame #10: 0x00007fff6bdf3195 libGL.dylib`glTexImage2D + 55
    frame #11: 0x000000000439211f iosrend`processFn + 639
    frame #12: 0x0000000004391e81 iosrend`_cgo_647ba9049754_Cfunc_processFn + 33
    frame #13: 0x0000000004068484 iosrend`runtime.asmcgocall.abi0 + 100
    frame #14: 0x00000000040665c9 iosrend`runtime.systemstack.abi0 + 73
    frame #15: 0x00007fff2031a82f libsystem_malloc.dylib`szone_size + 45
    frame #16: 0x00000000042567ed iosrend`golang.org/x/mobile/gl._Cfunc_processFn.abi0 + 77
    frame #17: 0x0000000004256b05 iosrend`golang.org/x/mobile/gl.(*context).DoWork + 229
    frame #18: 0x0000000004373a2a iosrend`golang.org/x/mobile/app.(*app).loop + 202
    frame #19: 0x0000000004373b8a iosrend`golang.org/x/mobile/app.startloop.func1 + 42
    frame #20: 0x00000000040687a1 iosrend`runtime.goexit.abi0 + 1
    frame #21: 0x0000000004256b05 iosrend`golang.org/x/mobile/gl.(*context).DoWork + 229
    frame #22: 0x0000000004373a2a iosrend`golang.org/x/mobile/app.(*app).loop + 202
    frame #23: 0x0000000004373b8a iosrend`golang.org/x/mobile/app.startloop.func1 + 42
    frame #24: 0x00000000040687a1 iosrend`runtime.goexit.abi0 + 1
    frame #25: 0x0000000004373a2a iosrend`golang.org/x/mobile/app.(*app).loop + 202
    frame #26: 0x0000000004373b8a iosrend`golang.org/x/mobile/app.startloop.func1 + 42
    frame #27: 0x00000000040687a1 iosrend`runtime.goexit.abi0 + 1
    frame #28: 0x0000000004373b8a iosrend`golang.org/x/mobile/app.startloop.func1 + 42
    frame #29: 0x00000000040687a1 iosrend`runtime.goexit.abi0 + 1
    frame #30: 0x00000000040687a1 iosrend`runtime.goexit.abi0 + 1

The relevant thread backtrace from darwin/arm64:

  thread #4
    frame #0: 0x00000001bb04d9b4 libsystem_kernel.dylib`mach_msg_trap + 8
    frame #1: 0x00000001bb04dd60 libsystem_kernel.dylib`mach_msg + 76
    frame #2: 0x00000001bdb28dd0 IOKit`io_connect_method + 440
    frame #3: 0x00000001bdb28bec IOKit`IOConnectCallMethod + 236
    frame #4: 0x00000001d5b1b970 IOGPU`IOGPUResourceCreate + 224
    frame #5: 0x00000001d5b159f8 IOGPU`-[IOGPUMetalResource initWithDevice:remoteStorageResource:options:args:argsSize:] + 476
    frame #6: 0x00000001d5b18ab8 IOGPU`-[IOGPUMetalTexture initWithBuffer:descriptor:sysMemOffset:sysMemRowBytes:vidMemSize:vidMemRowBytes:args:argsSize:isStrideTexture:] + 616
    frame #7: 0x00000001d5b1925c IOGPU`-[IOGPUMetalTexture initWithPrimaryBuffer:heapIndex:bufferIndex:bufferOffset:length:descriptor:sysMemRowBytes:vidMemSize:vidMemRowBytes:args:argsSize:] + 96
    frame #8: 0x000000021c45bc10 AGXMetal13_3`___lldb_unnamed_symbol4663$$AGXMetal13_3 + 468
    frame #9: 0x0000000134553aa8 AppleMetalOpenGLRenderer`GLDTextureRec::loadObj() + 3156
    frame #10: 0x000000013458a960 AppleMetalOpenGLRenderer`gldModifyTexSubImage + 160
    frame #11: 0x000000021bd9a7e8 GLEngine`glTexImage2D_Exec + 1944
    frame #12: 0x000000021bd7b8d0 libGL.dylib`glTexImage2D + 80
    frame #13: 0x00000001047563b4 iosrend`processFn(args=0x000001400039dd38, parg=<unavailable>) at work.c:136:3 [opt]

As far as I can confirm this occurs in all arch of macOS (amd64, arm64), iOS (arm64). I first thought this might be a driver issue, but since Android is also reported in arm64, by @hajimehoshi. It seems unlikely to have a driver issue on two different platforms (?), especially the backend if different (darwin/arm64 is using Metal as backend, apparently Android can't use that).

From the stack traces, the io_connect_method is waiting for something, the GPU???

Any hint would be highly appreciated...

changkun avatar Sep 28 '21 07:09 changkun

This is a feedback from Apple (9665054):

Getting a hang, or long pauses here, usually means there's a Metal resource, or GL texture leak. The kernel side of IOGPUFamily will eventually start stalling for long periods of time once we reach the per-process limit on the number of resources created by the process (which is in the tens of thousands).

changkun avatar Jan 15 '22 09:01 changkun

https://github.com/golang/go/issues/52116 Isn't this related?

hajimehoshi avatar Aug 02 '22 12:08 hajimehoshi

@hajimehoshi could you maybe elaborate a bit more on how they are related?

changkun avatar Aug 06 '22 08:08 changkun

NVM, #52116 was a crash rather than freezing so this was not related

hajimehoshi avatar Aug 06 '22 15:08 hajimehoshi