BGFX macOS 11.1 Metal backend crash/freeze/no render

Open SergeySave opened this issue 4 years ago • 8 comments
Environment

  • LWJGL version: 3.2.3 and 3.2.4-SNAPSHOT
  • LWJGL build #: Unknown
  • Java version: 14.0.1-open
  • Platform: macOS 11.1
  • Module: bgfx

Description

The BGFX Metal backend does not work on macOS 11.1. I tested the commit on top of the master branch of BGFX (at the time of writing, that is this commit), and the examples run perfectly with the Metal backend on this machine. However, the problem shows up when testing both the most recent commit (currently this one) and a slightly modified version of the last commit that used LWJGL version 3.2.3 (this one).

(Note: I only tested the Metaballs example because it was the most similar to the program where I'd discovered this problem.) Running the most recent commit results in the program getting stuck calling bgfx_init (this line). With the older version (where I modified the pom.xml so that lwjgl.version was 3.2.3 and joml.version was 1.9.25 in order to get it to compile), the window shows up but nothing renders to it. Additionally (while I didn't see this with the Metaballs example), my own code would crash rather than display nothing when trying to render a specific thing.

I suspect that it has something to do with a macOS update, as the Metal backend worked for me about a year ago.

SergeySave avatar Jan 19 '21 22:01 SergeySave

Thanks @SergeySave. I'm busy with other stuff atm, but I have reproduced this. Will update when I'm able to test more.

Spasi avatar Feb 25 '21 17:02 Spasi

Looks like bkaradzic/bgfx#2036. The suggested workaround can be implemented using LWJGL's Objective-C Runtime bindings, like this:

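import org.lwjgl.bgfx.BGFXInit;
import org.lwjgl.system.macosx.ObjCRuntime;

import static org.lwjgl.glfw.GLFWNativeCocoa.glfwGetCocoaWindow;
import static org.lwjgl.system.JNI.invokePPP;
import static org.lwjgl.system.JNI.invokePPPV;
import static org.lwjgl.system.macosx.ObjCRuntime.objc_getClass;
import static org.lwjgl.system.macosx.ObjCRuntime.sel_getUid;

BGFXInit init /*= ...*/;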

// replace this:
init.platformData()
    .nwh(glfwGetCocoaWindow(window));

// with this:
long objc_msgSend = ObjCRuntime.getLibrary().getFunctionAddress("objc_msgSend");

long layer = invokePPP(objc_getClass("CAMetalLayer"), sel_getUid("alloc"), objc_msgSend);
invokePPP(layer, sel_getUid("init"), objc_msgSend);

long contentView = invokePPP(glfwGetCocoaWindow(window), sel_getUid("contentView"), objc_msgSend);
invokePPPV(contentView, sel_getUid("setLayer:"), layer, objc_msgSend);

init.platformData()
    .nwh(layer);

It works for me, but I'm not sure what the implications will be. It replaces the layer created by GLFW internally, with a layer that GLFW knows nothing about.

Spasi avatar Feb 28 '21 15:02 Spasi

The above workaround doesn't work anymore, as bgfx has gained more places it attempts to call into the main thread on macOS. Issues on the bgfx repo are a bit of a red herring, as they are in the context of a macOS application that uses a proper NSApplication, but initializes bgfx incorrectly resulting in a deadlock.

In Java land, with -XstartOnFirstThread (as is the advice when using LWJGL for OpenGL directly), we don't have an NSApplication, and therefore no main loop. bgfx's calls will never return no matter what — it is not a deadlock, the main loop simply does not exist. The actual fix to this is that you must not use -XstartOnFirstThread alongside bgfx, as bgfx expects there to be a main loop on macOS. (This is why LWJGL issues on the bgfx repo are being closed with a nebulous response of "use threading correctly")

If you do not pass -XstartOnFirstThread, the JVM will set up the main thread for you correctly under the expectation you may use AWT in the future, but it does not initialize the toolkit — only the main loop. From here, making this work is as simple as choosing glfw_async as your GLFW implementation library, so that GLFW also calls into the main loop like bgfx does.

The bgfx examples should probably be updated based on this information. As it stands they are misleading, since they're direct transliterations of the bgfx samples without bgfx's entry system, which hides this important detail on macOS — namely, that you need an NSApplication (or other means of running a main loop) and you must init bgfx off-thread.


TL;DR: The above workaround is outdated. The real fix is:

  1. Do not use -XstartOnFirstThread - it is incompatible with bgfx
  2. Put this somewhere before your GLFW init:
if (Platform.get() == Platform.MACOSX) {
    Configuration.GLFW_LIBRARY_NAME.set("glfw_async");
}
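As a rough illustration of the rule above, the choice of GLFW build can be derived from how the JVM was launched. This is only a sketch using the standard library: `GlfwLibraryChoice` and its methods are hypothetical names, `"glfw"` is assumed to be the default library name, and the `JAVA_STARTED_ON_FIRST_THREAD_<pid>` environment variable is assumed to be what the macOS launcher exports when `-XstartOnFirstThread` is passed.

```java
import java.util.Locale;

/** Sketch: decide which GLFW build to request, based on how the JVM was launched. */
final class GlfwLibraryChoice {

    /** Pure decision function, kept separate so it can be checked without a Mac. */
    static String chooseGlfwLibrary(String osName, boolean startedOnFirstThread) {
        boolean isMac = osName.toLowerCase(Locale.ROOT).contains("mac");
        // Without -XstartOnFirstThread on macOS, Cocoa calls must be forwarded to
        // the main thread, which is what the glfw_async build does.
        return (isMac && !startedOnFirstThread) ? "glfw_async" : "glfw";
    }

    /** Assumption: the macOS launcher exports JAVA_STARTED_ON_FIRST_THREAD_<pid>=1. */
    static boolean startedOnFirstThread() {
        String key = "JAVA_STARTED_ON_FIRST_THREAD_" + ProcessHandle.current().pid();
        return "1".equals(System.getenv(key));
    }

    public static void main(String[] args) {
        String name = chooseGlfwLibrary(System.getProperty("os.name"), startedOnFirstThread());
        // In an LWJGL program this value would be handed to
        // Configuration.GLFW_LIBRARY_NAME.set(name) before the first GLFW call.
        System.out.println("GLFW library to request: " + name);
    }
}
```

Keeping the decision in a pure function means it can be exercised on any platform, while the environment lookup stays a thin, replaceable wrapper.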

unascribed avatar Jun 22 '22 20:06 unascribed

@unascribed I'm not sure I understand. We do have an NSApplication/main loop in Java; it is created by GLFW. The above fix is supposed to be used in the context of a properly configured GLFW application. Running with -XstartOnFirstThread should make everything (Java main thread, NSApp's main loop, bgfx calls) run in the main/first thread.

However, I haven't followed the progress of bgfx's Metal backend recently and might be missing important details. Could you please explain what exactly has changed?

Btw, applications using the glfw_async build are able to call GLFW functions from any thread. Cocoa calls are then internally dispatched to the main thread, asynchronously. Is that what bgfx does now too?

Spasi avatar Jun 23 '22 08:06 Spasi

I suppose I misunderstood something about GLFW, then; I was under the impression it does not set up the main loop — bgfx samples/etc create their own NSApplication.

bgfx started using libdispatch a while ago alongside its own semaphore system (rather than dispatch_sync, which detects this case and degrades to a direct call), meaning it deadlocks if called on-thread. The original workaround dodged the initial implementation of this code, as only the initialization of a layer if an NSWindow was provided used dispatch. In modern bgfx, the entire Metal/OpenGL initialization is contained in a dispatch.
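The failure mode described above can be modelled in plain Java. The following is a toy sketch, not bgfx or libdispatch code: a single-threaded executor stands in for the macOS main queue, and `MainQueueDeadlockDemo` with all of its methods are hypothetical names. A timeout is used so the "deadlock" shows up as a `false` result instead of hanging forever.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy model of "dispatch to the main queue, then block on a semaphore". */
final class MainQueueDeadlockDemo {

    // A single daemon thread stands in for the macOS main queue.
    private final ExecutorService mainQueue = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "fake-main");
        t.setDaemon(true);
        return t;
    });
    private volatile Thread mainThread;

    MainQueueDeadlockDemo() {
        try {
            // Record which thread services the queue.
            mainQueue.submit(() -> { mainThread = Thread.currentThread(); }).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Unguarded: enqueue the work, then wait for it. Returns false if the wait timed out. */
    boolean dispatchAndWait(Runnable work, long timeoutMs) {
        Semaphore done = new Semaphore(0);
        mainQueue.execute(() -> { work.run(); done.release(); });
        try {
            return done.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Guarded: degrade to a direct call when already on the queue's own thread. */
    boolean guardedDispatchAndWait(Runnable work, long timeoutMs) {
        if (Thread.currentThread() == mainThread) {
            work.run();
            return true;
        }
        return dispatchAndWait(work, timeoutMs);
    }

    /** Issues the call from the fake main thread itself and reports whether it completed. */
    boolean callFromMain(boolean guarded) {
        Future<Boolean> result = mainQueue.submit(() ->
            guarded ? guardedDispatchAndWait(() -> {}, 200) : dispatchAndWait(() -> {}, 200));
        try {
            return result.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        MainQueueDeadlockDemo demo = new MainQueueDeadlockDemo();
        System.out.println("unguarded, from main thread: " + demo.callFromMain(false));
        System.out.println("guarded, from main thread:   " + demo.callFromMain(true));
        System.out.println("unguarded, from other thread: " + demo.dispatchAndWait(() -> {}, 5000));
    }
}
```

In the unguarded case the queued work can never start, because the only thread that could run it is the one blocked on the semaphore. That is the same shape as calling a dispatch-and-wait routine from the thread that owns the main queue.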

This breaks the case of using -XstartOnFirstThread with a GLFW-managed main loop, as you will deadlock upon initializing bgfx. You can try to initialize your own thread and use that, but it is very difficult to impossible to call back into the main thread from Java, as libdispatch is kind of a trash fire: there is no exported symbol for dispatch_get_main_queue; instead there is a symbol _dispatch_main_q, which cannot be used via JNA as it's a global value, and even if you can retrieve it, the libdispatch header uses a macro that does... something. CoreFoundation is no better (it expects an Objective-C block, with no C function variant), meaning you can never again call main-thread-only methods (such as glfwTerminate). Since the approach without -XstartOnFirstThread works, though, that seems to me like a better solution in any case, as it avoids needing to pass an extra, obscure, and OS-specific JVM arg.

bgfx seems to contain a guard for this (it calls [NSThread isMainThread]), but it does not work here for whatever reason. I assumed the reason would be that the thread is not tagged as the main thread as it is not running the main event loop. And I suppose from a practical standpoint it's not even possible that GLFW would be managing this for you (at least not prior to calling glfwPollEvents in your own main loop), as then glfwInit would never return... I suppose this then calls into question whether this would work with -XstartOnFirstThread if you were to initialize bgfx only after starting your own event loop. That would be extremely awkward, though, to say the least...

Using glfw_async from the JVM thread (i.e. without -XstartOnFirstThread) allows it to work, precisely because it's dispatching Cocoa calls to the main thread in the same fashion as bgfx (or at least a similar fashion; there are at least 3 ways to call into the main thread on macOS). It also makes the above layer workaround unnecessary.

I'll note this issue also affects the bgfx OpenGL backend on macOS now, and is no longer Metal-specific. (And it never had anything to do with macOS versions, just bgfx — I tested this on a macOS 10.13 "High Sierra" system.)


All of this (including my earlier post) is the result of banging my head against this for multiple days, trying to get my LWJGL3 + bgfx + NanoVG game to run under macOS. The glfw_async + no -XstartOnFirstThread solution was the only one that actually worked. I tried just about every solution I could find or think of (including forking bgfx and removing places where it uses libdispatch as an attempt at debugging), and possibly came up with some bad assumptions along the way. My knowledge of macOS threading comes from writing iOS apps, and while iOS shares macOS's threading model for the most part, there are some differences.

unascribed avatar Jun 23 '22 08:06 unascribed

Following along here, adding some information in case this is helpful.

I have been using -XstartOnFirstThread with the earlier CAMetalLayer solution to run an app on macOS 12.3.1 with LWJGL 3.3.1. This is with bgfx metal (and I also use NanoVG). I'm running Eclipse Temurin 17.0.2+8 on an M1 aarch64.

I saw your post and tried running with glfw_async and dropping the -XstartOnFirstThread argument. When I run that way, my initialization goes fine but the program crashes shortly thereafter. My main loop is using glfwWaitEventsTimeout and bgfx_frame(false);.

Crash details are pasted below. This seems to be caused by my call to glfwWaitEventsTimeout - if I comment that out, then my window comes up and draws itself just fine. If I replace it with glfwPollEvents() - then it also seems to work fine.

How are you getting input events - are you using glfwPollEvents()? I suppose I could change to using this and implement timing another way (I had been using glfwWaitEventsTimeout because VSYNC did not seem to be working on earlier bgfx version - I have not tested again recently). But this crash makes me slightly nervous about dropping -XstartOnFirstThread in case there are other calls like glfwWaitEventsTimeout that will fail in this way. My general understanding had always been that it's safer to work with -XstartOnFirstThread when doing non-AWT Java that will interact with the UI layer.
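One way to "implement timing another way", as mentioned above, is a small frame pacer built on the standard library, so the loop can call glfwPollEvents() and sleep on its own. This is a minimal sketch under that assumption; `FramePacer` is a hypothetical helper, not part of LWJGL, GLFW, or bgfx.

```java
import java.util.concurrent.locks.LockSupport;

/** Sketch of a fixed-rate frame pacer that does not rely on GLFW's blocking wait. */
final class FramePacer {
    private final long frameNanos;
    private long nextDeadline;

    FramePacer(double targetFps) {
        this.frameNanos = (long) (1_000_000_000L / targetFps);
        this.nextDeadline = System.nanoTime() + frameNanos;
    }

    /** Sleeps until the current frame deadline, then schedules the next one. */
    void awaitNextFrame() {
        long remaining;
        while ((remaining = nextDeadline - System.nanoTime()) > 0) {
            if (Thread.currentThread().isInterrupted()) {
                return; // let the caller decide how to shut down
            }
            LockSupport.parkNanos(remaining); // may wake early, hence the loop
        }
        long now = System.nanoTime();
        nextDeadline += frameNanos;
        if (nextDeadline - now < 0) {
            // We fell behind: restart the schedule from now instead of bursting to catch up.
            nextDeadline = now + frameNanos;
        }
    }

    public static void main(String[] args) {
        FramePacer pacer = new FramePacer(60.0);
        for (int frame = 0; frame < 10; frame++) {
            // Real code: glfwPollEvents(); draw; bgfx_frame(false);
            pacer.awaitNextFrame();
        }
        System.out.println("paced 10 frames");
    }
}
```

Unlike glfwWaitEventsTimeout, this never blocks inside GLFW, so input latency is bounded by the frame interval chosen here.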

The GLFW documentation for both glfwPollEvents() and glfwWaitEventsTimeout does say, under "Thread safety": "This function must only be called from the main thread." Perhaps glfw_async should deal with that, but it doesn't seem to, as the crash below comes out of libglfw_async.dylib.

If you want to reference my application setup code that works with -XstartOnFirstThread, see here: https://github.com/mcslee/GLX/blob/master/src/main/java/heronarts/glx/GLX.java

  • https://github.com/mcslee/GLX/blob/master/src/main/java/heronarts/glx/GLX.java#L245 - this initializes the window
  • https://github.com/mcslee/GLX/blob/master/src/main/java/heronarts/glx/GLX.java#L520 - this is my main event-polling + drawing loop
  • https://github.com/mcslee/GLX/blob/master/src/main/java/heronarts/glx/GLX.java#L484 - platform-specific stuff

Thank you both @unascribed and @Spasi for putting time into this. I would also love to understand more deeply what the best long-term solution is here.

2022-06-23 12:01:44.711 java[11652:10458930] *** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: 'nextEventMatchingMask should only be called from the Main Thread!'
*** First throw call stack:
(
	0   CoreFoundation                      0x000000019476ab08 __exceptionPreprocess + 240
	1   libobjc.A.dylib                     0x00000001944b5e14 objc_exception_throw + 60
	2   AppKit                              0x0000000197246f94 -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 4256
	3   libglfw_async.dylib                 0x0000000121bbd47c libglfw_async.dylib + 87164
	4   ???                                 0x000000010ab058ac 0x0 + 4474296492
	5   ???                                 0x000000010ab01fc8 0x0 + 4474281928
	6   ???                                 0x000000010ab01fc8 0x0 + 4474281928
	7   ???                                 0x000000010ab01fc8 0x0 + 4474281928
	8   ???                                 0x000000010ab01fc8 0x0 + 4474281928
	9   ???                                 0x000000010ab01fc8 0x0 + 4474281928
	10  ???                                 0x000000010aafc140 0x0 + 4474257728
	11  libjvm.dylib                        0x00000001024fe8f0 _ZN9JavaCalls11call_helperEP9JavaValueRK12methodHandleP17JavaCallArgumentsP10JavaThread + 908
	12  libjvm.dylib                        0x00000001025630e4 _ZL17jni_invoke_staticP7JNIEnv_P9JavaValueP8_jobject11JNICallTypeP10_jmethodIDP18JNI_ArgumentPusherP10JavaThread + 300
	13  libjvm.dylib                        0x0000000102566778 jni_CallStaticVoidMethod + 304
	14  libjli.dylib                        0x0000000100e99378 JavaMain + 2516
	15  libjli.dylib                        0x0000000100e9b6e8 ThreadJavaMain + 12
	16  libsystem_pthread.dylib             0x000000019461e26c _pthread_start + 148
	17  libsystem_pthread.dylib             0x000000019461908c thread_start + 8
)
libc++abi: terminating with uncaught exception of type NSException

mcslee avatar Jun 23 '22 19:06 mcslee

Hey @mcslee,

Turns out that glfwWaitEvents and glfwWaitEventsTimeout were not properly dispatched to the main thread with glfw_async. The next nightly build will contain the fix.

Until then, you may download the latest libglfw_async.dylib (arm64) to try it out (set Configuration.GLFW_LIBRARY_NAME to the downloaded file's path).

Spasi avatar Jun 24 '22 17:06 Spasi

Looks like this is still only available in nightly builds, as 3.3.1 was released in Feb 2022.

This is also relevant for libgdx, where on v1.11.0 I am currently experiencing exactly this freeze/no render on a MacBook Pro M1 (macOS 12.5 Monterey) (https://github.com/libgdx/libgdx/issues/6872). Debugging it led me to the call GLFW.glfwPollEvents();, where a native invocation never returns. However, you mention that only glfwWaitEvents and glfwWaitEventsTimeout were not properly dispatched to the main thread. So is it the same issue, or might I be experiencing a different one?

I found in a related thread that org.lwjgl.system.Configuration.GLFW_LIBRARY_NAME.set("glfw_async"); might help; however, this immediately crashes when added to main with:

Process finished with exit code 133 (interrupted by signal 5: SIGTRAP)

judos avatar Aug 09 '22 12:08 judos

Hello everyone,

I did some experimenting related to the above and indeed, as @unascribed explained, bgfx assumes that it always runs from a secondary thread on macOS. The issue is not Metal-specific, it also applies to the OpenGL backend.

The first workaround is to use the glfw_async implementation, without -XstartOnFirstThread, then proceed as usual. The main Java thread (which is NOT the first/UI thread in this mode) can be used for both GLFW and bgfx code.

However, the normal GLFW implementation can also be used, with -XstartOnFirstThread, with only one requirement: bgfx must run on a secondary thread. The above commit (see the HelloBGFXMT demo) implements this approach. I adapted the code from the Vorbis demo, which also showcases multi-threaded event-loop/rendering.

Of the two approaches, I would highly recommend the multi-threaded one. It is how both GLFW and bgfx are supposed to be used. The bgfx examples also follow this approach.
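The shape of the multi-threaded approach can be sketched with the standard library alone. This is not the HelloBGFXMT demo code: `TwoThreadLoopSketch` and its fields are hypothetical names, and the GLFW and bgfx calls appear only as comments marking where they would go.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

/** Skeleton: event loop on the main thread, rendering on a secondary thread. */
final class TwoThreadLoopSketch {
    final AtomicBoolean running = new AtomicBoolean(true);
    final AtomicInteger eventsPolled = new AtomicInteger();
    final AtomicInteger framesRendered = new AtomicInteger();

    /** Pumps the given number of event iterations on the caller's thread, then shuts down. */
    void run(int eventIterations) {
        CountDownLatch rendererReady = new CountDownLatch(1);
        Thread renderThread = new Thread(() -> {
            // Real code: bgfx_init(...) goes here, on the secondary thread.
            rendererReady.countDown();
            do {
                // Real code: submit draw calls, then bgfx_frame(false).
                framesRendered.incrementAndGet();
                LockSupport.parkNanos(1_000_000L); // stand-in for frame pacing
            } while (running.get());
            // Real code: bgfx_shutdown() goes here, still on this thread.
        }, "render");
        renderThread.start();

        try {
            rendererReady.await(); // wait until the renderer has initialized
            for (int i = 0; i < eventIterations; i++) {
                // Real code: glfwWaitEvents() or glfwPollEvents(), main thread only.
                eventsPolled.incrementAndGet();
                Thread.sleep(1);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            running.set(false); // plays the role of the window-close request
        }
        try {
            renderThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        TwoThreadLoopSketch sketch = new TwoThreadLoopSketch();
        sketch.run(20);
        System.out.println("events polled: " + sketch.eventsPolled.get()
            + ", frames rendered: " + sketch.framesRendered.get());
    }
}
```

The key property is that the two loops never share a thread: window and input calls stay on the thread that started the program, while everything bgfx-related lives and dies on the render thread.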

Spasi avatar Mar 10 '23 16:03 Spasi