Android ANR Native lock contention: SDL_LockJoysticks (onNativePadDown) & native_release_interface
ANRs are far fewer than before, but there are still some rare occurrences.
Main thread (blocked) is doing this:
SDL_LockJoysticks
SDLControllerManager.onNativePadDown
SDLActivity.handleKeyEvent
SDLSurface.onKey
View.dispatchKeyEvent
ViewGroup.dispatchKeyEvent
DecorView.superDispatchKeyEvent
PhoneWindow.superDispatchKeyEvent
Activity.dispatchKeyEvent
SDLActivity.dispatchKeyEvent
DecorView.dispatchKeyEvent
...
Another SDL thread is doing this:
UsbDeviceConnection.native_release_interface
UsbDeviceConnection.releaseInterface
HIDDeviceUSB.close
HIDDeviceManager.closeDevice
...
Just to update this. The Joystick ANRs are much fewer now, which is a great improvement. Maybe we can close this if nothing more can be done?
I'm just seeing a few rare ones which look like this:
Main Java thread blocked: SDLSurface.onKey > SDLActivity.handleKeyEvent > SDLControllerManager.onNativePadDown > Java_org_libsdl_app_SDLControllerManager_onNativePadDown > Android_OnPadDown > SDL_LockJoysticks
Native thread: HIDDeviceManager.openDevice > HIDDeviceUSB.open > HIDDeviceUSB.close > UsbDeviceConnection.releaseInterface > UsbDeviceConnection.native_release_interface > android_hardware_UsbDeviceConnection_release_interface > usb_device_connect_kernel_driver > ioctl
or Native thread: HIDDeviceManager.closeDevice > HIDDeviceUSB.close > UsbDeviceConnection.releaseInterface > UsbDeviceConnection.native_release_interface
Okay, these look like legitimate hangs in the OS native_release_interface call, and there isn't anything we can do about them. I'll go ahead and close this, since it sounds like the SDL related hangs are fixed.
Thanks!
@AntTheAlchemist: I think I had the same ANR before and I personally use this patch on my project. also a "few ANR" depends on the number of users.
ANR rate is 0.10%, while peers ANR is 0.30% and android bad threshold is 0.47% Crash rate is 0.01%, peers 0.15% and threshold 1.09%
diff --git a/src/core/android/SDL_android.c b/src/core/android/SDL_android.c
index 33d7815ff..f8964da55 100644
--- a/src/core/android/SDL_android.c
+++ b/src/core/android/SDL_android.c
@@ -1120,7 +1136,17 @@ JNIEXPORT jboolean JNICALL SDL_JAVA_CONTROLLER_INTERFACE(onNativePadDown)(
jint device_id, jint keycode)
{
#ifdef SDL_JOYSTICK_ANDROID
- return Android_OnPadDown(device_id, keycode);
+ bool ret = false;
+
+ SDL_LockMutex(Android_ActivityMutex);
+
+ if (Android_Window) {
+ ret = Android_OnPadDown(device_id, keycode);
+ }
+
+ SDL_UnlockMutex(Android_ActivityMutex);
+
+ return ret;
#else
return false;
#endif // SDL_JOYSTICK_ANDROID
@@ -1132,7 +1158,17 @@ JNIEXPORT jboolean JNICALL SDL_JAVA_CONTROLLER_INTERFACE(onNativePadUp)(
jint device_id, jint keycode)
{
#ifdef SDL_JOYSTICK_ANDROID
- return Android_OnPadUp(device_id, keycode);
+ bool ret = false;
+
+ SDL_LockMutex(Android_ActivityMutex);
+
+ if (Android_Window) {
+ ret = Android_OnPadUp(device_id, keycode);
+ }
+
+ SDL_UnlockMutex(Android_ActivityMutex);
+
+ return ret;
#else
return false;
#endif // SDL_JOYSTICK_ANDROID
@@ -1144,7 +1180,13 @@ JNIEXPORT void JNICALL SDL_JAVA_CONTROLLER_INTERFACE(onNativeJoy)(
jint device_id, jint axis, jfloat value)
{
#ifdef SDL_JOYSTICK_ANDROID
- Android_OnJoy(device_id, axis, value);
+ SDL_LockMutex(Android_ActivityMutex);
+
+ if (Android_Window) {
+ Android_OnJoy(device_id, axis, value);
+ }
+
+ SDL_UnlockMutex(Android_ActivityMutex);
#endif // SDL_JOYSTICK_ANDROID
}
@@ -1154,7 +1196,13 @@ JNIEXPORT void JNICALL SDL_JAVA_CONTROLLER_INTERFACE(onNativeHat)(
jint device_id, jint hat_id, jint x, jint y)
{
#ifdef SDL_JOYSTICK_ANDROID
- Android_OnHat(device_id, hat_id, x, y);
+ SDL_LockMutex(Android_ActivityMutex);
+
+ if (Android_Window) {
+ Android_OnHat(device_id, hat_id, x, y);
+ }
+
+ SDL_UnlockMutex(Android_ActivityMutex);
#endif // SDL_JOYSTICK_ANDROID
}
diff --git a/src/joystick/SDL_joystick.c b/src/joystick/SDL_joystick.c
index 5f7734847..c4e2763cf 100644
--- a/src/joystick/SDL_joystick.c
+++ b/src/joystick/SDL_joystick.c
@@ -650,10 +650,17 @@ void SDL_LockJoysticks(void)
void SDL_UnlockJoysticks(void)
{
- bool last_unlock = false;
-
--SDL_joysticks_locked;
+#ifdef __ANDROID__
+ /* On Android, let's keep the mutex alive:
+ * Because the SDLActivity may be trying and waiting in SDL_LockJoysticks(),
+ * while we are about to destroying it.
+ * (this would trigger "pthread_mutex_lock called on a destroyed mutex")
+ */
+ SDL_UnlockMutex(SDL_joystick_lock);
+ /* Don't call SDL_DestroyMutex() */
+#else
if (!SDL_joysticks_initialized) {
// NOTE: There's a small window here where another thread could lock the mutex after we've checked for pending locks
if (!SDL_joysticks_locked && SDL_GetAtomicInt(&SDL_joystick_lock_pending) == 0) {
@@ -678,6 +685,7 @@ void SDL_UnlockJoysticks(void)
} else {
SDL_UnlockMutex(SDL_joystick_lock);
}
+#endif
}
@1bsyl that's an interesting patch. Would SDL benefit from doing that?
My crash rate is 0.02%. ANR rate 0.21% which is much better than before, but still a bit close to the bad behaviour threshold. Twice as bad as your results. Android TV is what pushes mine up.
I think the patch isn't ok for some specific usage of Sam's apps. but for usual thing, that's ok, I've used it for years. (related issue: #7811)
@AntTheAlchemist, does @1bsyl's patch help?
@1bsyl, do you need the Android_Window check in there? What happens if you omit that?
@AntTheAlchemist, does @1bsyl's patch help?
It will take time to find out, as I need to upload and watch the reports on Play Console, but won't be 100% sure if something else has influenced it in that time. I'll give it a try, but @1bsyl seems to have a better grasp of what's going on.
Yes, please let me know. Thanks!
@sam The Android_Window check doesn't seem mandatory on its own. the variable Android_window isn't used. we just check that the window has been created and is not destroyed (nor in creating or destroying phase). maybe that avoids sending the PadUp events at bad times.
but here, the first message seems to be the SDL_LockJoystick().
I remember this issue and this was partially fixed, see prev discussion and the comment
// NOTE: There's a small window here where another thread could lock the mutex after we've checked for pending locks
I remember some real time OS with some "inversion of priority". or some scheduling feature. like: thread A: wait for mutex. thread B: is unlocking the mutex. os: as soon as thread B unlock the mutex, the scheduler, push thread A in running state.
that would definitively help the bug to occur. ... but that would be pthread_mutex_lock called on a destroyed mutex
maybe, here, we should see which is the other thread have the lock
@1bsyl I'm not sure how to apply this patch. Is there a PR for it?
@AntTheAlchemist copy the text into a file: foo.txt
cd SDL
patch -p1 < foo.txt
@1bsyl I have no patch command on my system (Windows 10, no Github). I'll figure something out.
Do we need to lock/unlock if Android_Window is null?
I would remove the Android_Window checks entirely, just for testing.
I get "error C2065: 'last_unlock': undeclared identifier" if not compiling for Android, but I kind of get the idea. Ready to test in production. Off to Morocco, back in 11 nights!
Off to Morocco, back in 11 nights!
Have fun!
Hey @AntTheAlchemist , do you have any news on this ?
Hi @1bsyl . I was still seeing these ANRs while I was on vacation, with no change, which made me suspect my new release had missed something. I've uploaded a couple more releases since and am still observing. It's either cured, or it's too early to say. So far I see just one ANR in HIDDeviceUSB.close, which seems unrelated. I need more time to get reports from the latest release.
Okay, time has passed and I'm seeing the odd ANR and crash reports now. I'm very confused because: I forgot to include the patched code in my latest release and I am not seeing any gamepad / joystick related ANRs now. I see them in older versions going back years, but not in my latest releases since trying out the patch and subsequently forgetting patch. Either it's a fluke, or something else changed to fix it?.
TL;DR: Without the patch, all is okay with this bug now. It seems to have fixed itself 👻
@AntTheAlchemist, some anr/crash are weird and maybe only side effect of heap corruption.
what you can test I guess:
- without the patch.
- put a SDL_Delay after the Renderer/Window destruction.
- play with the hat, to trigger Android_OnHat() function to see if it crashes...