Detox icon indicating copy to clipboard operation
Detox copied to clipboard

Signal 11 crash still present after SIMCTL_CHILD_NSZombieEnabled=YES + latest Firebase version

Open shamilovtim opened this issue 2 years ago • 47 comments

Description

A clear and concise description of what the bug is.

  • [x] I have tested this issue on the latest Detox release and it still reproduces

Reproduction

We still see Signal 11 crashes in our project even after adding both SIMCTL_CHILD_NSZombieEnabled=YES detox test -c ios.sim.release as well as updating firebase + firebase perf to the latest versions and ensuring that GoogleUtilities are at 7.7+.

The error is commonly reproduced on CircleCI and is incredibly difficult / practically impossible to reproduce locally.

In order to help debug I've provided a list of all of the dependencies in our project. We use Expo 43 (no expo updates), React Native 0.65.2 and Codepush (probably unrelated, the error happens in the middle of a test, not at the beginning). I've found that the Stripe iOS SDK (via tipsi-stripe) is a source of method swizzling. Also, while Flipper is a source of method swizzling, we are building a release project so Flipper should be excluded.

Let me know if there's anything else I can provide to help debug and resolve this issue.

The stacktrace is:

DetoxRuntimeError: The pending request #99 ("invoke") has been rejected due to the following error:

The app has crashed, see the details below:

Signal 11 was raised
(
	0   Detox                               0x00000001127a56a5 +[NSThread(DetoxUtils) dtx_demangledCallStackSymbols] + 37
	1   Detox                               0x00000001127a8230 __DTXHandleCrash + 464
	2   Detox                               0x00000001127a8971 __DTXHandleSignal + 59
	3   libsystem_platform.dylib            0x0000000118ef1d7d _sigtramp + 29
	4   ???                                 0x00006000162b0180 0x0 + 105553488183680
	5   DetoxSync                           0x0000000154151b37 -[_DTXTimerTrampoline fire:] + 188
	6   DetoxSync                           0x0000000154139b70 _DTXCFTimerTrampoline + 74
	7   CoreFoundation                      0x00000001156416b6 __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 20
	8   CoreFoundation                      0x00000001156411b2 __CFRunLoopDoTimer + 923
	9   CoreFoundation                      0x0000000115640771 __CFRunLoopDoTimers + 265
	10  CoreFoundation                      0x000000011563adb0 __CFRunLoopRun + 2010
	11  CoreFoundation                      0x000000011563a0f3 CFRunLoopRunSpecific + 567
	12  Shipt                               0x000000010ff87c74 +[RCTCxxBridge runRunLoop] + 281
	13  DetoxSync                           0x000000015414173a swz_runRunLoopThread + 291
	14  Foundation                          0x0000000113539550 __NSThread__start__ + 1025
	15  libsystem_pthread.dylib             0x0000000118e8a8fc _pthread_start + 224
	16  libsystem_pthread.dylib             0x0000000118e86443 thread_start + 15
)
    at _callee4$ (/Users/distiller/nebula/e2e/helpers/helpers.ts:22:42)
    at tryCatch (/Users/distiller/nebula/node_modules/regenerator-runtime/runtime.js:63:40)

List of dependencies present in the project:

	"dependencies": {
		"@babel/preset-typescript": "7.16.7",
		"@bugsnag/react-native": "7.15.1",
		"@formatjs/intl-datetimeformat": "^2.8.3",
		"@formatjs/intl-numberformat": "^5.7.5",
		"@oracle/react-native-pushiomanager": "https://github.com/shamilovtim/pushiomanager-react-native#shipt-next",
		"@react-native-community/async-storage": "^1.6.2",
		"@react-native-community/geolocation": "^2.0.2",
		"@react-native-community/hooks": "2.8.1",
		"@react-native-community/push-notification-ios": "^1.10.1",
		"@react-native-firebase/app": "14.2.4",
		"@react-native-firebase/perf": "14.2.4",
		"@react-navigation/bottom-tabs": "^6.0.9",
		"@react-navigation/native": "^6.0.6",
		"@react-navigation/routers": "^6.1.0",
		"@react-navigation/stack": "^6.0.11",
		"@segment/analytics-react-native": "^1.5.0",
		"@segment/analytics-react-native-firebase": "^1.5.0",
		"accounting-js": "^1.1.1",
		"concurrently": "6.5.1",
		"core-js": "3.20.1",
		"date-fns": "2.7.0",
		"date-fns-tz": "1.0.12",
		"expo": "43.0.4",
		"expo-apple-authentication": "^4.1.0",
		"expo-facebook": "12.0.3",
		"expo-google-sign-in": "10.0.3",
		"expo-linear-gradient": "10.0.3",
		"expo-secure-store": "^11.1.0",
		"expo-tracking-transparency": "2.0.3",
		"graphql": "^15.7.0",
		"graphql-request": "3.7.0",
		"imagemin": "^8.0.1",
		"immer": "9.0.12",
		"inquirer": "^8.2.0",
		"intl": "^1.2.5",
		"invariant": "2.2.4",
		"jsdom": "19.0.0",
		"lodash": "^4.17.21",
		"lottie-ios": "^3.2.3",
		"lottie-react-native": "4.1.3",
		"mailosaur": "^8.1.0",
		"medallia-digital": "https://repository.medallia.com/digital-npm/medallia-digital-rn/medallia-digital-rn-3.8.1.tgz",
		"multi-progress": "4.0.0",
		"nano-memoize": "1.2.1",
		"node-emoji": "^1.11.0",
		"normalizr": "^3.6.1",
		"prop-types": "15.8.0",
		"query-string": "7.0.1",
		"react": "17.0.2",
		"react-native": "0.65.2",
		"react-native-action-sheet": "2.2.0",
		"react-native-android-open-settings": "1.3.0",
		"react-native-branch": "5.3.0",
		"react-native-camera": "4.2.1",
		"react-native-code-push": "^7.0.4",
		"react-native-device-info": "8.4.8",
		"react-native-easing-gradient": "^1.0.0",
		"react-native-fs": "^2.18.0",
		"react-native-geolocation-service": "5.2.0",
		"react-native-gesture-handler": "1.10.3",
		"react-native-get-random-values": "1.7.2",
		"react-native-haptic-feedback": "1.13.0",
		"react-native-html-parser": "^0.1.0",
		"react-native-in-app-review": "^3.2.3",
		"react-native-keyboard-aware-scroll-view": "0.9.5",
		"react-native-mmkv": "1.6.3",
		"react-native-modal": "13.0.0",
		"react-native-progress": "5.0.0",
		"react-native-push-notification": "8.1.1",
		"react-native-reanimated": "1.13.1",
		"react-native-redash": "^15.6.0",
		"react-native-render-html": "4.2.4",
		"react-native-safe-area-context": "3.3.2",
		"react-native-screens": "3.10.1",
		"react-native-share": "7.3.2",
		"react-native-svg": "^12.1.1",
		"react-native-v8": "0.65.1-patch.1",
		"react-native-webview": "11.15.0",
		"react-query": "3.34.6",
		"react-redux": "^7.2.6",
		"reactotron-redux": "^3.1.3",
		"redux": "4.1.2",
		"redux-persist": "^6.0.0",
		"redux-saga": "^1.1.3",
		"reselect": "4.1.5",
		"shelljs": "^0.8.5",
		"sift-react-native": "^0.1.3",
		"tipsi-stripe": "adkenyon/tipsi-stripe#5cab5f303bedc5ffd3fb543b14dc229f5259b0a0",
		"uri-js": "4.4.1",
		"uuid": "^8.3.2",
		"v8-android-jit-nointl": "9.93.0"
	}

Expected behavior

Expect no signal 11

Screenshots / Video

Not necessary right now, will add later if needed.

Environment

  • Detox: 19.4.2
  • React Native: 0.65.2
  • Node: 14.17.3
  • Device: iOS Emulator
  • Xcode: 13.2.1
  • iOS: latest
  • macOS: 11.6
  • Test-runner (select one): jest-circus

Logs

Device and verbose Detox logs

  • [x] I have run my tests using the --loglevel trace argument and am providing the verbose log below:
Detox logs The stacktrace is:
DetoxRuntimeError: The pending request #99 ("invoke") has been rejected due to the following error:

The app has crashed, see the details below:

Signal 11 was raised
(
	0   Detox                               0x00000001127a56a5 +[NSThread(DetoxUtils) dtx_demangledCallStackSymbols] + 37
	1   Detox                               0x00000001127a8230 __DTXHandleCrash + 464
	2   Detox                               0x00000001127a8971 __DTXHandleSignal + 59
	3   libsystem_platform.dylib            0x0000000118ef1d7d _sigtramp + 29
	4   ???                                 0x00006000162b0180 0x0 + 105553488183680
	5   DetoxSync                           0x0000000154151b37 -[_DTXTimerTrampoline fire:] + 188
	6   DetoxSync                           0x0000000154139b70 _DTXCFTimerTrampoline + 74
	7   CoreFoundation                      0x00000001156416b6 __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 20
	8   CoreFoundation                      0x00000001156411b2 __CFRunLoopDoTimer + 923
	9   CoreFoundation                      0x0000000115640771 __CFRunLoopDoTimers + 265
	10  CoreFoundation                      0x000000011563adb0 __CFRunLoopRun + 2010
	11  CoreFoundation                      0x000000011563a0f3 CFRunLoopRunSpecific + 567
	12  Shipt                               0x000000010ff87c74 +[RCTCxxBridge runRunLoop] + 281
	13  DetoxSync                           0x000000015414173a swz_runRunLoopThread + 291
	14  Foundation                          0x0000000113539550 __NSThread__start__ + 1025
	15  libsystem_pthread.dylib             0x0000000118e8a8fc _pthread_start + 224
	16  libsystem_pthread.dylib             0x0000000118e86443 thread_start + 15
)
    at _callee4$ (/Users/distiller/nebula/e2e/helpers/helpers.ts:22:42)
    at tryCatch (/Users/distiller/nebula/node_modules/regenerator-runtime/runtime.js:63:40)

shamilovtim avatar Feb 05 '22 01:02 shamilovtim

Hi @shamilovtim, thanks for the report!

Try replacing NSZombieEnabled=YES detox test -c ios.sim.release with SIMCTL_CHILD_NSZombieEnabled=YES detox test -c ios.sim.release

Also, note that the flag that we pass as a fix for the Firebase-related Signal 11 deallocation errors is GULGeneratedClassDisposeDisabled=YES, which replaced the NSZombieEnabled flag that was passed from Detox before it, (see https://github.com/wix/Detox/pull/3167, GoogleUtilities/#66).

asafkorem avatar Feb 05 '22 21:02 asafkorem

Hey @asafkorem! Just confirmed that we do pass SIMCTL_CHILD_NSZombieEnabled=YES correctly, I had just typed it wrong on the Github ticket. I updated the original ticket to reflect that. Screen Shot 2022-02-05 at 5 12 05 PM

I know that you're passing the more specific flag inside of Detox itself but I was hoping that the Zombie flag might deal with any other method swizzling related crashes. Unfortunately the crashes were still present so the more general Zombie flag did not work. Passing no flag doesn't work either.

shamilovtim avatar Feb 05 '22 22:02 shamilovtim

@shamilovtim ah okay, I see.. Was it working with the zombie-enable workaround before upgrading Detox to the latest version?

asafkorem avatar Feb 06 '22 15:02 asafkorem

  • On December 17 I added SIMCTL_CHILD_NSZombieEnabled=YES to yarn detox test -c ci.ios.release in our project.
  • On December 29 I introduced Detox 19.4.1 to the project, which is supposed to include this fix inside detox instead of me declaring it as an env var and removed SIMCTL_CHILD_NSZombieEnabled=YES from detox test -c. In between this time we still saw signal 11 in the project.
  • After December 29, we continued to see Signal 11.
  • On January 25, we upgraded to Detox 19.4.2 and after that continued to receive signal 11 in the project.

No matter what version of Detox or whether or not we apply SIMCTL_CHILD_NSZombieEnabled=YES Signal 11 continues to live in our project. We used to see it way more often in the past, so it has gone down in frequency. I assume that's because the Firebase fix did remove some instances of it but the other ones are still occurring.

shamilovtim avatar Feb 07 '22 22:02 shamilovtim

@shamilovtim this sounds like the current Signal 11 errors are more Expo related (and aren't coming from Firebase), based on some previous reports and my gut feeling...

Also, can you tell me if you have some other dependency that uses swizzling in iOS?

That would be hard to investigate based on the logs only, so we must have some project I can work with that reproduces this issue. I can try and invest some time soon in testing an Expo + Detox project with the Expo version you have, but if you have something ready that reproduces the crash on the latest Detox version and you can send it me this would be very helpful.

asafkorem avatar Feb 08 '22 11:02 asafkorem

@asafkorem In my opinion the suspects are:

Expo (we don't use expo updates), Stripe (v14 stripe-ios via tipsi stripe), Bugsnag, Codepush, Segment Analytics

Stripe iOS (v14) for sure has method swizzling. I was able to find it by scanning the source code of the Pod.

Flipper uses method swizzling but will not be active during Release builds. (However, I do wonder if Flipper would crash detox debug)?

Excuse my ignorance but is there any way of making Detox play nice with any and all method swizzling rather than the Firebase workaround created in the Google repo?

shamilovtim avatar Feb 08 '22 16:02 shamilovtim

@shamilovtim the problem in Firebase/Performance wasn't in Detox end but in their code (it pretty much assumed that no other framework does dynamic ISA-Swizzling for iOS classes), so it's pretty hard to assume that there is a general solution to the problem from Detox' code.

However, once we'll finish with Detox-iOS transition to XCUITest framework, we may be able to avoid doing ISA swizzling for iOS classes so hopefully it will resolve such issues.

asafkorem avatar Feb 13 '22 09:02 asafkorem

@asafkorem In my opinion the suspects are:

Expo (we don't use expo updates), Stripe (v14 stripe-ios via tipsi stripe), Bugsnag, Codepush, Segment Analytics

Stripe iOS (v14) for sure has method swizzling. I was able to find it by scanning the source code of the Pod.

Flipper uses method swizzling but will not be active during Release builds. (However, I do wonder if Flipper would crash detox debug)?

Excuse my ignorance but is there any way of making Detox play nice with any and all method swizzling rather than the Firebase workaround created in the Google repo?

These definitely sound suspicious 😅 It will be very helpful for us if you'll run DetoxTemplate (or any example RN project) with Detox tests and any of your dependencies that might cause this error, and try to reproduce the bug. I know it might take a lot of time, but it is also hard for me to reproduce this issue with other libraries or frameworks I'm not familiar with.

asafkorem avatar Feb 13 '22 09:02 asafkorem

We're seeing this, I can maybe narrow down a bit as we intersect with some of those dependencies.

We're using bits of Expo and CodePush. (We don't use Stripe, Bugsnag, or Segment Analytics)

We also use reanimated (which I see in your dependency list) and our tests do seem to fail with Signal 11s around about places where there are animations - would that match up with what you're seeing @shamilovtim?

ball-hayden avatar Mar 09 '22 15:03 ball-hayden

We can't really nail down the cause. What version reanimated are you on?

shamilovtim avatar Mar 28 '22 14:03 shamilovtim

We're running react-native-reanimated: ^2.3.1

ball-hayden avatar Mar 28 '22 15:03 ball-hayden

We're on 1.13.1. Makes me skeptical it's reanimated. Could it just be Expo in some way?

shamilovtim avatar Mar 28 '22 15:03 shamilovtim

Possibly, although I can't help but feel it might be animation-timing related given the places we see this are always slightly after an animation has completed.

The places we are seeing this aren't places where we're interacting with Expo directly.

I've tried creating a project based on the Detox Repro Template that has Expo and reanimated in it but I haven't managed to get it to crash yet.

Something I have noticed is that the amount of time it takes Detox to "react" (i.e. continue to the next command) to an animation finishing seems to vary somewhat - see the video below at 19 seconds vs the other repetitions from my attempted reproduction:

https://user-images.githubusercontent.com/1596233/160562283-93f25e5b-da13-4706-bc16-4d654b5b66d6.mov

I also wondered if it might be a race between an animation finishing and a request completing, although again haven't managed to coax the reproduction into crashing.

ball-hayden avatar Mar 29 '22 07:03 ball-hayden

Possibly, although I can't help but feel it might be animation-timing related given the places we see this are always slightly after an animation has completed.

The places we are seeing this aren't places where we're interacting with Expo directly.

I've tried creating a project based on the Detox Repro Template that has Expo and reanimated in it but I haven't managed to get it to crash yet.

Something I have noticed is that the amount of time it takes Detox to "react" (i.e. continue to the next command) to an animation finishing seems to vary somewhat - see the video below at 19 seconds vs the other repetitions from my attempted reproduction:

https://user-images.githubusercontent.com/1596233/160562283-93f25e5b-da13-4706-bc16-4d654b5b66d6.mov

I also wondered if it might be a race between an animation finishing and a request completing, although again haven't managed to coax the reproduction into crashing.

Are you using the template in CI? Do you have your code up somewhere so i can pull it or add to it?

We only get these crashes in CI fyi

shamilovtim avatar Mar 29 '22 12:03 shamilovtim

We only get these crashes in CI

^ that is a point. I don't think I've ever seen it locally.

Fighting some other fires, but I'll get the template pushed somewhere when I can.

ball-hayden avatar Mar 29 '22 16:03 ball-hayden

@asafkorem I wonder if we can add reanimated, gesture handler, expo and react navigation to the official Detox e2e tests? By adding all of the common dependencies it might help to catch some of this crashing. What CI provider do you guys use for the project?

shamilovtim avatar Mar 29 '22 19:03 shamilovtim

@shamilovtim

What CI provider do you guys use for the project?

Buildkite + Macstadium

noomorph avatar Mar 30 '22 07:03 noomorph

@shamilovtim sorry for the delay.

https://github.com/PlayerData/DetoxReanimatedRepro

Just trying to set up GitHub actions to see if that might result in a reproduction.

ball-hayden avatar Apr 08 '22 12:04 ball-hayden

@shamilovtim can I ask, do you run multiple simulators at once in CI (as in, do you use the --max-workers flag)?

ball-hayden avatar Apr 11 '22 10:04 ball-hayden

Running https://github.com/PlayerData/DetoxReanimatedRepro/pull/4 locally, Detox is consistently not waiting for animations to finish for some workers (between 1 and 3).

It doesn't, however, Signal 11

ball-hayden avatar Apr 11 '22 11:04 ball-hayden

To brain dump a bit more - the stack trace takes us to here (specifically, the block):

https://github.com/wix/DetoxSync/blob/f5cd8cbde60311bb8e41df7c773850cd73c8f84b/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m#L120

Could fire be running for a timer that's been disposed, leading to mode or timer being invalid by the time we execute the block?

Sorry to ping @LeoNatan, but I'm waaay out of my depth here - would you have any ideas around this space?

ball-hayden avatar Apr 11 '22 11:04 ball-hayden

@shamilovtim can I ask, do you run multiple simulators at once in CI (as in, do you use the --max-workers flag)?

No we don't. CircleCI doesn't allow hardware acceleration so we are limited performance wise

shamilovtim avatar Apr 11 '22 13:04 shamilovtim

To brain dump a bit more - the stack trace takes us to here (specifically, the block):

https://github.com/wix/DetoxSync/blob/f5cd8cbde60311bb8e41df7c773850cd73c8f84b/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m#L120

Could fire be running for a timer that's been disposed, leading to mode or timer being invalid by the time we execute the block?

Sorry to ping @LeoNatan, but I'm waaay out of my depth here - would you have any ideas around this space?

I wonder if this has something to do with software rendered, slow CI VMs? Maybe the VM is so slow that timer times out and gets GCCed or whatever like you said? Which is why we don't see this on Detox running on bare metal instances

shamilovtim avatar Apr 11 '22 15:04 shamilovtim

Our CI doesn't use VMs but instead a pair of slightly ageing Mac Minis (2018) so the same theory holds.

ball-hayden avatar Apr 11 '22 15:04 ball-hayden

Couple of patches I tried yesterday.

Firstly, this change which guards against the returned proxy being null.

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
index 023eabc..c296644 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
@@ -102,13 +102,16 @@ extern atomic_cfrunloop __RNRunLoop;
 	if(self.isPaused != paused)
 	{
 		id<DTXTimerProxy> proxy = [DTXTimerSyncResource existingTimerProxyWithDisplayLink:self create:NO];
-		if(paused == YES)
-		{
-			[proxy untrack];
-		}
-		else
-		{
-			[self _detox_sync_trackIfNeeded];
+
+		if(proxy) {
+			if(paused == YES)
+			{
+				[proxy untrack];
+			}
+			else
+			{
+				[self _detox_sync_trackIfNeeded];
+			}
 		}
 	}

This didn't help, which kinda makes sense - there should always be a proxy at this point if I've understood properly.

The more interesting change is this one:

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
index 8f5549c..9a82b9e 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
@@ -78,7 +78,7 @@ CFRunLoopTimerRef __detox_sync_CFRunLoopTimerCreateWithHandler(CFAllocatorRef al
 static void (*__orig_CFRunLoopAddTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-//	NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));
+	NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
 	trampoline.runLoop = rl;
@@ -91,7 +91,7 @@ void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CF
 static void (*__orig_CFRunLoopRemoveTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-//	NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));
+	NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
 	[trampoline untrack];
@@ -102,7 +102,7 @@ void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer,
 static void (*__orig_CFRunLoopTimerInvalidate)(CFRunLoopTimerRef timer);
 void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 {
-//	NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));
+	NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
 	[trampoline untrack];
@@ -113,7 +113,7 @@ void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 static void (*__orig___NSCFTimer_invalidate)(NSTimer* timer);
 void __detox_sync___NSCFTimer_invalidate(NSTimer* timer)
 {
-	//	NSLog(@"🤦‍♂️ invalidate: %@", timer);
+    NSLog(@"🤦‍♂️ invalidate: %@", timer);

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:timer];
 	[trampoline untrack];
diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
index 5235fdc..b66d945 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
@@ -162,6 +162,8 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
 		return;
 	}

+    NSLog(@"🤦‍♂️ track: %@", _timer);
+	
 	_tracking = YES;
 	[DTXTimerSyncResource.sharedInstance trackTimerTrampoline:self];
 }
@@ -173,7 +175,7 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
 		return;
 	}

-	//	NSLog(@"🤦‍♂️ untrack: %@", _timer);
+    NSLog(@"🤦‍♂️ untrack: %@", _timer);

 	[DTXTimerSyncResource.sharedInstance untrackTimerTrampoline:self];
 	_tracking = NO;

All this change is doing is re-enabling some of the commented out logging (and adding a little bit extra).

With this change, I wasn't able to reproduce a segfault over around 10 runs (usually, I would have seen at least one if not two in that space).

Presumably, adding the logging is causing a reference to the timer to be retained for longer than it would be otherwise. Either that, or the extra slowness of logging makes the race no longer happen.

ball-hayden avatar Apr 12 '22 07:04 ball-hayden

Couple of patches I tried yesterday.

Firstly, this change which guards against the returned proxy being null.

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
index 023eabc..c296644 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
@@ -102,13 +102,16 @@ extern atomic_cfrunloop __RNRunLoop;
 	if(self.isPaused != paused)
 	{
 		id<DTXTimerProxy> proxy = [DTXTimerSyncResource existingTimerProxyWithDisplayLink:self create:NO];
-		if(paused == YES)
-		{
-			[proxy untrack];
-		}
-		else
-		{
-			[self _detox_sync_trackIfNeeded];
+
+		if(proxy) {
+			if(paused == YES)
+			{
+				[proxy untrack];
+			}
+			else
+			{
+				[self _detox_sync_trackIfNeeded];
+			}
 		}
 	}

This didn't help, which kinda makes sense - there should always be a proxy at this point if I've understood properly.

The more interesting change is this one:

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
index 8f5549c..9a82b9e 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
@@ -78,7 +78,7 @@ CFRunLoopTimerRef __detox_sync_CFRunLoopTimerCreateWithHandler(CFAllocatorRef al
 static void (*__orig_CFRunLoopAddTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-//	NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));
+	NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
 	trampoline.runLoop = rl;
@@ -91,7 +91,7 @@ void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CF
 static void (*__orig_CFRunLoopRemoveTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-//	NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));
+	NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
 	[trampoline untrack];
@@ -102,7 +102,7 @@ void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer,
 static void (*__orig_CFRunLoopTimerInvalidate)(CFRunLoopTimerRef timer);
 void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 {
-//	NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));
+	NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
 	[trampoline untrack];
@@ -113,7 +113,7 @@ void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 static void (*__orig___NSCFTimer_invalidate)(NSTimer* timer);
 void __detox_sync___NSCFTimer_invalidate(NSTimer* timer)
 {
-	//	NSLog(@"🤦‍♂️ invalidate: %@", timer);
+    NSLog(@"🤦‍♂️ invalidate: %@", timer);

 	id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:timer];
 	[trampoline untrack];
diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
index 5235fdc..b66d945 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
@@ -162,6 +162,8 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
 		return;
 	}

+    NSLog(@"🤦‍♂️ track: %@", _timer);
+	
 	_tracking = YES;
 	[DTXTimerSyncResource.sharedInstance trackTimerTrampoline:self];
 }
@@ -173,7 +175,7 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
 		return;
 	}

-	//	NSLog(@"🤦‍♂️ untrack: %@", _timer);
+    NSLog(@"🤦‍♂️ untrack: %@", _timer);

 	[DTXTimerSyncResource.sharedInstance untrackTimerTrampoline:self];
 	_tracking = NO;

All this change is doing is re-enabling some of the commented out logging (and adding a little bit extra).

With this change, I wasn't able to reproduce a segfault over around 10 runs (usually, I would have seen at least one if not two in that space).

Presumably, adding the logging is causing a reference to the timer to be retained for longer than it would be otherwise. Either that, or the extra slowness of logging makes the race no longer happen.

Awesome findings man.

Presumably, adding the logging is causing a reference to the timer to be retained for longer than it would be otherwise

Couldn't you easily prove this hypothesis by setting the timer to a variable rather than using it in the log?

shamilovtim avatar Apr 12 '22 14:04 shamilovtim

I wonder if in all of these methods if we create a new timer from the old one that would be good enough to prevent the reference nulling out from its parent?

shamilovtim avatar Apr 12 '22 14:04 shamilovtim

Couldn't you easily prove this hypothesis by setting the timer to a variable rather than using it in the log?

I am very much not an Objective C programmer, but my assumption is that a local variable wouldn't be enough to prevent GC (from past similar experiences with Swift).

I wonder if we can keep a reference within DTXTimerProxy though?

I didn't have time to look further into this today - tomorrow is a new day, and hopefully I can find time again.

ball-hayden avatar Apr 12 '22 15:04 ball-hayden

This article might help: https://stackoverflow.com/questions/4945028/repeating-nstimer-weak-reference-owning-reference-or-ivar

shamilovtim avatar Apr 12 '22 18:04 shamilovtim

So - turns out I was just "lucky".

Rebuilt the branch with the logging changes today and it did segfault still. I guess the good news is that I now have those logs.

Slightly redacted / filtered version here: device-redacted.log

ball-hayden avatar Apr 13 '22 14:04 ball-hayden