ObjC-CollapsingFutures
ObjC-CollapsingFutures copied to clipboard
Crashing on setting _cancelledOnCompletedSource_ClearedOnSet to nil.
The following code often crashes on Signal.
TOCFutureSource* futureResultSource = [TOCFutureSource futureSourceUntil:untilCancelledToken];
PacketHandlerBlock packetHandler = ^(id packet) {
if (![futureResultSource trySetResult:packet]) {;
errorHandler([IgnoredPacketFailure new:@"Received another packet before relay socket events redirected to new handler."], packet, false);
}
};
Crash stack traces are showing that it crashes inside TOCFutureSource _tryComplete:succeeded:
Thread 2 name:
Thread 2 Crashed:
0 libobjc.A.dylib 0x00000001938f0174 objc_release + 20 (objc-object.h:419)
1 Signal 0x0000000100289be0 -[TOCFutureSource _tryComplete:succeeded:] + 84 (TOCFutureAndSource.m:386)
2 Signal 0x0000000100289c8c -[TOCFutureSource trySetResult:] + 144 (TOCFutureAndSource.m:396)
3 Signal 0x000000010002ef38 __102+[CallConnectUtil_Server asyncFirstPacketReceivedAfterStartingSocket:untilCancelled:withErrorHandler:]_block_invoke + 52 (CallConnectUtil_Server.m:183)
4 Signal 0x00000001000d53e0 -[PacketHandler handlePacket:] + 28 (PacketHandler.m:21)
5 Signal 0x00000001000817d0 -[UdpSocket onReceivedData:withEventType:from:] + 428 (UdpSocket.m:114)
6 Signal 0x00000001000815dc onReceivedData + 140 (UdpSocket.m:92)
7 CoreFoundation 0x000000018206f594 __CFSocketPerformV0 + 768 (CFSocket.c:2862)
8 CoreFoundation 0x000000018206c23c __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 20 (CFRunLoop.c:1763)
9 CoreFoundation 0x000000018206b4e0 __CFRunLoopDoSources0 + 260 (CFRunLoop.c:1809)
10 CoreFoundation 0x0000000182069590 __CFRunLoopRun + 708 (CFRunLoop.c:2526)
11 CoreFoundation 0x0000000181f952d0 CFRunLoopRunSpecific + 392 (CFRunLoop.c:2795)
12 Foundation 0x0000000182ecd4c4 -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 312 (NSRunLoop.m:366)
13 Foundation 0x0000000182eefa5c -[NSRunLoop(NSRunLoop) runUntilDate:] + 104 (NSRunLoop.m:407)
14 Signal 0x0000000100084fc0 -[RunningThreadRunLoopPair runLoopUntilCancelled] + 212 (ThreadManager.m:33)
15 Foundation 0x0000000182fb5db4 __NSThread__main__ + 1068 (NSThread.m:1113)
16 libsystem_pthread.dylib 0x000000019410bdc4 _pthread_body + 160 (pthread.c:644)
17 libsystem_pthread.dylib 0x000000019410bd20 _pthread_start + 156 (pthread.c:680)
18 libsystem_pthread.dylib 0x0000000194108ef4 thread_start + 0 (pthread_asm.s:188)
which refers to
-(bool) _tryComplete:(id)value succeeded:(bool)succeeded {
bool didSet = [future _ForSource_tryComplete:value succeeded:succeeded];
if (!didSet) return false;
[_cancelledOnCompletedSource_ClearedOnSet cancel];
_cancelledOnCompletedSource_ClearedOnSet = nil;
return true;
}
My assumption is that _cancelledOnCompletedSource_ClearedOnSet is already deallocated at that point. Thoughts @Strilanc?
I'm not sure how it could be deallocated at that point.
When a new future source is created, a new cancel token source is created and assigned to the future source's _cancelledOnCompletedSource_ClearedOnSet field. Nothing else holds that reference. In order for it to be deallocated, either one of the completion methods (which clear the field) needs to be called, or the future source needs to be deallocated. Unless the future source is losing references in callbacks in a re-entrant way, and that makes the reference count hit zero despite the existence of a reference on the stack, I'm befuddled.
Maybe an ARC flag is set incorrectly somewhere?
I think I need more context about the failure to solve this. Are you able to trigger this failure at will? If so then the first order of business is to barf NSLogs all over the place, between every line of the _tryComplete: method and also in the dealloc methods of TOCFutureSource, TOCFuture, TOCCancelToken, and TOCCancelTokenSource.
-(bool) _tryComplete:(id)value succeeded:(bool)succeeded {
NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 1", self, value, @(succeeded));
bool didSet = [future _ForSource_tryComplete:value succeeded:succeeded];
NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 2", self, value, @(succeeded));
if (!didSet) return false;
NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 3", self, value, @(succeeded));
[_cancelledOnCompletedSource_ClearedOnSet cancel];
NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 4", self, value, @(succeeded));
_cancelledOnCompletedSource_ClearedOnSet = nil;
NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 5", self, value, @(succeeded));
return true;
}
That will tell us if the dealloc is somehow happening during the method. Note that, if this is an ARC-related bug, introducing code using "self" may mask it, so it's also interesting if it becomes impossible to repro when the logging is there.
No, I'm sadly unable to trigger this at will. I'm unsure what triggers it but it crash logs like the one above are present in very large numbers, all having a pretty similar trace. The fact that the trace reveals that objc_release is crashing the app, my assumption would be that _cancelledOnCompletedSource_ClearedOnSet is deallocated before reaching that point.
This seems to be a quite common crash. Would like to fix this this week.
@Strilanc: Any ideas on a fix?
I stared at it for half an hour, bouncing around the code looking for a stupid mistake, and didn't come up with anything other than "Concurrency is hard. Simulator may not be able to trigger it.".
What other properties does the crash have? Did it only start happening recently? Does it only happen on a certain version of iOS, or on a certain type of iphone?
Do you happen have debug code lying around to put signal into a connect/disconnect loop?
@Strilanc: No, as far as the crash logs I have are going, it's been there since v1. But at the time it was not looked into it because other crashes happened more frequently.
The concurrency clearly adds to the difficulty of debugging.
Is there any other debug information, like stack traces of concurrently running threads or the types of the last couple packets / next couple packets?
I just emailed you a few symbolicated crash logs.
I didn't find anything helpful in the concurrently running threads. Did you?
Not yet, but I also haven't run anything yet.
I went to start testing, but Signal kept crashing instead of running. I tracked it down to a bunch of interesting ways that Signal trashes its own data when you run the tests.
Hopefully this isn't too hard to reproduce.
This is still happening with Signal in the wild, though I've never reproduced it locally.
One hint is that we're not using the vanilla TOCFuture init, but rather
TOCFutureSource *futureResultSource = [TOCFutureSource futureSourceUntil:untilCancelledToken];
Which is invoked through our retriable future operation. We're using futures throughout the app, but this is the only invocation leading to crashes.
TOCUntilOperation operation = ^(TOCCancelToken *internalUntilCancelledToken) {
return [self asyncAttemptResolveThenConnectToUdpRelayDescribedBy:sessionDescriptor
untilCancelled:internalUntilCancelledToken
withErrorHandler:callController.errorHandler];
};
TOCFuture *futureRelayedUdpSocket = [TOCFuture retry:[TOCFuture operationTry:operation]
upToNTimes:MAX_TRY_COUNT
withBaseTimeout:BASE_TIMEOUT_SECONDS
andRetryFactor:RETRY_TIMEOUT_FACTOR
untilCancelled:[callController untilCancelledToken]];
And our retriable future implementation:
+ (TOCFuture *)retry:(TOCUntilOperation)operation
upToNTimes:(NSUInteger)maxTryCount
withBaseTimeout:(NSTimeInterval)baseTimeout
andRetryFactor:(NSTimeInterval)timeoutRetryFactor
untilCancelled:(TOCCancelToken *)untilCancelledToken {
ows_require(operation != nil);
ows_require(maxTryCount >= 0);
ows_require(baseTimeout >= 0);
ows_require(timeoutRetryFactor >= 0);
if (maxTryCount == 0)
return TOCFuture.futureWithTimeoutFailure;
TOCFuture *futureResult =
[TOCFuture futureFromUntilOperation:operation withOperationTimeout:baseTimeout until:untilCancelledToken];
return [futureResult catchTry:^(id error) {
bool operationCancelled = untilCancelledToken.isAlreadyCancelled;
bool operationDidNotTimeout = !futureResult.hasFailedWithTimeout;
if (operationCancelled || operationDidNotTimeout) {
return [TOCFuture futureWithFailure:error];
}
return [self retry:operation
upToNTimes:maxTryCount - 1
withBaseTimeout:baseTimeout * timeoutRetryFactor
andRetryFactor:timeoutRetryFactor
untilCancelled:untilCancelledToken];
}];
}
Thanks, that's useful information. I'll take another stab at reproducing the issue.
A self-contained repro would be ideal, of course. Concurrency bugs are extremely difficult to reproduce, because they can depend on context like "this ran on the networking thread" or "the user was using processor version Y". It's hard to know what to try next, when staring a blank wall of "works in every situation I've tried".
It's hard to know what to try next, when staring a blank wall of "works in every situation I've tried".
Tell me about it.
I'm working on reproducing and I'll give you an update if I succeed.
@michaelkirk I noticed you forked the code. Let me know if you have basic question of how something works that's holding you up. Or if you figure out how to repro this thing!
Obtuse question, but in the first place, why are we nilling that reference explicitly vs just letting it happen on dealloc?
@michaelkirk I don't remember a reason for it to be necessary for correctness, as opposed to saving a bit of memory.
However, I do find it mentally convenient to nil. We're in a context where it's possible to create cycles (e.g. future sources containing callbacks to complete each other), and niling the references means I don't need to carefully consider if it might be keeping the source that holds it alive.
@michaelkirk Although, looking at it now, I'm a bit worried that when there's an exception the field won't be cleared and that could cause a memory leak if the case wasn't carefully reasoned through at the time.