ObjC-CollapsingFutures Crashing on setting _cancelledOnCompletedSource

The following code often crashes on Signal.


  TOCFutureSource* futureResultSource = [TOCFutureSource futureSourceUntil:untilCancelledToken];

    PacketHandlerBlock packetHandler = ^(id packet) {
        if (![futureResultSource trySetResult:packet]) {;
            errorHandler([IgnoredPacketFailure new:@"Received another packet before relay socket events redirected to new handler."], packet, false);
        }
    };

Crash stack traces are showing that it crashes inside TOCFutureSource _tryComplete:succeeded:

Thread 2 name:
Thread 2 Crashed:
0   libobjc.A.dylib                 0x00000001938f0174 objc_release + 20 (objc-object.h:419)
1   Signal                          0x0000000100289be0 -[TOCFutureSource _tryComplete:succeeded:] + 84 (TOCFutureAndSource.m:386)
2   Signal                          0x0000000100289c8c -[TOCFutureSource trySetResult:] + 144 (TOCFutureAndSource.m:396)
3   Signal                          0x000000010002ef38 __102+[CallConnectUtil_Server asyncFirstPacketReceivedAfterStartingSocket:untilCancelled:withErrorHandler:]_block_invoke + 52 (CallConnectUtil_Server.m:183)
4   Signal                          0x00000001000d53e0 -[PacketHandler handlePacket:] + 28 (PacketHandler.m:21)
5   Signal                          0x00000001000817d0 -[UdpSocket onReceivedData:withEventType:from:] + 428 (UdpSocket.m:114)
6   Signal                          0x00000001000815dc onReceivedData + 140 (UdpSocket.m:92)
7   CoreFoundation                  0x000000018206f594 __CFSocketPerformV0 + 768 (CFSocket.c:2862)
8   CoreFoundation                  0x000000018206c23c __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 20 (CFRunLoop.c:1763)
9   CoreFoundation                  0x000000018206b4e0 __CFRunLoopDoSources0 + 260 (CFRunLoop.c:1809)
10  CoreFoundation                  0x0000000182069590 __CFRunLoopRun + 708 (CFRunLoop.c:2526)
11  CoreFoundation                  0x0000000181f952d0 CFRunLoopRunSpecific + 392 (CFRunLoop.c:2795)
12  Foundation                      0x0000000182ecd4c4 -[NSRunLoop(NSRunLoop) runMode:beforeDate:] + 312 (NSRunLoop.m:366)
13  Foundation                      0x0000000182eefa5c -[NSRunLoop(NSRunLoop) runUntilDate:] + 104 (NSRunLoop.m:407)
14  Signal                          0x0000000100084fc0 -[RunningThreadRunLoopPair runLoopUntilCancelled] + 212 (ThreadManager.m:33)
15  Foundation                      0x0000000182fb5db4 __NSThread__main__ + 1068 (NSThread.m:1113)
16  libsystem_pthread.dylib         0x000000019410bdc4 _pthread_body + 160 (pthread.c:644)
17  libsystem_pthread.dylib         0x000000019410bd20 _pthread_start + 156 (pthread.c:680)
18  libsystem_pthread.dylib         0x0000000194108ef4 thread_start + 0 (pthread_asm.s:188)

which refers to

-(bool) _tryComplete:(id)value succeeded:(bool)succeeded {
    bool didSet = [future _ForSource_tryComplete:value succeeded:succeeded];
    if (!didSet) return false;

    [_cancelledOnCompletedSource_ClearedOnSet cancel];
    _cancelledOnCompletedSource_ClearedOnSet = nil;
    return true;
}

My assumption is that _cancelledOnCompletedSource_ClearedOnSet is already deallocated at that point. Thoughts @Strilanc?

Jun 19 '15 11:06 FredericJacobs

I'm not sure how it could be deallocated at that point.

When a new future source is created, a new cancel token source is created and assigned to the future source's _cancelledOnCompletedSource_ClearedOnSet field. Nothing else holds that reference. In order for it to be deallocated, either one of the completion methods (which clear the field) needs to be called, or the future source needs to be deallocated. Unless the future source is losing references in callbacks in a re-entrant way, and that makes the reference count hit zero despite the existence of a reference on the stack, I'm befuddled.

Maybe an ARC flag is set incorrectly somewhere?

I think I need more context about the failure to solve this. Are you able to trigger this failure at will? If so then the first order of business is to barf NSLogs all over the place, between every line of the _tryComplete: method and also in the dealloc methods of TOCFutureSource, TOCFuture, TOCCancelToken, and TOCCancelTokenSource.

-(bool) _tryComplete:(id)value succeeded:(bool)succeeded {
    NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 1", self, value, @(succeeded));
    bool didSet = [future _ForSource_tryComplete:value succeeded:succeeded];
    NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 2", self, value, @(succeeded));
    if (!didSet) return false;
    NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 3", self, value, @(succeeded));

    [_cancelledOnCompletedSource_ClearedOnSet cancel];
    NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 4", self, value, @(succeeded));
    _cancelledOnCompletedSource_ClearedOnSet = nil;
    NSLog(@"[%@ _tryComplete:%@ succeeded:$@] 5", self, value, @(succeeded));
    return true;
}

That will tell us if the dealloc is somehow happening during the method. Note that, if this is an ARC-related bug, introducing code using "self" may mask it, so it's also interesting if it becomes impossible to repro when the logging is there.

Jun 19 '15 15:06 Strilanc

No, I'm sadly unable to trigger this at will. I'm unsure what triggers it but it crash logs like the one above are present in very large numbers, all having a pretty similar trace. The fact that the trace reveals that objc_release is crashing the app, my assumption would be that _cancelledOnCompletedSource_ClearedOnSet is deallocated before reaching that point.

Jun 19 '15 21:06 FredericJacobs

This seems to be a quite common crash. Would like to fix this this week.

@Strilanc: Any ideas on a fix?

Jun 22 '15 09:06 FredericJacobs

I stared at it for half an hour, bouncing around the code looking for a stupid mistake, and didn't come up with anything other than "Concurrency is hard. Simulator may not be able to trigger it.".

What other properties does the crash have? Did it only start happening recently? Does it only happen on a certain version of iOS, or on a certain type of iphone?

Do you happen have debug code lying around to put signal into a connect/disconnect loop?

Jun 22 '15 11:06 Strilanc

@Strilanc: No, as far as the crash logs I have are going, it's been there since v1. But at the time it was not looked into it because other crashes happened more frequently.

The concurrency clearly adds to the difficulty of debugging.

Jun 22 '15 11:06 FredericJacobs

Is there any other debug information, like stack traces of concurrently running threads or the types of the last couple packets / next couple packets?

Jun 23 '15 10:06 Strilanc

I just emailed you a few symbolicated crash logs.

Jun 23 '15 11:06 FredericJacobs

I didn't find anything helpful in the concurrently running threads. Did you?

Jun 24 '15 09:06 FredericJacobs

Not yet, but I also haven't run anything yet.

I went to start testing, but Signal kept crashing instead of running. I tracked it down to a bunch of interesting ways that Signal trashes its own data when you run the tests.

Hopefully this isn't too hard to reproduce.

Jun 25 '15 11:06 Strilanc

This is still happening with Signal in the wild, though I've never reproduced it locally.

One hint is that we're not using the vanilla TOCFuture init, but rather

TOCFutureSource *futureResultSource = [TOCFutureSource futureSourceUntil:untilCancelledToken];

Which is invoked through our retriable future operation. We're using futures throughout the app, but this is the only invocation leading to crashes.

   TOCUntilOperation operation = ^(TOCCancelToken *internalUntilCancelledToken) {
      return [self asyncAttemptResolveThenConnectToUdpRelayDescribedBy:sessionDescriptor
                                                        untilCancelled:internalUntilCancelledToken
                                                      withErrorHandler:callController.errorHandler];
    };

    TOCFuture *futureRelayedUdpSocket = [TOCFuture retry:[TOCFuture operationTry:operation]
                                              upToNTimes:MAX_TRY_COUNT
                                         withBaseTimeout:BASE_TIMEOUT_SECONDS
                                          andRetryFactor:RETRY_TIMEOUT_FACTOR
                                          untilCancelled:[callController untilCancelledToken]];

And our retriable future implementation:

+ (TOCFuture *)retry:(TOCUntilOperation)operation
          upToNTimes:(NSUInteger)maxTryCount
     withBaseTimeout:(NSTimeInterval)baseTimeout
      andRetryFactor:(NSTimeInterval)timeoutRetryFactor
      untilCancelled:(TOCCancelToken *)untilCancelledToken {
    ows_require(operation != nil);
    ows_require(maxTryCount >= 0);
    ows_require(baseTimeout >= 0);
    ows_require(timeoutRetryFactor >= 0);

    if (maxTryCount == 0)
        return TOCFuture.futureWithTimeoutFailure;

    TOCFuture *futureResult =
        [TOCFuture futureFromUntilOperation:operation withOperationTimeout:baseTimeout until:untilCancelledToken];

    return [futureResult catchTry:^(id error) {
      bool operationCancelled     = untilCancelledToken.isAlreadyCancelled;
      bool operationDidNotTimeout = !futureResult.hasFailedWithTimeout;
      if (operationCancelled || operationDidNotTimeout) {
          return [TOCFuture futureWithFailure:error];
      }

      return [self retry:operation
               upToNTimes:maxTryCount - 1
          withBaseTimeout:baseTimeout * timeoutRetryFactor
           andRetryFactor:timeoutRetryFactor
           untilCancelled:untilCancelledToken];
    }];
}

Jul 05 '16 17:07 michaelkirk

Thanks, that's useful information. I'll take another stab at reproducing the issue.

A self-contained repro would be ideal, of course. Concurrency bugs are extremely difficult to reproduce, because they can depend on context like "this ran on the networking thread" or "the user was using processor version Y". It's hard to know what to try next, when staring a blank wall of "works in every situation I've tried".

Jul 05 '16 20:07 Strilanc

It's hard to know what to try next, when staring a blank wall of "works in every situation I've tried".

Tell me about it.

I'm working on reproducing and I'll give you an update if I succeed.

Jul 05 '16 20:07 michaelkirk

@michaelkirk I noticed you forked the code. Let me know if you have basic question of how something works that's holding you up. Or if you figure out how to repro this thing!

Oct 24 '16 15:10 Strilanc

Obtuse question, but in the first place, why are we nilling that reference explicitly vs just letting it happen on dealloc?

Oct 31 '16 17:10 michaelkirk

@michaelkirk I don't remember a reason for it to be necessary for correctness, as opposed to saving a bit of memory.

However, I do find it mentally convenient to nil. We're in a context where it's possible to create cycles (e.g. future sources containing callbacks to complete each other), and niling the references means I don't need to carefully consider if it might be keeping the source that holds it alive.

Oct 31 '16 20:10 Strilanc

@michaelkirk Although, looking at it now, I'm a bit worried that when there's an exception the field won't be cleared and that could cause a memory leak if the case wasn't carefully reasoned through at the time.

Oct 31 '16 20:10 Strilanc

ObjC-CollapsingFutures
ObjC-CollapsingFutures copied to clipboard

Crashing on setting _cancelledOnCompletedSource_ClearedOnSet to nil.

ObjC-CollapsingFutures ObjC-CollapsingFutures copied to clipboard

Crashing on setting _cancelledOnCompletedSource_ClearedOnSet to nil.

ObjC-CollapsingFutures
ObjC-CollapsingFutures copied to clipboard