
condition_variable related crash on macOS

Open · timangus opened this issue · 1 comment

The ThreadPool class uses a condition_variable to pause worker threads when there is no work, and to wake them again when there is. On macOS a (relatively) rare crash occurs when condition_variable::wait() is called:

[ 00 ] __pthread_kill
[ 01 ] pthread_kill
[ 02 ] 0x7fff6ac8a808
[ 03 ] abort_message
[ 04 ] demangling_terminate_handler()
[ 05 ] initializeNonMetaClass
[ 06 ] std::__terminate(void (*)())
[ 07 ] std::terminate()
[ 08 ] __thread_proxy<…> /Users/runner/work/graphia/graphia/source/shared/utils/threadpool.cpp:56
[ 09 ] _pthread_start
[ 10 ] thread_start
[ 11 ] 0x10148f070

On macOS condition_variable::wait() is essentially a wrapper around pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex). It turns out that in the crash case pthread_cond_wait() returns EINVAL, causing libc++ to throw a std::system_error, which propagates uncaught out of the thread function and terminates the application. Incidentally, this is printed to stderr:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: condition_variable wait failed: Invalid argument

The "official" reasons for EINVAL are as follows:

[EINVAL]
    The value specified by cond, mutex, or abstime is invalid.
[EINVAL]
    Different mutexes were supplied for concurrent pthread_cond_wait() or pthread_cond_timedwait() operations on the same condition variable.
[EINVAL]
    The mutex was not owned by the current thread at the time of the call.

None of these apply.
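
For reference, this is the libc++ behaviour being hit: when pthread_cond_wait() returns a non-zero error, condition_variable::wait() throws std::system_error, and if nothing catches it before it unwinds out of the thread function, std::terminate() runs. A sketch of intercepting it (guardedWait is a hypothetical helper, not something Graphia uses; catching the error doesn't fix the underlying problem, it only makes it observable):

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <system_error>

// Wrap condition_variable::wait so a system_error surfaces as a log
// message and a false return, instead of an uncaught exception
// terminating the whole process.
template<typename Predicate>
bool guardedWait(std::condition_variable& cv,
                 std::unique_lock<std::mutex>& lock, Predicate pred)
{
    try
    {
        cv.wait(lock, pred);
        return true;
    }
    catch(const std::system_error& e)
    {
        // In the crash case this would print the EINVAL message
        std::cerr << "condition_variable wait failed: "
                  << e.code().message() << "\n";
        return false;
    }
}
```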

After spending many, many hours single-stepping through the disassembly of macOS's pthread code in the debugger, I've come to the conclusion that the EINVAL originates here:

https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/src/pthread_cond.c#L647

	// SUSv3 requires pthread_cond_wait to be a cancellation point
	if (conforming == PTHREAD_CONFORM_UNIX03_CANCELABLE) {
		pthread_cleanup_push(_pthread_psynch_cond_cleanup, (void *)cond);
		updateval = __psynch_cvwait(cond, cvlsgen, ucntval, (pthread_mutex_t *)npmtx, mugen, flags, (int64_t)(then->tv_sec), (int32_t)(then->tv_nsec));
		pthread_testcancel();
		pthread_cleanup_pop(0);
	} else {
		updateval = __psynch_cvwait(cond, cvlsgen, ucntval, (pthread_mutex_t *)npmtx, mugen, flags, (int64_t)(then->tv_sec), (int32_t)(then->tv_nsec));
	}

	if (updateval == (uint32_t)-1) {
		int err = errno;
		switch (err & 0xff) {
		case ETIMEDOUT:
			res = ETIMEDOUT;
			break;
		case EINTR:
			// spurious wakeup (unless canceled)
			res = 0;
			break;
		default:
-->			res = EINVAL;
			break;
		}

The __psynch_cvwait syscall returns -1 with errno set to EBUSY. This makes no sense at all: the worker thread already owns the mutex at this point, so it can't possibly be busy. FYI, errno ends up in the %esi register (this is /usr/lib/system/libsystem_pthread.dylib on darwin 19.6.0/macOS 10.15.7):

    6420:       e8 99 28 00 00  callq   0x8cbe ## symbol stub for: ___psynch_cvwait
    6425:       48 83 c4 10     addq    $0x10, %rsp
    6429:       41 89 c5        movl    %eax, %r13d
    642c:       89 df   movl    %ebx, %edi
    642e:       e8 ef fc ff ff  callq   __pthread_testcancel
    6433:       48 8d 45 88     leaq    -0x78(%rbp), %rax
    6437:       48 8b 40 10     movq    0x10(%rax), %rax
    643b:       49 89 47 08     movq    %rax, 0x8(%r15)
    643f:       eb 1d   jmp     0x645e
    6441:       48 8b 4d c8     movq    -0x38(%rbp), %rcx
    6445:       4c 89 e7        movq    %r12, %rdi
    6448:       4c 89 f6        movq    %r14, %rsi
    644b:       89 da   movl    %ebx, %edx
    644d:       41 53   pushq   %r11
    644f:       ff 75 a8        pushq   -0x58(%rbp)
    6452:       e8 67 28 00 00  callq   0x8cbe ## symbol stub for: ___psynch_cvwait
    6457:       48 83 c4 10     addq    $0x10, %rsp
    645b:       41 89 c5        movl    %eax, %r13d
    645e:       45 85 ed        testl   %r13d, %r13d
    6461:       48 8b 7d c0     movq    -0x40(%rbp), %rdi
    6465:       74 4d   je      0x64b4
    6467:       41 83 fd ff     cmpl    $-0x1, %r13d
    646b:       75 33   jne     0x64a0
    646d:       e8 22 28 00 00  callq   0x8c94 ## symbol stub for: ___error
--> 6472:       8b 30   movl    (%rax), %esi
    6474:       31 c0   xorl    %eax, %eax
    6476:       40 80 fe 04     cmpb    $0x4, %sil
    647a:       b9 16 00 00 00  movl    $0x16, %ecx
    647f:       0f 44 c8        cmovel  %eax, %ecx
    6482:       40 80 fe 3c     cmpb    $0x3c, %sil
    6486:       41 bd 3c 00 00 00       movl    $0x3c, %r13d
    648c:       44 0f 45 e9     cmovnel %ecx, %r13d
    6490:       4c 89 e7        movq    %r12, %rdi
    6493:       31 d2   xorl    %edx, %edx
    6495:       e8 02 d2 ff ff  callq   __pthread_cond_updateval
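
To make the mapping explicit: the switch in the libpthread snippet quoted earlier treats any errno other than ETIMEDOUT or EINTR as EINVAL, which is exactly why a kernel-side EBUSY surfaces to the caller as EINVAL. The same logic, isolated:

```cpp
#include <cerrno>

// Reproduction of the errno -> result mapping from the
// _pthread_psynch_cond_wait snippet above: anything that isn't
// ETIMEDOUT or EINTR, including the EBUSY observed here, collapses
// into EINVAL.
static int map_cvwait_errno(int err)
{
    switch(err & 0xff)
    {
    case ETIMEDOUT:
        return ETIMEDOUT;
    case EINTR:
        return 0; // treated as a spurious wakeup
    default:
        return EINVAL; // EBUSY ends up here
    }
}
```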

The EBUSY potentially originates from the following, but that's difficult to verify as I obviously can't go stepping into kernel code:

https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/kern/kern_synch.c#L1199

Possibly this indicates some kind of starvation state, or a buffer limit being hit? I'm not sure. In any case, at this point I'm leaning towards it being a bug in macOS itself; I can't see any obvious misuse of threading primitives on my part.

Reproducing the problem basically involves running layout continuously until it happens, using a script that periodically reopens a file to restart the layout. It usually takes about 30-60 minutes.

As a direct response to this issue I've broadened the scope of the task queuing lock, so it now locks and releases once per task set rather than once per individual task (745b0b7c). It's early days, but this at least appears to have made the problem harder to reproduce (as in, I haven't been able to yet). Nevertheless, this change hasn't actually fixed anything that was verifiably broken in the first place, hence this essay, written down in case I need to come back to it and refresh my memory. Hope that's helpful, future Tim.
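
In essence, the 745b0b7c change takes the queue lock once per batch rather than once per task. A rough sketch of the before/after shapes (free functions and names here are illustrative, not the actual Graphia code):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

using Task = std::function<void()>;

std::mutex queueMutex;
std::condition_variable waitForWork;
std::queue<Task> taskQueue;

// Before: the lock is acquired and released once per task, so worker
// wakeups interleave with every individual push
void enqueuePerTask(const std::vector<Task>& tasks)
{
    for(const auto& task : tasks)
    {
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            taskQueue.push(task);
        }
        waitForWork.notify_one();
    }
}

// After, roughly: one lock/unlock per task set, followed by a
// single broadcast to the workers
void enqueuePerBatch(const std::vector<Task>& tasks)
{
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        for(const auto& task : tasks)
            taskQueue.push(task);
    }
    waitForWork.notify_all();
}
```

This reduces the amount of lock and condition-variable churn per task set considerably, which plausibly explains why it makes the crash harder to reproduce without necessarily fixing its root cause.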

timangus · Nov 03 '21 19:11

https://probablydance.com/2022/09/17/finding-the-second-bug-in-glibcs-condition-variable/ ???

timangus · Sep 20 '22 15:09