condition_variable related crash on macOS
The ThreadPool class uses a condition_variable to pause worker threads when there is no work, and wake them up again when there is. On macOS there is a (relatively) rare crash that occurs when condition_variable::wait() is called:
[ 00 ] __pthread_kill
[ 01 ] pthread_kill
[ 02 ] 0x7fff6ac8a808
[ 03 ] abort_message
[ 04 ] demangling_terminate_handler()
[ 05 ] initializeNonMetaClass
[ 06 ] std::__terminate(void (*)())
[ 07 ] std::terminate()
[ 08 ] __thread_proxy<…> /Users/runner/work/graphia/graphia/source/shared/utils/threadpool.cpp:56
[ 09 ] _pthread_start
[ 10 ] thread_start
[ 11 ] 0x10148f070
On macOS condition_variable::wait() is basically a wrapper for pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex), which it turns out returns EINVAL in the crash case, triggering an exception, which causes the application to terminate. Incidentally, this is printed to stdout:
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: condition_variable wait failed: Invalid argument
The "official" reasons for EINVAL are as follows:
[EINVAL]
The value specified by cond, mutex, or abstime is invalid.
[EINVAL]
Different mutexes were supplied for concurrent pthread_cond_wait() or pthread_cond_timedwait() operations on the same condition variable.
[EINVAL]
The mutex was not owned by the current thread at the time of the call.
None of these apply.
After spending many, many hours single stepping the debugger through disassembly of macOS's pthread code, I've come to the conclusion that the source of this EINVAL is from here:
https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/src/pthread_cond.c#L647
// SUSv3 requires pthread_cond_wait to be a cancellation point
if (conforming == PTHREAD_CONFORM_UNIX03_CANCELABLE) {
pthread_cleanup_push(_pthread_psynch_cond_cleanup, (void *)cond);
updateval = __psynch_cvwait(cond, cvlsgen, ucntval, (pthread_mutex_t *)npmtx, mugen, flags, (int64_t)(then->tv_sec), (int32_t)(then->tv_nsec));
pthread_testcancel();
pthread_cleanup_pop(0);
} else {
updateval = __psynch_cvwait(cond, cvlsgen, ucntval, (pthread_mutex_t *)npmtx, mugen, flags, (int64_t)(then->tv_sec), (int32_t)(then->tv_nsec));
}
if (updateval == (uint32_t)-1) {
int err = errno;
switch (err & 0xff) {
case ETIMEDOUT:
res = ETIMEDOUT;
break;
case EINTR:
// spurious wakeup (unless canceled)
res = 0;
break;
default:
--> res = EINVAL;
break;
}
The __psynch_cvwait syscall returns -1 and it sets errno to EBUSY. This makes no sense at all as the worker thread obviously already owns the mutex at this point so it can't possibly be busy. FYI, errno ends up in the %esi register (this is /usr/lib/system/libsystem_pthread.dylib on darwin 19.6.0/macOS 10.15.7):
6420: e8 99 28 00 00 callq 0x8cbe ## symbol stub for: ___psynch_cvwait
6425: 48 83 c4 10 addq $0x10, %rsp
6429: 41 89 c5 movl %eax, %r13d
642c: 89 df movl %ebx, %edi
642e: e8 ef fc ff ff callq __pthread_testcancel
6433: 48 8d 45 88 leaq -0x78(%rbp), %rax
6437: 48 8b 40 10 movq 0x10(%rax), %rax
643b: 49 89 47 08 movq %rax, 0x8(%r15)
643f: eb 1d jmp 0x645e
6441: 48 8b 4d c8 movq -0x38(%rbp), %rcx
6445: 4c 89 e7 movq %r12, %rdi
6448: 4c 89 f6 movq %r14, %rsi
644b: 89 da movl %ebx, %edx
644d: 41 53 pushq %r11
644f: ff 75 a8 pushq -0x58(%rbp)
6452: e8 67 28 00 00 callq 0x8cbe ## symbol stub for: ___psynch_cvwait
6457: 48 83 c4 10 addq $0x10, %rsp
645b: 41 89 c5 movl %eax, %r13d
645e: 45 85 ed testl %r13d, %r13d
6461: 48 8b 7d c0 movq -0x40(%rbp), %rdi
6465: 74 4d je 0x64b4
6467: 41 83 fd ff cmpl $-0x1, %r13d
646b: 75 33 jne 0x64a0
646d: e8 22 28 00 00 callq 0x8c94 ## symbol stub for: ___error
--> 6472: 8b 30 movl (%rax), %esi
6474: 31 c0 xorl %eax, %eax
6476: 40 80 fe 04 cmpb $0x4, %sil
647a: b9 16 00 00 00 movl $0x16, %ecx
647f: 0f 44 c8 cmovel %eax, %ecx
6482: 40 80 fe 3c cmpb $0x3c, %sil
6486: 41 bd 3c 00 00 00 movl $0x3c, %r13d
648c: 44 0f 45 e9 cmovnel %ecx, %r13d
6490: 4c 89 e7 movq %r12, %rdi
6493: 31 d2 xorl %edx, %edx
6495: e8 02 d2 ff ff callq __pthread_cond_updateval
The source of the EBUSY is potentially the following, but that's difficult to verify as I obviously can't go stepping into kernel code:
https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/kern/kern_synch.c#L1199
Possibly this is indicative of some kind of starvation state or hitting a buffer limit? I'm not sure. In any case at this point I'm leaning towards it actually being some kind of bug in macOS; I can't obviously see any misuse of threading primitives on my part.
Reproducing the problem basically involves running layout continuously until it happens, using a script to reopen a file periodically to restart the layout. It usually takes about 30-60 minutes to happen.
As a direct response to this issue I've broadened the scope of the task queuing lock so now it only locks and releases once per task set rather than for each individual task (745b0b7c). It's early days, but this appears to have made it harder to reproduce the problem at least (as in, I haven't been able to yet). Nevertheless this change hasn't actually fixed anything that was verified broken in the first place, hence this essay, which I'm writing down in case I need to come back to it and refresh my memory. Hope that's helpful future Tim.
https://probablydance.com/2022/09/17/finding-the-second-bug-in-glibcs-condition-variable/ ???