ITK
ITK copied to clipboard
blocked forever in select() from kwsysProcessWaitForPipe when running under TSan on macOS
I've been working on finding & fixing ITK tests under Thread Sanitizer (TSan). I've tried on 3 Macs:
- intel macOS 10.14
- intel macOS 11.6
- arm64 macos 12.2
On the last one only, many tests block forever with this backtrace:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000018deab5f8 libsystem_kernel.dylib`__select + 8
frame #1: 0x00000001015137c0 itkTestDriver`kwsysProcessWaitForPipe(cp=0x000000010a703200, data=0x0000000000000000, length=0x0000000000000000, wd=0x000000016fdfe850) at ProcessUNIX.c:1227:23
frame #2: 0x0000000101512a90 itkTestDriver`itksysProcess_WaitForData(cp=0x000000010a703200, data=0x0000000000000000, length=0x0000000000000000, userTimeout=0x0000000000000000) at ProcessUNIX.c:1090:11
frame #3: 0x000000010150c498 itkTestDriver`itksysProcess_WaitForExit(cp=0x000000010a703200, userTimeout=0x0000000000000000) at ProcessUNIX.c:1339:20
frame #4: 0x0000000100006458 itkTestDriver`TestDriverInvokeProcess(args=size=6) at itkTestDriver.cxx:196:3
frame #5: 0x0000000100006c40 itkTestDriver`main(argc=7, argv=0x000000016fdff500) at itkTestDriver.cxx:311:14
frame #6: 0x000000010378108c dyld`start + 520
(lldb) fr sel 1
frame #1: 0x00000001015137c0 itkTestDriver`kwsysProcessWaitForPipe(cp=0x000000010a703200, data=0x0000000000000000, length=0x0000000000000000, wd=0x000000016fdfe850) at ProcessUNIX.c:1227:23
1224
1225 /* Run select to block until data are available. Repeat call
1226 until it is not interrupted. */
-> 1227 while (((numReady = select(max + 1, &cp->PipeSet, 0, 0, timeout)) < 0) &&
1228 (errno == EINTR)) {
1229 }
1230
(lldb) p timeout
(kwsysProcessTimeNative *) $0 = NULL
As per man page "If timeout is a null pointer, the select blocks indefinitely".
Not sure if it's due to arm64 or due to macOS 12.x. I don' have intel on macOS 12 handy.
CC @bradking
When this is blocking, has the child process exited? There is supposed to be a SIGCHLD handler that writes one byte to a pipe on which select is waiting. It's a standard UNIX idiom for converting signals into select-able events.
I put a breakpoint at ITK/Modules/ThirdParty/KWSys/src/KWSys/ProcessUNIX.c:2832 but don't hit it. Is that where you mean?
And just tried with my non-TSan build on the same box and I did break on that line.
I wonder if TSan is affecting which thread gets the signal delivery, and whether the handler runs.
It would be nice to separate intel vs arm and macOS 11 vs 12. I do have some Intel + 12 Macs in the office, but none really setup for building ITK (no Xcode, no git, etc.)
I've tried adding some printf statements to look at things like errno, and then the bug usually does not occur. So it's perhaps sensitive to timing.
I've confirmed it also happens on arm64 with macOS 11.6 (same machine, booted from different partition).
BTW an example test is ImageAdaptor2Test
So in fact this happens on both Intel and arm, and both Big Sur and Monterey. I was thrown off originally because I had different build options on different Macs. All the tests that have this issue require BUILD_EXAMPLES enabled. Once I enabled that everywhere, I see the failures everywhere.
I wonder if this would happen on non-Mac OSes too...?
@bradking any additional thoughts here?
My above hypothesis about signal delivery thread selection is my only idea about how TSan breaks this. The code in question is using a long-standard UNIX idiom. If it doesn't work under a given analysis tool, it's the fault of that tool, not the code. If you can narrow it to a minimal test case, then report it to TSan upstream.
Been playing with TSan again... have not debugged this more, but just a note to say it still occurs with today's ITK master and macOS 14 on arm64.
I was going to add that itkIOTransformHDF5TestCompressed and itkIOTransformHDF5TestCompressed also get stuck forever in select(), but after letting them run with huge timeout they actually are not stuck forever, only for half an eternity:
1/2 Test #1456: itkIOTransformHDF5Test ............. Passed 39813.09 sec
2/2 Test #1457: itkIOTransformHDF5TestCompressed ... Passed 39816.98 sec
That's 11 hours each. Without TSan they pass in 1.3 seconds on the same hardware. So some pathological case is being hit here. Next I should try a different OS...
Just retesting some tickets... this still occurs in current master ec7569203a2f9a5c8febba3fad89b1dde5c0cd36.