ITK icon indicating copy to clipboard operation
ITK copied to clipboard

blocked forever in select() from kwsysProcessWaitForPipe when running under TSan on macOS

Open seanm opened this issue 3 years ago • 13 comments

I've been working on finding & fixing ITK tests under Thread Sanitizer (TSan). I've tried on 3 Macs:

  1. intel macOS 10.14
  2. intel macOS 11.6
  3. arm64 macos 12.2

On the last one only, many tests block forever with this backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018deab5f8 libsystem_kernel.dylib`__select + 8
    frame #1: 0x00000001015137c0 itkTestDriver`kwsysProcessWaitForPipe(cp=0x000000010a703200, data=0x0000000000000000, length=0x0000000000000000, wd=0x000000016fdfe850) at ProcessUNIX.c:1227:23
    frame #2: 0x0000000101512a90 itkTestDriver`itksysProcess_WaitForData(cp=0x000000010a703200, data=0x0000000000000000, length=0x0000000000000000, userTimeout=0x0000000000000000) at ProcessUNIX.c:1090:11
    frame #3: 0x000000010150c498 itkTestDriver`itksysProcess_WaitForExit(cp=0x000000010a703200, userTimeout=0x0000000000000000) at ProcessUNIX.c:1339:20
    frame #4: 0x0000000100006458 itkTestDriver`TestDriverInvokeProcess(args=size=6) at itkTestDriver.cxx:196:3
    frame #5: 0x0000000100006c40 itkTestDriver`main(argc=7, argv=0x000000016fdff500) at itkTestDriver.cxx:311:14
    frame #6: 0x000000010378108c dyld`start + 520

(lldb) fr sel 1
frame #1: 0x00000001015137c0 itkTestDriver`kwsysProcessWaitForPipe(cp=0x000000010a703200, data=0x0000000000000000, length=0x0000000000000000, wd=0x000000016fdfe850) at ProcessUNIX.c:1227:23
   1224	
   1225	  /* Run select to block until data are available.  Repeat call
   1226	     until it is not interrupted.  */
-> 1227	  while (((numReady = select(max + 1, &cp->PipeSet, 0, 0, timeout)) < 0) &&
   1228	         (errno == EINTR)) {
   1229	  }
   1230	

(lldb) p timeout
(kwsysProcessTimeNative *) $0 = NULL

As per man page "If timeout is a null pointer, the select blocks indefinitely".

Not sure if it's due to arm64 or due to macOS 12.x. I don' have intel on macOS 12 handy.

seanm avatar Feb 18 '22 00:02 seanm

CC @bradking

seanm avatar Feb 18 '22 00:02 seanm

When this is blocking, has the child process exited? There is supposed to be a SIGCHLD handler that writes one byte to a pipe on which select is waiting. It's a standard UNIX idiom for converting signals into select-able events.

bradking avatar Feb 18 '22 15:02 bradking

I put a breakpoint at ITK/Modules/ThirdParty/KWSys/src/KWSys/ProcessUNIX.c:2832 but don't hit it. Is that where you mean?

seanm avatar Feb 18 '22 16:02 seanm

And just tried with my non-TSan build on the same box and I did break on that line.

seanm avatar Feb 18 '22 16:02 seanm

I wonder if TSan is affecting which thread gets the signal delivery, and whether the handler runs.

bradking avatar Feb 18 '22 16:02 bradking

It would be nice to separate intel vs arm and macOS 11 vs 12. I do have some Intel + 12 Macs in the office, but none really setup for building ITK (no Xcode, no git, etc.)

I've tried adding some printf statements to look at things like errno, and then the bug usually does not occur. So it's perhaps sensitive to timing.

seanm avatar Feb 18 '22 16:02 seanm

I've confirmed it also happens on arm64 with macOS 11.6 (same machine, booted from different partition).

BTW an example test is ImageAdaptor2Test

seanm avatar Feb 18 '22 22:02 seanm

So in fact this happens on both Intel and arm, and both Big Sur and Monterey. I was thrown off originally because I had different build options on different Macs. All the tests that have this issue require BUILD_EXAMPLES enabled. Once I enabled that everywhere, I see the failures everywhere.

I wonder if this would happen on non-Mac OSes too...?

seanm avatar Feb 22 '22 19:02 seanm

@bradking any additional thoughts here?

seanm avatar Mar 16 '22 22:03 seanm

My above hypothesis about signal delivery thread selection is my only idea about how TSan breaks this. The code in question is using a long-standard UNIX idiom. If it doesn't work under a given analysis tool, it's the fault of that tool, not the code. If you can narrow it to a minimal test case, then report it to TSan upstream.

bradking avatar Mar 17 '22 12:03 bradking

Been playing with TSan again... have not debugged this more, but just a note to say it still occurs with today's ITK master and macOS 14 on arm64.

seanm avatar Jun 03 '24 23:06 seanm

I was going to add that itkIOTransformHDF5TestCompressed and itkIOTransformHDF5TestCompressed also get stuck forever in select(), but after letting them run with huge timeout they actually are not stuck forever, only for half an eternity:

1/2 Test #1456: itkIOTransformHDF5Test .............   Passed  39813.09 sec
2/2 Test #1457: itkIOTransformHDF5TestCompressed ...   Passed  39816.98 sec

That's 11 hours each. Without TSan they pass in 1.3 seconds on the same hardware. So some pathological case is being hit here. Next I should try a different OS...

seanm avatar Jun 06 '24 15:06 seanm

Just retesting some tickets... this still occurs in current master ec7569203a2f9a5c8febba3fad89b1dde5c0cd36.

seanm avatar Jan 23 '25 17:01 seanm