Implement CmiTLS for arm64 and ppc64le
- [x] Linux arm64/aarch64/arm8 verified correct by CI
- [ ] MPI arm64 is failing?
- [ ] ppc64le does not work for unknown reasons. This might be caused by alignment, the displacement value, or the assembly in setTLS(); see the sketch after this list.
- [x] 32-bit arm7 will not work because the segment pointer register is read-only in user mode. Symptom: crash in setTLS, SIGILL (illegal instruction)
- [x] macOS ARM64
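For reference, a rough sketch of where the ppc64le suspects live; the helper name is hypothetical and this is illustrative, not the setTLS()/getTLS() under review:

```cpp
// Illustrative sketch only, not the actual code under review. On ppc64le the
// thread pointer lives in r13, and the ELF TLS ABI biases it by a fixed
// 0x7000-byte displacement relative to the thread control block; getting that
// displacement or the segment's alignment wrong corrupts every TLS access,
// which is exactly the class of suspect listed above.
#if defined(__powerpc64__)
static inline void *getThreadPointer() // hypothetical helper
{
  void *tp;
  __asm__ volatile("mr %0, 13" : "=r"(tp)); // copy the thread pointer (r13)
  return tp;
}
#endif
```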
This pull request fixes 1 alert when merging 1858b4880cb5193889189fd7f88b63605f867328 into 99e9a26c44c1fc885f97ff0834e2eeac59910316 - view on LGTM.com
fixed alerts:
- 1 for FIXME comment
This pull request fixes 1 alert when merging fee41644f86dad3286863dbfed2bb587c946eb4d into 99e9a26c44c1fc885f97ff0834e2eeac59910316 - view on LGTM.com
fixed alerts:
- 1 for FIXME comment
Should we merge this patch and open an issue for TLS support on ppc64le (if there isn't already one)?
> Should we merge this patch and open an issue for TLS support on ppc64le (if there isn't already one)?

Not yet, because the MPI-ARM64 CI is failing now.
> Not yet, because the MPI-ARM64 CI is failing now.
This looks like a different error, so perhaps it's unrelated, but the MPI-ARM64 build has been very flaky for me today, often just not starting at all.
It hangs in the tlsglobals test. That's a blocker.
Perhaps we should split this PR: merge the mac-arm64 support now, then open a separate issue/PR for mpi-arm64 and ppc64le support.
Merging Apple Silicon support first is a good idea.
It is unfortunate that ARM and POWER CIs through Travis are no longer available to us.
I've been able to look into the mpi-linux-arm8 issue under Asahi Linux on an M1 Mac. A null pointer is dereferenced inside libc during a printf call from the tlsglobals test. It can also happen on netlrts. There are similarities with #1858, #2814, and #2932, but none close enough to be helpful.
My TLS implementation may not be correct, or there may be some issue related to how we privatize the entire thread-local storage of a process and not just the MPI user program.
I've updated this PR with the macOS parts split out (now merged) and the docs updated. No movement on the failures.
I've debugged the AArch64 failures more and made the following observations:
- If I remove all (f)printf statements from the privatization test, all tests pass. This even includes pieglobals with tlsglobals integrated! (#3567)
- If I keep the statements but instead enable `CMI_IO_BUFFER_EXPLICIT`, which manually calls `setvbuf` with a buffer the runtime manages, the tests pass on the MPI layer! A sketch of the `setvbuf` arrangement follows this list.
- I suspect this would resolve netlrts too, but `CMI_IO_BUFFER_EXPLICIT` is disabled by `CMK_CMIPRINTF_IS_A_BUILTIN`, which machine layers using the C++ Charmrun use. I see no fundamental conflict; it would just need a little refactoring to support both.
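For context, here is a minimal sketch of the kind of explicit stdio buffering `CMI_IO_BUFFER_EXPLICIT` enables; the helper name, buffer size, and wiring are my own illustration, not the actual Charm++ code:

```cpp
// Minimal sketch, not the actual Charm++ code: hand stdout a buffer the
// runtime owns, so buffered printf output no longer depends on libc's
// internal per-stream allocation.
#include <cstddef>
#include <cstdio>
#include <cstdlib>

static const std::size_t CMI_IO_BUFFER_SIZE = 1 << 16; // size is an assumption
static char *cmi_stdout_buffer;                        // hypothetical name

static void CmiIOBufferInit() // hypothetical helper
{
  cmi_stdout_buffer = (char *)std::malloc(CMI_IO_BUFFER_SIZE);
  // _IOFBF: fully buffered; output accumulates here until flushed or full.
  std::setvbuf(stdout, cmi_stdout_buffer, _IOFBF, CMI_IO_BUFFER_SIZE);
}
```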
For reference, here are the backtraces for the issue:
symptom: crash

```
(lldb) bt
* thread #3, name = 'tlsglobals-cxx', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
* frame #0: 0x0000fffff75b9958 libc.so.6`unwind_stop(version=<unavailable>, actions=<unavailable>, exc_class=<unavailable>, exc_obj=<unavailable>, context=<unavailable>, stop_parameter=0x000038e4e00ff960) at unwind.c:80:8
frame #1: 0x0000fffff76fe828 libgcc_s.so.1`_Unwind_ForcedUnwind_Phase2(exc=0x000038e4e00ffcb0, context=0x000038e4e00ff1e0, frames_p=0x000038e4e00fee18) at unwind.inc:171:20
frame #2: 0x0000fffff76fec5c libgcc_s.so.1`_Unwind_ForcedUnwind(exc=0x000038e4e00ffcb0, stop=(libc.so.6`unwind_stop at unwind.c:43:1), stop_argument=0x000038e4e00ff960) at unwind.inc:218:10
frame #3: 0x0000fffff75b9a30 libc.so.6`__GI___pthread_unwind(buf=<unavailable>) at unwind.c:130:3
frame #4: 0x0000fffff75acec4 libc.so.6`__GI___pthread_enable_asynccancel [inlined] __do_cancel at pthreadP.h:280:3
frame #5: 0x0000fffff75acea0 libc.so.6`__GI___pthread_enable_asynccancel at cancellation.c:48:8
frame #6: 0x0000fffff760c30c libc.so.6`__GI___libc_write at write.c:26:10
frame #7: 0x0000fffff760c308 libc.so.6`__GI___libc_write(fd=<unavailable>, buf=0x0000ffffe8002a90, nbytes=55) at write.c:24:1
frame #8: 0x0000fffff75a817c libc.so.6`_IO_new_file_write(f=0x0000fffff76d5440, data=0x0000ffffe8002a90, n=55) at fileops.c:1180:9
frame #9: 0x0000fffff75a7548 libc.so.6`new_do_write(fp=0x0000fffff76d5440, data="#01 - [1](1) - 0x38e4e0100024 - privatization - passed\n", to_do=55) at fileops.c:448:11
frame #10: 0x0000fffff75a9260 libc.so.6`_IO_new_do_write(fp=<unavailable>, data=<unavailable>, to_do=55) at fileops.c:425:16
frame #11: 0x0000fffff75a8840 libc.so.6`_IO_new_file_xsputn at fileops.c:1243:11
frame #12: 0x0000fffff75a87ec libc.so.6`_IO_new_file_xsputn(f=0x0000fffff76d5440, data=<unavailable>, n=1) at fileops.c:1196:1
frame #13: 0x0000fffff7593634 libc.so.6`__vfprintf_internal at vfprintf-internal.c:239:16
frame #14: 0x0000fffff7593618 libc.so.6`__vfprintf_internal(s=0x0000fffff76d5440, format="#%02d - [%d](%d) - 0x%012lx - %s - %s\n", ap=va_list @ 0x0000aaaaabfa9260, mode_flags=0) at vfprintf-internal.c:1593:7
frame #15: 0x0000fffff7582364 libc.so.6`__printf(format=<unavailable>) at printf.c:33:10
frame #16: 0x0000aaaaaae66860 tlsglobals-cxx`::print_test_result(test=1, rank=1, my_wth=1, ptr=0x000038e4e0100024, name="privatization", result=1) at framework.C:32:9
frame #17: 0x0000aaaaaae66934 tlsglobals-cxx`::test_privatization_(failed=0x000038e4e00fff1c, test=0x000038e4e00fff24, rank=0x000038e4e00fff3c, my_wth=0x000038e4e00fff28, operation=0x000038e4e00fff20, global=0x000038e4e0100024) at framework.C:51:32
frame #18: 0x0000aaaaaae66d14 tlsglobals-cxx`::perform_test_batch_(failed=0x000038e4e00fff1c, test=0x000038e4e00fff24, rank=0x000038e4e00fff3c, my_wth=0x000038e4e00fff28, operation=0x000038e4e00fff20) at test.C:110:21
frame #19: 0x0000aaaaaae66ab4 tlsglobals-cxx`::perform_test_batch_dispatch(failed=0x000038e4e00fff1c, test=0x000038e4e00fff24, rank=0x000038e4e00fff3c, my_wth=0x000038e4e00fff28, operation=0x000038e4e00fff20) at framework.C:81:21
frame #20: 0x0000aaaaaae66b60 tlsglobals-cxx`::privatization_test_framework_() at framework.C:104:30
frame #21: 0x0000aaaaaae66f50 tlsglobals-cxx`::AMPI_Main(argc=1, argv=0x000038e4e010c3c0) at test.C:259:31
frame #22: 0x0000aaaaaaf0b210 tlsglobals-cxx`::AMPI_threadstart(data=0x0000ffffe8042f40) at ampi.C:1154:19
frame #23: 0x0000aaaaaae67a98 tlsglobals-cxx`::startTCharmThread(msg=0x0000ffffe8042f20) at tcharm.C:182:10
frame #24: 0x0000aaaaab123fdc tlsglobals-cxx`CthStartThread(fn1=43690, fn2=2867231336, arg1=65535, arg2=3892588320) at threads.C:1795:8
frame #25: 0x0000fffff757ab80 libc.so.6 at setcontext.S:123
```
symptom: hang

```
(lldb) bt
* thread #1, name = 'tlsglobals-f90', stop reason = signal SIGSTOP
* frame #0: 0x0000fffff78ecf44 libc.so.6`__GI___pthread_disable_asynccancel at futex-internal.h:146:13
frame #1: 0x0000fffff794c338 libc.so.6`__GI___libc_write at write.c:26:10
frame #2: 0x0000fffff794c308 libc.so.6`__GI___libc_write(fd=<unavailable>, buf=0x0000aaaaab3eb820, nbytes=28) at write.c:24:1
frame #3: 0x0000fffff78e817c libc.so.6`_IO_new_file_write(f=0x0000fffff7a15440, data=0x0000aaaaab3eb820, n=28) at fileops.c:1180:9
frame #4: 0x0000fffff78e7548 libc.so.6`new_do_write(fp=0x0000fffff7a15440, data="Beginning round of testing.\nee with: GreedyRefine \ndefault configuration.\nance benchmarking (build without --enable-error-checking to do so).\n..."..., to_do=28) at fileops.c:448:11
frame #5: 0x0000fffff78e9260 libc.so.6`_IO_new_do_write(fp=0x0000fffff7a15440, data=<unavailable>, to_do=28) at fileops.c:425:16
frame #6: 0x0000fffff78e9680 libc.so.6`_IO_new_file_overflow(f=0x0000fffff7a15440, ch=10) at fileops.c:783:9
frame #7: 0x0000fffff78ddc10 libc.so.6`__GI__IO_puts(str="Beginning round of testing.") at ioputs.c:41:10
frame #8: 0x0000aaaaaae5ab5c tlsglobals-f90`::perform_test_batch_dispatch(failed=0x00000004000fff1c, test=0x00000004000fff24, rank=0x00000004000fff3c, my_wth=0x00000004000fff28, operation=0x00000004000fff20) at framework.C:79:11
frame #9: 0x0000aaaaaae5ac20 tlsglobals-f90`::privatization_test_framework_() at framework.C:104:30
frame #10: 0x0000aaaaaae5aed0 tlsglobals-f90`mpi_main_ at test-tlsglobals.f90:61:43
frame #11: 0x0000aaaaaaefd370 tlsglobals-f90`::AMPI_threadstart(data=0x0000aaaaab448ed0) at ampi.C:1155:30
frame #12: 0x0000aaaaaae5ba28 tlsglobals-f90`::startTCharmThread(msg=0x0000aaaaab448eb0) at tcharm.C:182:10
frame #13: 0x0000aaaaab15b540 tlsglobals-f90`CthStartThread(fn1=43690, fn2=2867182072, arg1=43690, arg2=2873396912) at threads.C:1795:8
frame #14: 0x0000fffff78bab80 libc.so.6 at setcontext.S:123
```
Should `CMI_IO_BUFFER_EXPLICIT` become a runtime option rather than a build-time one? We could still enable it by default at build time for certain layers/machines. That way we could enable it at runtime when tlsglobals or pieglobals is requested. A hypothetical wiring is sketched below.
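A minimal sketch of the runtime wiring, assuming a hypothetical flag name; `CmiGetArgFlag` is the existing Converse argument helper, but everything else here is an assumption:

```cpp
#include "converse.h" // for CmiGetArgFlag

// Hypothetical wiring -- the flag name and this function are assumptions.
void CmiIOSetup(char **argv)
{
  bool explicit_io = CMI_IO_BUFFER_EXPLICIT; // build-time default per layer
  // Let a tlsglobals/pieglobals run opt in even when the build default is off.
  if (CmiGetArgFlag(argv, "+io_buffer_explicit"))
    explicit_io = true;
  if (explicit_io)
    CmiIOBufferInit(); // the setvbuf helper sketched earlier
}
```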
#3743 does the cleanup necessary to test `CMI_IO_BUFFER_EXPLICIT` on netlrts. The tests are able to pass, but they are frequently obstructed by an unrelated error very similar to what I see with #3729: after migrating, MPI_Comm_rank fails with `AMPI> cannot call MPI routines from non-AMPI threads!`. It looks like an AMPI rank is resuming before all of its contents have migrated in.
Have the recent fixes for migration on arm64 changed this?
With the fix in place, what I observe has only gotten weirder. All tests pass locally on the `main` branch, but with this PR the tlsglobals test still fails. I tried applying the same kind of no-inline encapsulation to all `CtvAccess`es in AMPI, and the failures take on new forms, including ones that fail in different ways depending on whether a debugger is attached. This is with `CMI_IO_BUFFER_EXPLICIT` enabled as well.
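To make the "no-inline encapsulation" concrete: the idea is to route each access through a function the optimizer cannot inline, so a computed TLS address can't be cached across a point where the runtime swaps the TLS segment. `CtvDeclare`/`CtvAccess` are the real Converse macros; the variable and wrapper names here are made up:

```cpp
#include "converse.h"

CtvDeclare(int, ampi_ctx); // hypothetical thread-private variable

__attribute__((noinline))
static int *ampi_ctx_ptr()
{
  // The opaque call boundary forces the TLS/Ctv address computation to be
  // redone here on every call, instead of being hoisted into the caller and
  // cached across a point where the runtime swaps the TLS segment.
  return &CtvAccess(ampi_ctx);
}
```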
I'm beginning to suspect the hang in a futex inside these print calls is a red herring, and the issue is actually something to do with the TLS segment itself. It might be related to ARM processors using a weaker memory model than x86. It could also be related to x86 using the `-mno-direct-tls-seg-refs` option, which may influence the codegen of TLS pointer calculations in ways that mask the issue there. Or, even more likely, to the fact that we don't (yet) use the `fsgsbase` instructions to fully swap the TLS segment address, so libc internals might operate as though tlsglobals is not active at all on x86, which is not the case on ARM.
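To make the `fsgsbase` hypothesis concrete, here is the rough shape of the ISA asymmetry; this is a sketch with assumed names, not the actual CmiTLS code:

```cpp
// Sketch of the ISA asymmetry, not the actual CmiTLS code; names assumed.
static inline void *getTP()
{
  void *tp;
#if defined(__aarch64__)
  // AArch64: TPIDR_EL0 is readable *and* writable from user mode, so a swap
  // really moves the whole segment -- including libc's own TLS -- at once.
  __asm__ volatile("mrs %0, tpidr_el0" : "=r"(tp));
#elif defined(__x86_64__)
  // x86-64 without fsgsbase: user code cannot cheaply rewrite the %fs base,
  // so the usual trick reads the TCB self-pointer stored at %fs:0 and swaps
  // the memory behind it. libc's view of the segment base never changes,
  // which could mask on x86 exactly the bugs that surface on ARM.
  __asm__ volatile("movq %%fs:0, %0" : "=r"(tp));
#endif
  return tp;
}

static inline void setTP(void *tp)
{
#if defined(__aarch64__)
  __asm__ volatile("msr tpidr_el0, %0" : : "r"(tp));
#else
  (void)tp; // no cheap user-mode equivalent without fsgsbase or arch_prctl
#endif
}
```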
Either way, the combination of using TLS for both Cpvs and privatized MPI data at the same time does not seem fully ironed out. I notice some functions in conv-core/threads.C are tagged `CMK_NOOPTIMIZE`, and I need to dig into the history of why that is.
I'm going to try conditioning `noexcept` on `!CMK_ERROR_CHECKING` in case that is masking debugging efforts. I also might try eliminating `CtvAccessOther`, and I want to audit all handling of these variables. Another thing that I think might help, and would complement the `fsgsbase` hypothesis, is to refresh the contents of the TLS segment after migration by copying the portion for the system and runtime, but not the user program, from the system segment to the user segment. This may require reconfiguring tlsglobals so that it builds the user program into a shared object like the newer privatization methods do, in order to measure where exactly the boundaries are. (This would still link at build time rather than dynamically like those methods.)
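A sketch of that refresh idea, assuming the runtime's slice of the TLS block sits at a boundary we can measure; the names, layout, and boundary measurement are all assumptions, not existing code:

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the proposed post-migration refresh, with hypothetical names.
void refreshRuntimeTLS(char *system_seg, char *user_seg,
                       std::size_t runtime_bytes /* measured boundary */)
{
  // Copy only the system/runtime slice out of the system segment; offsets
  // past runtime_bytes hold the user program's privatized globals, whose
  // migrated-in values must be preserved.
  std::memcpy(user_seg, system_seg, runtime_bytes);
}
```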
I have also switched to developing this on an NVIDIA Jetson Orin Nano to rule out any potential stability issues with the experimental Asahi Linux software stack on an M1 Mac. To Asahi's credit, I see the same issues on both.