
Implement CmiTLS for arm64 and ppc64le

Open evan-charmworks opened this issue 3 years ago • 15 comments

evan-charmworks avatar Jun 15 '21 00:06 evan-charmworks

  • [x] Linux arm64/aarch64/arm8 verified correct by CI
  • [ ] MPI arm64 is failing?
  • [ ] ppc64le does not work for unknown reasons. This might be caused by alignment, the displacement value, or the assembly in setTLS().
  • [x] 32-bit arm7 will not work because the segment pointer register is read-only in user mode. Symptom: crash in setTLS, SIGILL (illegal instruction)
  • [x] macOS ARM64

evan-charmworks avatar Jun 15 '21 20:06 evan-charmworks

This pull request fixes 1 alert when merging 1858b4880cb5193889189fd7f88b63605f867328 into 99e9a26c44c1fc885f97ff0834e2eeac59910316 - view on LGTM.com

fixed alerts:

  • 1 for FIXME comment

lgtm-com[bot] avatar Jul 26 '21 23:07 lgtm-com[bot]

This pull request fixes 1 alert when merging fee41644f86dad3286863dbfed2bb587c946eb4d into 99e9a26c44c1fc885f97ff0834e2eeac59910316 - view on LGTM.com

fixed alerts:

  • 1 for FIXME comment

lgtm-com[bot] avatar Jul 27 '21 00:07 lgtm-com[bot]

Should we merge this patch and open an issue for TLS support on ppc64le (if there isn't already one)?

stwhite91 avatar Oct 14 '21 16:10 stwhite91

> Should we merge this patch and open an issue for TLS support on ppc64le (if there isn't already one)?

Not yet, because the MPI-ARM64 CI is failing now.

evan-charmworks avatar Oct 14 '21 17:10 evan-charmworks

> Should we merge this patch and open an issue for TLS support on ppc64le (if there isn't already one)?

> Not yet, because the MPI-ARM64 CI is failing now.

This looks like a different error, so perhaps it's unrelated, but the MPI-ARM64 build has been very flaky for me today, often just not starting at all.

rbuch avatar Oct 14 '21 17:10 rbuch

It hangs in the tlsglobals test. That's a blocker.

evan-charmworks avatar Oct 14 '21 17:10 evan-charmworks

> Perhaps we should separate this PR into mac-arm64 support and merge that, then open a separate issue/PR for mpi-arm64 and ppc64le support.

Merging Apple Silicon support first is a good idea.

It is unfortunate that ARM and POWER CIs through Travis are no longer available to us.

evan-charmworks avatar Apr 14 '23 01:04 evan-charmworks

I've been able to look into the mpi-linux-arm8 issue under Asahi Linux on an M1 Mac. A null pointer is dereferenced inside libc during a printf call from the tlsglobals test. It can also happen on netlrts. There are similarities with #1858, #2814, #2932 but not close enough to be helpful.

My TLS implementation may not be correct, or there may be some issue related to how we privatize the entire thread-local storage of a process and not just the MPI user program.

evan-charmworks avatar Apr 17 '23 02:04 evan-charmworks

Updated this with the macOS parts split out, now merged, and the docs updated. No movement on the failures.

evan-charmworks avatar Apr 18 '23 06:04 evan-charmworks

I've debugged the AArch64 failures more and made the following observations:

  • If I remove all (f)printf statements from the privatization test, all tests pass. This even includes pieglobals with tlsglobals integrated! (#3567)
  • If I keep the statements but instead enable CMI_IO_BUFFER_EXPLICIT which manually calls setvbuf with a buffer the runtime manages, the tests pass on the MPI layer!
  • I suspect this would resolve netlrts too, but CMI_IO_BUFFER_EXPLICIT is disabled by CMK_CMIPRINTF_IS_A_BUILTIN, which is used by the machine layers built on the C++ Charmrun. I see no fundamental conflict; supporting both would just need a little refactoring.

For reference, here are the backtraces for the issue:

symptom: crash

```
(lldb) bt
* thread #3, name = 'tlsglobals-cxx', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: 0x0000fffff75b9958 libc.so.6`unwind_stop(version=<unavailable>, actions=<unavailable>, exc_class=<unavailable>, exc_obj=<unavailable>, context=<unavailable>, stop_parameter=0x000038e4e00ff960) at unwind.c:80:8
    frame #1: 0x0000fffff76fe828 libgcc_s.so.1`_Unwind_ForcedUnwind_Phase2(exc=0x000038e4e00ffcb0, context=0x000038e4e00ff1e0, frames_p=0x000038e4e00fee18) at unwind.inc:171:20
    frame #2: 0x0000fffff76fec5c libgcc_s.so.1`_Unwind_ForcedUnwind(exc=0x000038e4e00ffcb0, stop=(libc.so.6`unwind_stop at unwind.c:43:1), stop_argument=0x000038e4e00ff960) at unwind.inc:218:10
    frame #3: 0x0000fffff75b9a30 libc.so.6`__GI___pthread_unwind(buf=<unavailable>) at unwind.c:130:3
    frame #4: 0x0000fffff75acec4 libc.so.6`__GI___pthread_enable_asynccancel [inlined] __do_cancel at pthreadP.h:280:3
    frame #5: 0x0000fffff75acea0 libc.so.6`__GI___pthread_enable_asynccancel at cancellation.c:48:8
    frame #6: 0x0000fffff760c30c libc.so.6`__GI___libc_write at write.c:26:10
    frame #7: 0x0000fffff760c308 libc.so.6`__GI___libc_write(fd=<unavailable>, buf=0x0000ffffe8002a90, nbytes=55) at write.c:24:1
    frame #8: 0x0000fffff75a817c libc.so.6`_IO_new_file_write(f=0x0000fffff76d5440, data=0x0000ffffe8002a90, n=55) at fileops.c:1180:9
    frame #9: 0x0000fffff75a7548 libc.so.6`new_do_write(fp=0x0000fffff76d5440, data="#01 - [1](1) - 0x38e4e0100024 - privatization - passed\n", to_do=55) at fileops.c:448:11
    frame #10: 0x0000fffff75a9260 libc.so.6`_IO_new_do_write(fp=<unavailable>, data=<unavailable>, to_do=55) at fileops.c:425:16
    frame #11: 0x0000fffff75a8840 libc.so.6`_IO_new_file_xsputn at fileops.c:1243:11
    frame #12: 0x0000fffff75a87ec libc.so.6`_IO_new_file_xsputn(f=0x0000fffff76d5440, data=<unavailable>, n=1) at fileops.c:1196:1
    frame #13: 0x0000fffff7593634 libc.so.6`__vfprintf_internal at vfprintf-internal.c:239:16
    frame #14: 0x0000fffff7593618 libc.so.6`__vfprintf_internal(s=0x0000fffff76d5440, format="#%02d - [%d](%d) - 0x%012lx - %s - %s\n", ap=va_list @ 0x0000aaaaabfa9260, mode_flags=0) at vfprintf-internal.c:1593:7
    frame #15: 0x0000fffff7582364 libc.so.6`__printf(format=<unavailable>) at printf.c:33:10
    frame #16: 0x0000aaaaaae66860 tlsglobals-cxx`::print_test_result(test=1, rank=1, my_wth=1, ptr=0x000038e4e0100024, name="privatization", result=1) at framework.C:32:9
    frame #17: 0x0000aaaaaae66934 tlsglobals-cxx`::test_privatization_(failed=0x000038e4e00fff1c, test=0x000038e4e00fff24, rank=0x000038e4e00fff3c, my_wth=0x000038e4e00fff28, operation=0x000038e4e00fff20, global=0x000038e4e0100024) at framework.C:51:32
    frame #18: 0x0000aaaaaae66d14 tlsglobals-cxx`::perform_test_batch_(failed=0x000038e4e00fff1c, test=0x000038e4e00fff24, rank=0x000038e4e00fff3c, my_wth=0x000038e4e00fff28, operation=0x000038e4e00fff20) at test.C:110:21
    frame #19: 0x0000aaaaaae66ab4 tlsglobals-cxx`::perform_test_batch_dispatch(failed=0x000038e4e00fff1c, test=0x000038e4e00fff24, rank=0x000038e4e00fff3c, my_wth=0x000038e4e00fff28, operation=0x000038e4e00fff20) at framework.C:81:21
    frame #20: 0x0000aaaaaae66b60 tlsglobals-cxx`::privatization_test_framework_() at framework.C:104:30
    frame #21: 0x0000aaaaaae66f50 tlsglobals-cxx`::AMPI_Main(argc=1, argv=0x000038e4e010c3c0) at test.C:259:31
    frame #22: 0x0000aaaaaaf0b210 tlsglobals-cxx`::AMPI_threadstart(data=0x0000ffffe8042f40) at ampi.C:1154:19
    frame #23: 0x0000aaaaaae67a98 tlsglobals-cxx`::startTCharmThread(msg=0x0000ffffe8042f20) at tcharm.C:182:10
    frame #24: 0x0000aaaaab123fdc tlsglobals-cxx`CthStartThread(fn1=43690, fn2=2867231336, arg1=65535, arg2=3892588320) at threads.C:1795:8
    frame #25: 0x0000fffff757ab80 libc.so.6 at setcontext.S:123
```

symptom: hang

```
(lldb) bt
* thread #1, name = 'tlsglobals-f90', stop reason = signal SIGSTOP
  * frame #0: 0x0000fffff78ecf44 libc.so.6`__GI___pthread_disable_asynccancel at futex-internal.h:146:13
    frame #1: 0x0000fffff794c338 libc.so.6`__GI___libc_write at write.c:26:10
    frame #2: 0x0000fffff794c308 libc.so.6`__GI___libc_write(fd=<unavailable>, buf=0x0000aaaaab3eb820, nbytes=28) at write.c:24:1
    frame #3: 0x0000fffff78e817c libc.so.6`_IO_new_file_write(f=0x0000fffff7a15440, data=0x0000aaaaab3eb820, n=28) at fileops.c:1180:9
    frame #4: 0x0000fffff78e7548 libc.so.6`new_do_write(fp=0x0000fffff7a15440, data="Beginning round of testing.\nee with: GreedyRefine \ndefault configuration.\nance benchmarking (build without --enable-error-checking to do so).\n..."..., to_do=28) at fileops.c:448:11
    frame #5: 0x0000fffff78e9260 libc.so.6`_IO_new_do_write(fp=0x0000fffff7a15440, data=<unavailable>, to_do=28) at fileops.c:425:16
    frame #6: 0x0000fffff78e9680 libc.so.6`_IO_new_file_overflow(f=0x0000fffff7a15440, ch=10) at fileops.c:783:9
    frame #7: 0x0000fffff78ddc10 libc.so.6`__GI__IO_puts(str="Beginning round of testing.") at ioputs.c:41:10
    frame #8: 0x0000aaaaaae5ab5c tlsglobals-f90`::perform_test_batch_dispatch(failed=0x00000004000fff1c, test=0x00000004000fff24, rank=0x00000004000fff3c, my_wth=0x00000004000fff28, operation=0x00000004000fff20) at framework.C:79:11
    frame #9: 0x0000aaaaaae5ac20 tlsglobals-f90`::privatization_test_framework_() at framework.C:104:30
    frame #10: 0x0000aaaaaae5aed0 tlsglobals-f90`mpi_main_ at test-tlsglobals.f90:61:43
    frame #11: 0x0000aaaaaaefd370 tlsglobals-f90`::AMPI_threadstart(data=0x0000aaaaab448ed0) at ampi.C:1155:30
    frame #12: 0x0000aaaaaae5ba28 tlsglobals-f90`::startTCharmThread(msg=0x0000aaaaab448eb0) at tcharm.C:182:10
    frame #13: 0x0000aaaaab15b540 tlsglobals-f90`CthStartThread(fn1=43690, fn2=2867182072, arg1=43690, arg2=2873396912) at threads.C:1795:8
    frame #14: 0x0000fffff78bab80 libc.so.6 at setcontext.S:123
```

evan-charmworks avatar Apr 23 '23 23:04 evan-charmworks

Should CMI_IO_BUFFER_EXPLICIT become a runtime option rather than a build-time one? We could still enable it by default at build-time for certain layers/machines. That way we could enable it at runtime when tlsglobals or pieglobals is requested.

stwhite91 avatar Apr 26 '23 16:04 stwhite91

#3743 does the cleanup necessary to test CMI_IO_BUFFER_EXPLICIT on netlrts. The tests are able to pass, but they are frequently obstructed by an unrelated error very similar to what I see with #3729: after migrating, MPI_Comm_rank fails, complaining `AMPI> cannot call MPI routines from non-AMPI threads!`. It looks like an AMPI rank is resuming before all of its contents have migrated in.

evan-charmworks avatar Jun 26 '23 05:06 evan-charmworks

Have the recent fixes for migration on arm64 changed this?

stwhite91 avatar Jul 17 '23 18:07 stwhite91

With the fix in place, what I observe has only gotten weirder. All tests pass locally on the main branch, but with this PR the tlsglobals test still fails. I tried applying the same kind of no-inline encapsulation to all CtvAccesses in AMPI, and the failures took on new forms, including some that behave differently depending on whether a debugger is attached. This is with CMI_IO_BUFFER_EXPLICIT enabled as well.

I'm beginning to suspect the hang in a futex inside these print calls is a red herring, and the issue is actually something to do with the TLS segment. It might be related to ARM's weaker memory model compared to x86. It could also be related to the -mno-direct-tls-seg-refs option used on x86, which may influence the codegen of TLS pointer calculations in ways that mask the issue there. Or, even more likely, it may be that because we don't (yet) use the fsgsbase instructions to fully swap the TLS segment address, libc internals on x86 operate as if tlsglobals is not active at all, which is not the case on ARM.

Either way, using TLS for both Cpvs and privatized MPI data at the same time does not seem fully ironed out. I notice some functions in conv-core/threads.C are tagged CMK_NOOPTIMIZE, and I need to dig into the history of why that is.

I'm going to try conditioning noexcept on !CMK_ERROR_CHECKING in case that is masking debugging efforts. I also might try eliminating CtvAccessOther, and I want to audit all handling of these variables. Another thing that I think might help, and would complement the fsgsbase hypothesis, is to refresh the contents of the TLS segment after migration by copying the portion for the system and runtime, but not the user program, from the system segment to the user segment. This may require reconfiguring tlsglobals so that it builds the user program into a shared object like the newer privatization methods, in order to measure where exactly the boundaries are. (This would still link at build time rather than dynamically like those methods.)

I have also switched to developing this on an NVIDIA Jetson Orin Nano to rule out any potential stability issues with the experimental Asahi Linux software stack on an M1 Mac. To their credit, I see the same issues on both.

evan-charmworks avatar Jul 17 '23 18:07 evan-charmworks