
MSA Examples Failing

Open jszaday opened this issue 3 years ago • 2 comments

Two of the MSA examples are broken.

examples/multiphaseSharedArrays/matmul does not compile. After superficial fixes, it builds but crashes with a segmentation fault at run time:

Running as 1 OS processes: t2d 2 1048576 100 500 100 1
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 1 t2d 2 1048576 100 500 100 1
Charm++> Running in non-SMP mode: 1 processes (PEs)
Converse/Charm++ Commit ID: v7.1.0-devel-132-g2d58c2fb7
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.102 seconds.
[cordelia:160910:0:160910] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2e0)
1	100	500	500	100	2	1048576	U	0.047026	5000	1	cordelia.local
==== backtrace (tid: 160910) ====
 0  /home/szaday2/workspace/ucx/build/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7ffff7dae534]
 1  /home/szaday2/workspace/ucx/build/lib/libucs.so.0(+0x2d76f) [0x7ffff7dae76f]
 2  /home/szaday2/workspace/ucx/build/lib/libucs.so.0(+0x2da56) [0x7ffff7daea56]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x46520) [0x7ffff784e520]
 4  t2d(_ZN14MSA_CacheGroupId12DefaultEntryIdLb0EELj5000EE10accessPageEj16MSA_Page_Fault_t+0x1a) [0x4b638a]
 5  t2d(_ZN17CkIndex_TestArray22_callthr_Kontinue_voidEP12CkThrCallArg+0x3f8) [0x4ac5f8]
 6  t2d(CthStartThread+0x12) [0x5e68e2]
 7  t2d(make_fcontext+0x2f) [0x5e6d5f]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cordelia exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

real	0m2.385s
user	0m0.077s
sys	0m0.043s
make: *** [Makefile:52: test] Error 139

At the time of the failure, the state of the cache group (MSA_CacheGroup::pageTable in particular) seems to be invalid.
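To make the backtrace easier to read, frames 4 and 5 can be demangled. The sketch below is only an illustration and not part of the MSA sources: the helper `demangle` and the two constants are hypothetical names, and it assumes a GCC or Clang toolchain that provides `abi::__cxa_demangle`. The mangled strings are copied verbatim from the backtrace above.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (Itanium C++ ABI, GCC/Clang)
#include <cstdlib>    // std::free
#include <string>

// Mangled symbols copied from frames 4 and 5 of the backtrace.
// The constant names are hypothetical and exist only for this sketch.
const char* const kFrame4 =
    "_ZN14MSA_CacheGroupId12DefaultEntryIdLb0EELj5000EE10accessPageEj16MSA_Page_Fault_t";
const char* const kFrame5 =
    "_ZN17CkIndex_TestArray22_callthr_Kontinue_voidEP12CkThrCallArg";

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer is malloc-allocated by __cxa_demangle
    return result;
}
```

Calling `demangle(kFrame4)` should yield a member function `accessPage` of an `MSA_CacheGroup<double, ...>` instantiation, taking an `unsigned int` and an `MSA_Page_Fault_t`. That matches the observation that the fault occurs inside the cache group's page access. `demangle(kFrame5)` should resolve to `CkIndex_TestArray::_callthr_Kontinue_void(CkThrCallArg*)`, the threaded entry method that made the call.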

Likewise, examples/multiphaseSharedArrays/moldyn does not compile. After superficial fixes, it builds but hangs at run time.

jszaday, Jan 23 '22 00:01

How do you even build the msa library along with LIBS? Do you put them in quotes with the build script: ./build "target1 target2 ..." as in ./build "LIBS msa"?

BJWiley233, Jan 27 '22 05:01

I am unsure about how to compile MSA with the build script.

I typically run make from src/libs/ck-libs/multiphaseSharedArrays/ to make -module msa available.

jszaday, Jan 27 '22 14:01