superlu_dist

Some tests are failing with Intel compilers / Intel MPI

Open Darkless012 opened this issue 4 years ago • 1 comment

SuperLU_DIST is built with CMake, and the tests are run as: export ARGS="$ARGS --tests-regex pdtest_[12]x1_[13]_2_8_20_SP" && make test, which results in this call: ctest --force-new-ctest-process --tests-regex pdtest_[12]x1_[13]_2_8_20_SP

pdtest_1x1_1_2_8_20_SP and pdtest_1x1_3_2_8_20_SP fail, while pdtest_2x1_1_2_8_20_SP and pdtest_2x1_3_2_8_20_SP pass

The tests also pass with the GCC compilers and OpenBLAS.
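
For reference, the regular expression passed to --tests-regex selects exactly the four test names mentioned above. The following is a minimal sketch in Python, not part of the original report. It assumes that Python's re module treats this simple pattern the same way CTest's own regex engine does (character classes and substring search). The last two candidate names are hypothetical and are included only to show non-matches.

```python
import re

# Pattern copied from the ctest invocation in the report.
pattern = re.compile(r"pdtest_[12]x1_[13]_2_8_20_SP")

candidates = [
    "pdtest_1x1_1_2_8_20_SP",
    "pdtest_1x1_3_2_8_20_SP",
    "pdtest_2x1_1_2_8_20_SP",
    "pdtest_2x1_3_2_8_20_SP",
    # Hypothetical names, used only to illustrate what the pattern rejects.
    "pdtest_1x1_2_2_8_20_SP",
    "pdtest_3x1_1_2_8_20_SP",
]

# Substring search: a name is selected if the pattern matches anywhere in it.
selected = [name for name in candidates if pattern.search(name)]

for name in selected:
    print(name)
```

Only the first four names are printed, which agrees with the four tests that appear in the log below.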

Test log:

Running tests...
/vscmnt/gent_kyukon_apps/_kyukon_home_apps/CO7/skylake-ib/software/CMake/3.16.4-GCCcore-9.3.0/bin/ctest --force-new-ctest-process  --tests-regex pdtest_[12]x1_[13]_2_8_20_SP
Test project /tmp/vsc43143/easybuild/build/SuperLU_DIST/6.4.0/intel-2020a/easybuild_obj
    Start 1: pdtest_1x1_1_2_8_20_SP
1/4 Test #1: pdtest_1x1_1_2_8_20_SP ...........***Failed    1.02 sec
Time to read and distribute matrix 0.00
[node3201:43390:0:43401] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x15c)
==== backtrace (tid:  43401) ====
 0 0x00000000000214be ucs_debug_print_backtrace()  /tmp/vsc40003/easybuild/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000034a552 mkl_blas_avx512_xdgemv()  ???:0
 2 0x000000000022bb4e mkl_blas_xdgemv()  ???:0
 3 0x0000000000250b6a mkl_blas_dgemv()  ???:0
 4 0x00000000002fb9f9 mkl_blas_dgemm()  ???:0
 5 0x000000000019251f DGEMM()  ???:0
 6 0x0000000000476fea dlsum_bmod_inv()  ???:0
 7 0x0000000000476ee1 dlsum_bmod_inv()  ???:0
 8 0x00000000000e12c2 _INTERNALfadf56ac::__kmp_invoke_task()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:1782
 9 0x00000000000ecdcc _INTERNALfadf56ac::__kmp_execute_tasks_template<kmp_flag_64<false, true> >()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:3185
10 0x00000000000ecdcc __kmp_execute_tasks_64<false, true>()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:3284
11 0x000000000006bb01 kmp_flag_64<false, true>::execute_tasks()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:964
12 0x000000000006bb01 _INTERNAL51694e09::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true>()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:374
13 0x000000000006e23d kmp_flag_64<false, true>::wait()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:971
14 0x000000000007526b __kmp_fork_barrier()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_barrier.cpp:2369
15 0x00000000000b1170 __kmp_launch_thread()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:6080
16 0x000000000012d19c _INTERNAL27dd4e00::__kmp_launch_worker()  /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:593
17 0x0000000000007ea5 start_thread()  pthread_create.c:0
18 0x00000000000fe8dd __clone()  ???:0
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 43390 RUNNING AT node3201.victini.os
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

    Start 2: pdtest_1x1_3_2_8_20_SP
2/4 Test #2: pdtest_1x1_3_2_8_20_SP ...........***Failed    0.77 sec
Time to read and distribute matrix 0.00
[node3201:43411:1:43426] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4b0000009a)
[node3201:43411:0:43427] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4b00000131)
[node3201:43411:2:43423] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4b0000015b)
==== backtrace (tid:  43423) ====
 0 0x00000000000214be ucs_debug_print_backtrace()  /tmp/vsc40003/easybuild/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x0000000000cc2d3e mkl_blas_avx512_dgemm_kernel_nocopy_NN_b0()  ???:0
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 43411 RUNNING AT node3201.victini.os
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

    Start 7: pdtest_2x1_1_2_8_20_SP
3/4 Test #7: pdtest_2x1_1_2_8_20_SP ...........   Passed    0.38 sec
    Start 8: pdtest_2x1_3_2_8_20_SP
4/4 Test #8: pdtest_2x1_3_2_8_20_SP ...........   Passed    3.07 sec

50% tests passed, 2 tests failed out of 4

Total Test time (real) =   5.24 sec

The following tests FAILED:
          1 - pdtest_1x1_1_2_8_20_SP (Failed)
          2 - pdtest_1x1_3_2_8_20_SP (Failed)
Errors while running CTest
make: *** [test] Error 8

Darkless012, Nov 26 '20 19:11

What machine and OS are you using? The cases with 2 processes (process grid 2x1) passed, but the cases with 1 process (process grid 1x1) failed. The other parameters are the same.

xiaoyeli, Nov 29 '20 20:11
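
The pass/fail split by process grid noted in the reply can be checked mechanically against the log. This is a minimal Python sketch, not taken from the thread. It assumes the field right after pdtest_ (1x1 or 2x1) is the process grid, as the reply describes. The four result lines are copied from the test log in the report, and the names line_re and by_grid are illustrative.

```python
import re
from collections import defaultdict

# Result lines copied from the ctest log in the report.
log = """\
1/4 Test #1: pdtest_1x1_1_2_8_20_SP ...........***Failed    1.02 sec
2/4 Test #2: pdtest_1x1_3_2_8_20_SP ...........***Failed    0.77 sec
3/4 Test #7: pdtest_2x1_1_2_8_20_SP ...........   Passed    0.38 sec
4/4 Test #8: pdtest_2x1_3_2_8_20_SP ...........   Passed    3.07 sec
"""

# Capture the process grid (e.g. "1x1") and the outcome of each test line.
line_re = re.compile(
    r"Test\s+#\d+:\s+pdtest_(\d+x\d+)_\S+\s+\.+\s*(?:\*\*\*)?(Passed|Failed)"
)

# Group outcomes by process grid.
by_grid = defaultdict(list)
for match in line_re.finditer(log):
    grid, outcome = match.groups()
    by_grid[grid].append(outcome)

for grid, outcomes in sorted(by_grid.items()):
    print(grid, outcomes)
```

With these four lines, both 1x1 runs are grouped as Failed and both 2x1 runs as Passed.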