superlu_dist
Some tests are failing with Intel compilers / Intel MPI
SuperLU_DIST is built with CMake, and the tests are run as:
export ARGS="$ARGS --tests-regex pdtest_[12]x1_[13]_2_8_20_SP" && make test
which results in this call:
ctest --force-new-ctest-process --tests-regex pdtest_[12]x1_[13]_2_8_20_SP
pdtest_1x1_1_2_8_20_SP and pdtest_1x1_3_2_8_20_SP fail, while pdtest_2x1_1_2_8_20_SP and pdtest_2x1_3_2_8_20_SP pass.
The same tests also pass when built with GCC compilers and OpenBLAS.
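For reference, here is a minimal sketch of which test names the `--tests-regex` pattern above selects. It uses Python's `re` module as a stand-in for CTest's regex engine (an assumption; for a pattern made only of literals and simple character classes, the two should agree). The candidate list is hypothetical: the four names from the log plus two invented names that should not match.

```python
import re

# Pattern copied verbatim from the ctest call.
# Assumption: Python's `re` behaves like CTest's regex engine for this pattern.
pattern = re.compile(r"pdtest_[12]x1_[13]_2_8_20_SP")

# Hypothetical candidate list: the four test names seen in the log,
# plus two made-up names that the pattern should reject.
candidates = [
    "pdtest_1x1_1_2_8_20_SP",
    "pdtest_1x1_3_2_8_20_SP",
    "pdtest_2x1_1_2_8_20_SP",
    "pdtest_2x1_3_2_8_20_SP",
    "pdtest_1x2_1_2_8_20_SP",  # grid 1x2: rejected, pattern requires "x1"
    "pdtest_2x1_2_2_8_20_SP",  # third field 2: rejected, pattern requires 1 or 3
]

# CTest matches the regex anywhere in the test name, so use search().
selected = [name for name in candidates if pattern.search(name)]

for name in selected:
    print(name)
```

So the run covers the 1x1 and 2x1 process grids with the third field of the name set to 1 or 3, which are exactly the four tests reported in the log.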
Test log:
Running tests...
/vscmnt/gent_kyukon_apps/_kyukon_home_apps/CO7/skylake-ib/software/CMake/3.16.4-GCCcore-9.3.0/bin/ctest --force-new-ctest-process --tests-regex pdtest_[12]x1_[13]_2_8_20_SP
Test project /tmp/vsc43143/easybuild/build/SuperLU_DIST/6.4.0/intel-2020a/easybuild_obj
Start 1: pdtest_1x1_1_2_8_20_SP
1/4 Test #1: pdtest_1x1_1_2_8_20_SP ...........***Failed 1.02 sec
Time to read and distribute matrix 0.00
[node3201:43390:0:43401] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x15c)
==== backtrace (tid: 43401) ====
0 0x00000000000214be ucs_debug_print_backtrace() /tmp/vsc40003/easybuild/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
1 0x000000000034a552 mkl_blas_avx512_xdgemv() ???:0
2 0x000000000022bb4e mkl_blas_xdgemv() ???:0
3 0x0000000000250b6a mkl_blas_dgemv() ???:0
4 0x00000000002fb9f9 mkl_blas_dgemm() ???:0
5 0x000000000019251f DGEMM() ???:0
6 0x0000000000476fea dlsum_bmod_inv() ???:0
7 0x0000000000476ee1 dlsum_bmod_inv() ???:0
8 0x00000000000e12c2 _INTERNALfadf56ac::__kmp_invoke_task() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:1782
9 0x00000000000ecdcc _INTERNALfadf56ac::__kmp_execute_tasks_template<kmp_flag_64<false, true> >() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:3185
10 0x00000000000ecdcc __kmp_execute_tasks_64<false, true>() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_tasking.cpp:3284
11 0x000000000006bb01 kmp_flag_64<false, true>::execute_tasks() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:964
12 0x000000000006bb01 _INTERNAL51694e09::__kmp_wait_template<kmp_flag_64<false, true>, true, false, true>() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:374
13 0x000000000006e23d kmp_flag_64<false, true>::wait() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_wait_release.h:971
14 0x000000000007526b __kmp_fork_barrier() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_barrier.cpp:2369
15 0x00000000000b1170 __kmp_launch_thread() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:6080
16 0x000000000012d19c _INTERNAL27dd4e00::__kmp_launch_worker() /nfs/site/proj/openmp/promo/20200205/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:593
17 0x0000000000007ea5 start_thread() pthread_create.c:0
18 0x00000000000fe8dd __clone() ???:0
=================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 43390 RUNNING AT node3201.victini.os
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
Start 2: pdtest_1x1_3_2_8_20_SP
2/4 Test #2: pdtest_1x1_3_2_8_20_SP ...........***Failed 0.77 sec
Time to read and distribute matrix 0.00
[node3201:43411:1:43426] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4b0000009a)
[node3201:43411:0:43427] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4b00000131)
[node3201:43411:2:43423] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b4b0000015b)
==== backtrace (tid: 43423) ====
0 0x00000000000214be ucs_debug_print_backtrace() /tmp/vsc40003/easybuild/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
1 0x0000000000cc2d3e mkl_blas_avx512_dgemm_kernel_nocopy_NN_b0() ???:0
=================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 43411 RUNNING AT node3201.victini.os
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
Start 7: pdtest_2x1_1_2_8_20_SP
3/4 Test #7: pdtest_2x1_1_2_8_20_SP ........... Passed 0.38 sec
Start 8: pdtest_2x1_3_2_8_20_SP
4/4 Test #8: pdtest_2x1_3_2_8_20_SP ........... Passed 3.07 sec
50% tests passed, 2 tests failed out of 4
Total Test time (real) = 5.24 sec
The following tests FAILED:
1 - pdtest_1x1_1_2_8_20_SP (Failed)
2 - pdtest_1x1_3_2_8_20_SP (Failed)
Errors while running CTest
make: *** [test] Error 8
What machine and OS are you using? The tests pass with 2 processes (process grid 2x1) but fail with 1 process (process grid 1x1); the other parameters are the same.