back_spawn tests failing with SEGFAULT
When building the latest version of NECI (commit 558e88c5ae6c30d0505a9badbc69111be0866ba1) with GCC 12.3.0, we're consistently seeing segfaults in the tests, even when a lot of memory is available on the system (~185GB, with 36 cores).
94% tests passed, 6 tests failed out of 96
Total Test time (real) = 372.92 sec
The following tests FAILED:
22 - test_neci_back_spawn_excit_gen (SEGFAULT)
23 - test_kneci_back_spawn_excit_gen (SEGFAULT)
24 - test_dneci_back_spawn_excit_gen (SEGFAULT)
25 - test_mneci_back_spawn_excit_gen (SEGFAULT)
26 - test_kdneci_back_spawn_excit_gen (SEGFAULT)
27 - test_kmneci_back_spawn_excit_gen (SEGFAULT)
More details on test_neci_back_spawn_excit_gen below incl. backtrace, the actual full error is "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." (other tests fail in a very similar way, and always in back_spawn.F90:593):
Start 22: test_neci_back_spawn_excit_gen
22/96 Test #22: test_neci_back_spawn_excit_gen .............***Exception: SegFault 0.21 sec
Test module initialized
. : successful assert, F : failed assert
testing: calc_pgen_back_spawn_ueg
with necessary global data:
nel: 2
nOccBeta: 1
nOccAlpha: 1
tUEG: T
tNoFailAb: F
projedet: 1 2
dSFMT_init()
niftot: 2
n_int: 8
set_flag()
get_initiator_flag()
flag: 1
nBasis: 4
.........
testing: calc_pgen_back_spawn_hubbard
testing: calc_pgen_back_spawn_hubbard
with necessary global data:
nel: 2
nOccBeta: 1
nOccAlpha: 1
tHub: T
projedet: 1 2
dSFMT_init()
niftot: 2
n_int: 8
set_flag()
get_initiator_flag()
IlutBits%ind_flag: 1
nBasis: 4
.......
testing: calc_pgen_back_spawn_ueg
with necessary global data:
nel: 2
nBasis: 4
niftot: 1
encodebitdet()
dSFMT_init
......
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x147aa063e6ef in ???
#1 0x420ebe in __back_spawn_MOD_pick_occupied_orbital_ueg
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn.F90:593
#2 0x425a68 in __back_spawn_excit_gen_MOD_calc_pgen_back_spawn_ueg_new
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn_excit_gen.F90:284
#3 0x407c4e in calc_pgen_back_spawn_ueg_new_test
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:487
#4 0x785d3d in __fruit_MOD_run_test_case_named_
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/fruit_src/fruit.f90:909
#5 0x407108 in back_spawn_excit_gen_test_driver
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:29
#6 0x407108 in test_back_spawn_excit_gen
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:16
#7 0x407108 in main
at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:5
Are there any known problems when NECI is built with a recent compiler?
Any suggestions on how to get to the bottom of this, could it be a bug in back_spawn.F90?
Although we've seen very similar problems before with an older version of NECI and GCC 11.3.0 (see https://github.com/easybuilders/easybuild-easyconfigs/issues/17164), we didn't observe these problems when using GCC 12.2.0 to build NECI commit 558e88c5ae6c30d0505a9badbc69111be0866ba1, which seems strange to me...
Thank you very much for taking the time for writing this. I tested NECI, commit https://github.com/ghb24/NECI_STABLE/commit/558e88c5ae6c30d0505a9badbc69111be0866ba1, by compiling it with GCC 12.3.0, OpenMPI 4.2.5, on both Intel and AMD hardwares. I could not reproduce the issue.
I compile the program by using:
cmake -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_CXX_COMPILER=mpic++ -DCMAKE_C_COMPILER=mpicc /path/to/source
Could please tell me the following info?
- How you compile NECI
- MPI version
- OS and libgfortran version
@demetriovilardi Thank you for looking into this.
We saw the segfaults happen on both RHEL 8.8 and 9.4 systems, across a range of CPU types (different generations of Intel & AMD). OpenMPI version we're using for this is v4.1.5. gfortran version is 12.3.0 (we install our own GCC)
We build NECI via EasyBuild, which uses the following cmake command:
cmake -DCMAKE_INSTALL_PREFIX=/software/NECI/20230620-foss-2023a -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DPYTHON_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython3_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DENABLE_HDF5=ON
In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)
In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)
Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?
@demetriovilardi it seems like a good idea to also add -fno-math-errno to Release builds, isn't it?
¹ I changed groups and am not anymore one of the maintainers.
In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)
Just a quick comment on this one.¹ NECI anyway uses
-O3 -march=nativefor Release builds and-ftree-vectorizeis implied withO3; is there a specific reason you want to avoid O3?
This has grown historically I guess: -O2 used to be a lot more stable in terms of generating correct machine instructions than -O3, and at some point we noticed a significant benefit of enabling -ftree-vectorize.
There may still be some compiler optimizations enabled with -O3 (and not in -O2) that could have a negative side effect, like on floating-point accuracy for example.
After analysing the test with valgrind, we noticed an issue about a missing initialisation of a variable inside the test back_spawn_excit_gen.F90, that caused a memory problem at the line you specified. This issue doesn't affect the main program and we will fix it soon on our internal repository.
@demetriovilardi Can you point us to the details of the fix, so we can apply a patch for this and have a test suite that passes in full?
Essentially, inside unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90, the array KPointToBasisFn is not initialised after allocation, in test_back_spawn_excit_gen.F90:66 and test_back_spawn_excit_gen.F90:405. One can add
KPointToBasisFn = -1
one line after allocation in both cases. After this change I get no further valgrind warnings or errors.
@PetrKralCZ Can you look into making a patch file for NECI based on this info, and see if that fixes the segfaults we were seeing in the test suite?