NECI_STABLE icon indicating copy to clipboard operation
NECI_STABLE copied to clipboard

back_spawn tests failing with SEGFAULT

Open boegel opened this issue 1 year ago • 8 comments

When building the latest version of NECI (commit 558e88c5ae6c30d0505a9badbc69111be0866ba1) with GCC 12.3.0, we're consistently seeing segfaults in the tests, even when a lot of memory is available on the system (~185GB, with 36 cores).

94% tests passed, 6 tests failed out of 96

Total Test time (real) = 372.92 sec

The following tests FAILED:
         22 - test_neci_back_spawn_excit_gen (SEGFAULT)
         23 - test_kneci_back_spawn_excit_gen (SEGFAULT)
         24 - test_dneci_back_spawn_excit_gen (SEGFAULT)
         25 - test_mneci_back_spawn_excit_gen (SEGFAULT)
         26 - test_kdneci_back_spawn_excit_gen (SEGFAULT)
         27 - test_kmneci_back_spawn_excit_gen (SEGFAULT)

More details on test_neci_back_spawn_excit_gen below incl. backtrace, the actual full error is "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." (other tests fail in a very similar way, and always in back_spawn.F90:593):

      Start 22: test_neci_back_spawn_excit_gen
22/96 Test #22: test_neci_back_spawn_excit_gen .............***Exception: SegFault  0.21 sec

 Test module initialized

    . : successful assert,   F : failed assert


 testing: calc_pgen_back_spawn_ueg
 with necessary global data:
 nel:            2
 nOccBeta:            1
 nOccAlpha:            1
 tUEG:  T
 tNoFailAb:  F
 projedet:            1           2
 dSFMT_init()
 niftot:            2
 n_int:            8
 set_flag()
 get_initiator_flag()
 flag:            1
 nBasis:            4
.........
 testing: calc_pgen_back_spawn_hubbard

 testing: calc_pgen_back_spawn_hubbard
 with necessary global data:
 nel:            2
 nOccBeta:            1
 nOccAlpha:            1
 tHub:  T
 projedet:            1           2
 dSFMT_init()
 niftot:            2
 n_int:            8
 set_flag()
 get_initiator_flag()
 IlutBits%ind_flag:            1
 nBasis:            4
.......
 testing: calc_pgen_back_spawn_ueg
 with necessary global data:
 nel:            2
 nBasis:            4
 niftot:            1
 encodebitdet()
 dSFMT_init
......
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x147aa063e6ef in ???
#1  0x420ebe in __back_spawn_MOD_pick_occupied_orbital_ueg
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn.F90:593
#2  0x425a68 in __back_spawn_excit_gen_MOD_calc_pgen_back_spawn_ueg_new
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn_excit_gen.F90:284
#3  0x407c4e in calc_pgen_back_spawn_ueg_new_test
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:487
#4  0x785d3d in __fruit_MOD_run_test_case_named_
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/fruit_src/fruit.f90:909
#5  0x407108 in back_spawn_excit_gen_test_driver
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:29
#6  0x407108 in test_back_spawn_excit_gen
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:16
#7  0x407108 in main
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:5

Are there any known problems when NECI is built with a recent compiler? Any suggestions on how to get to the bottom of this, could it be a bug in back_spawn.F90?

Although we've seen very similar problems before with an older version of NECI and GCC 11.3.0 (see https://github.com/easybuilders/easybuild-easyconfigs/issues/17164), we didn't observe these problems when using GCC 12.2.0 to build NECI commit 558e88c5ae6c30d0505a9badbc69111be0866ba1, which seems strange to me...

boegel avatar Nov 13 '24 10:11 boegel

Thank you very much for taking the time for writing this. I tested NECI, commit https://github.com/ghb24/NECI_STABLE/commit/558e88c5ae6c30d0505a9badbc69111be0866ba1, by compiling it with GCC 12.3.0, OpenMPI 4.2.5, on both Intel and AMD hardwares. I could not reproduce the issue.

I compile the program by using: cmake -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_CXX_COMPILER=mpic++ -DCMAKE_C_COMPILER=mpicc /path/to/source

Could please tell me the following info?

  • How you compile NECI
  • MPI version
  • OS and libgfortran version

demetriovilardi avatar Nov 15 '24 09:11 demetriovilardi

@demetriovilardi Thank you for looking into this.

We saw the segfaults happen on both RHEL 8.8 and 9.4 systems, across a range of CPU types (different generations of Intel & AMD). OpenMPI version we're using for this is v4.1.5. gfortran version is 12.3.0 (we install our own GCC)

We build NECI via EasyBuild, which uses the following cmake command:

cmake  -DCMAKE_INSTALL_PREFIX=/software/NECI/20230620-foss-2023a -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DPYTHON_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython3_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DENABLE_HDF5=ON

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

boegel avatar Nov 16 '24 14:11 boegel

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?

@demetriovilardi it seems like a good idea to also add -fno-math-errno to Release builds, isn't it?

¹ I changed groups and am not anymore one of the maintainers.

mcocdawc avatar Nov 16 '24 17:11 mcocdawc

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?

This has grown historically I guess: -O2 used to be a lot more stable in terms of generating correct machine instructions than -O3, and at some point we noticed a significant benefit of enabling -ftree-vectorize. There may still be some compiler optimizations enabled with -O3 (and not in -O2) that could have a negative side effect, like on floating-point accuracy for example.

boegel avatar Nov 17 '24 11:11 boegel

After analysing the test with valgrind, we noticed an issue about a missing initialisation of a variable inside the test back_spawn_excit_gen.F90, that caused a memory problem at the line you specified. This issue doesn't affect the main program and we will fix it soon on our internal repository.

demetriovilardi avatar Nov 21 '24 10:11 demetriovilardi

@demetriovilardi Can you point us to the details of the fix, so we can apply a patch for this and have a test suite that passes in full?

boegel avatar Nov 21 '24 12:11 boegel

Essentially, inside unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90, the array KPointToBasisFn is not initialised after allocation, in test_back_spawn_excit_gen.F90:66 and test_back_spawn_excit_gen.F90:405. One can add

KPointToBasisFn = -1

one line after allocation in both cases. After this change I get no further valgrind warnings or errors.

demetriovilardi avatar Nov 21 '24 12:11 demetriovilardi

@PetrKralCZ Can you look into making a patch file for NECI based on this info, and see if that fixes the segfaults we were seeing in the test suite?

boegel avatar Nov 21 '24 12:11 boegel