
Segfault with test files run against AMD GPUs using 2025.1

[Open] cdm-work opened this issue · 3 comments

While attempting to update our internal copy of CP2K to 'v2025.1', I am finding that our AMD GPU build segfaults when run against a few of the test files, specifically on GFX90A and GFX942 models. The same tests run with the 'v2024.3' version work, so this appears to be a regression. In this specific instance, I am using ROCm 6.1.0.

The test files which are triggering this specific segfault are:

  • tests/SE/regtest-2-1/md.inp
  • tests/QS/regtest-ot-1/H2O-print.inp
  • tests/QS/regtest-gpw-3/NO2-mulliken.inp
  • tests/QS/regtest-gpw-4/H2O-debug-2.inp

Here are the contents of the 'arch' file used for AMD GPU builds:

# This expects the 'rocm' module to be loaded for the '${ROCM_PATH}'
# variable to be set as expected.

OFFLOAD_CC     = hipcc
CC             = cc
CXX            = CC
FC             = ftn
LD             = ftn
AR             = ar -r
GPUVER         = Mi250
HIPARCH        = <GPU_MODEL>

DFLAGS         = -D__OFFLOAD_HIP -D__GRID_HIP -D__HIP_PLATFORM_AMD__ \
                 -D__FFTW3 -D__parallel -D__SCALAPACK -D__ACC

#
WFLAGS      =
#
FCDEBFLAGS  =
CFLAGS         = -fno-omit-frame-pointer -fopenmp -g -mtune=native  \
                 $(PROFOPT)  -std=c11 -Wall -Wextra -Werror \
                 -Wno-vla-parameter -Wno-deprecated-declarations \
		 -Wno-implicit-function-declaration \
                 $(DFLAGS) -I${ROCM_PATH}/include
CXXFLAGS       =  --std=c++14 $(DFLAGS) -Wno-deprecated-declarations \
                  -fopenmp -Wall -Wextra -Werror -I${ROCM_PATH}/include
FCFLAGS        = -fno-omit-frame-pointer -fopenmp -funroll-loops \
                 -ftree-vectorize -ffree-form -ffree-line-length-512 \
                 -g -mtune=native $(PROFOPT)  $(FCDEBFLAGS) $(WFLAGS) \
                 $(DFLAGS) ${allow_arg_mismatch_flag}
LDFLAGS        = $(FCFLAGS) -Wl,--enable-new-dtags -L'${ROCM_PATH}/lib' \
                 -Wl,-rpath,'${ROCM_PATH}/lib'
OFFLOAD_FLAGS  = -g --offload-arch=${HIPARCH} -munsafe-fp-atomics \
                 -fopenmp -m64 -pthread \
                 -fPIC -O3 --std=c++11 -Wall -Wextra -Werror \
                 $(DFLAGS) -I${ROCM_PATH}/include
OFFLOAD_TARGET = hip
LIBS           = -lfftw3 -lfftw3_threads -lhipfft
LIBS          += -lstdc++ -lamdhip64

To build the code, I run the following using a Cray Programming Environment:

  1. For the purposes of these instructions, I'll assume the arch file is called 'amd_gfx.psmp', has the contents shared above, and is in the 'arch/' directory.
  2. Run: module load cray-fftw
  3. Also run either module load craype-accel-amd-gfx90a or module load craype-accel-amd-gfx942 depending on the GPU model.
  4. Make sure to have PrgEnv-gnu loaded.
  5. Either run module load rocm or make sure 'ROCM_PATH' is set appropriately.
  6. Set: export allow_arg_mismatch_flag="-fallow-argument-mismatch"
  7. Set: HIP_MODEL=gfx90a or HIP_MODEL=gfx942 depending on the GPU model.
  8. Run: sed -ie -e "s/<GPU_MODEL>/${HIP_MODEL}/" arch/amd_gfx.psmp
  9. Proceed to build it: CC=cc CXX=CC make -j 4 ARCH=amd_gfx VERSION=psmp
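As a sanity check of the placeholder substitution in step 8, the same pattern can be exercised on a throwaway stand-in file (the path and contents below are made up for illustration; the real target is 'arch/amd_gfx.psmp', and the plain GNU 'sed -i -e' spelling is used here):

```shell
# Stand-in for the arch file; only the lines relevant to the substitution.
printf 'GPUVER   = Mi250\nHIPARCH  = <GPU_MODEL>\n' > /tmp/amd_gfx_test.psmp

# Step 7: pick the GPU model.
HIP_MODEL=gfx90a

# Step 8: replace the <GPU_MODEL> placeholder in place.
sed -i -e "s/<GPU_MODEL>/${HIP_MODEL}/" /tmp/amd_gfx_test.psmp

grep HIPARCH /tmp/amd_gfx_test.psmp   # HIPARCH  = gfx90a
```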

I will also state that for the 'v2025.1' build, I am using the 'v2.8.0' tagged version of 'dbcsr'; with the 'v2024.3' build, we use the 'v2.7.0' tagged version of 'dbcsr'.

When I run 'tests/SE/regtest-2-1/md.inp' with 2 MPI tasks, this is the output I get at the end; the output is the same whether one node is used or two (each task prints its own backtrace, so the frames below appear twice, interleaved):

...
 PW_GRID| Information for grid number                                          2
 PW_GRID| Grid distributed over                                     2 processors
 PW_GRID| Real space group dimensions                                     2    1
 PW_GRID| the grid is blocked:                                                NO
 PW_GRID| Cutoff [a.u.]                                                      1.5
 PW_GRID| spherical cutoff:                                                  YES
 PW_GRID| Grid points within cutoff                                         4547
 PW_GRID|   Bounds   1            -13      13                Points:          27
 PW_GRID|   Bounds   2            -13      13                Points:          27
 PW_GRID|   Bounds   3            -13      13                Points:          27
 PW_GRID| Volume element (a.u.^3)   5.357         Volume (a.u.^3)    105442.7279
 PW_GRID| Grid span                                                    HALFSPACE
 PW_GRID|   Distribution                         Average         Max         Min
 PW_GRID|   G-Vectors                             2273.5        2284        2263
 PW_GRID|   G-Rays                                 258.5         259         258
 PW_GRID|   Real Space Points                     9841.5       10206        9477

 RS_GRID| Information for grid number                                          2
 RS_GRID|   Bounds   1            -13      13                Points:          27
 RS_GRID|   Bounds   2            -13      13                Points:          27
 RS_GRID|   Bounds   3            -13      13                Points:          27
 RS_GRID| Real space fully replicated
 RS_GRID| Group size                                                           1

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fc108b732e2 in ???
#1  0x7fc108b72475 in ???
#2  0x7fc10883297f in ???
#0  0x7f1a3313d2e2 in ???
#1  0x7f1a3313c475 in ???
#2  0x7f1a32dfc97f in ???
#3  0x15c3f77 in __eeq_method_MOD_eeq_solver
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/eeq_method.F:807
#3  0x15c3f77 in __eeq_method_MOD_eeq_solver
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/eeq_method.F:807
#4  0x15cd1ef in __eeq_method_MOD_eeq_charges
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/eeq_method.F:296
#5  0x15cdac7 in __eeq_method_MOD_eeq_print
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/eeq_method.F:143
#4  0x15cd1ef in __eeq_method_MOD_eeq_charges
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/eeq_method.F:296
#5  0x15cdac7 in __eeq_method_MOD_eeq_print
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/eeq_method.F:143
#6  0x1b6a134 in qs_scf_post_charges
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf_post_se.F:569
#6  0x1b6a134 in qs_scf_post_charges
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf_post_se.F:569
#7  0x1b6f5c5 in __qs_scf_post_se_MOD_scf_post_calculation_se
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf_post_se.F:129
#7  0x1b6f5c5 in __qs_scf_post_se_MOD_scf_post_calculation_se
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf_post_se.F:129
#8  0x1ae7b57 in __qs_scf_post_scf_MOD_qs_scf_compute_properties
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf_post_scf.F:57
#8  0x1ae7b57 in __qs_scf_post_scf_MOD_qs_scf_compute_properties
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf_post_scf.F:57
#9  0x23bab86 in __qs_scf_MOD_scf
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf.F:299
#9  0x23bab86 in __qs_scf_MOD_scf
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_scf.F:299
#10  0x15e81d3 in __qs_energy_MOD_qs_energies
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_energy.F:126
#10  0x15e81d3 in __qs_energy_MOD_qs_energies
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_energy.F:126
#11  0x128b0e8 in qs_forces
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_force.F:200
#12  0x128efd2 in __qs_force_MOD_qs_calc_energy_force
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_force.F:112
#11  0x128b0e8 in qs_forces
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_force.F:200
#12  0x128efd2 in __qs_force_MOD_qs_calc_energy_force
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/qs_force.F:112
#13  0x91b6ef in __force_env_methods_MOD_force_env_calc_energy_force
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/force_env_methods.F:260
#13  0x91b6ef in __force_env_methods_MOD_force_env_calc_energy_force
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/force_env_methods.F:260
#14  0x413a8b in cp2k_run
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k_runs.F:329
#15  0x40e5c4 in __cp2k_runs_MOD_run_input
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k_runs.F:935
#14  0x413a8b in cp2k_run
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k_runs.F:329
#15  0x40e5c4 in __cp2k_runs_MOD_run_input
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k_runs.F:935
#16  0x40e140 in cp2k
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k.F:379
#16  0x40e140 in cp2k
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k.F:379
#17  0x40e282 in main
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k.F:44
#17  0x40e282 in main
	at /cp2k/TMP/BUILD/cray-test-cp2k-2025.1.0/src/start/cp2k.F:44
srun: error: nid001044: task 0: Segmentation fault
srun: Terminating StepId=1511673.0
slurmstepd: error: *** STEP 1511673.0 ON nid001044 CANCELLED AT 2025-01-16T14:09:17 ***
srun: error: nid001044: task 1: Segmentation fault (core dumped)

Here is a backtrace from the 'core' file using 'gdb':

(gdb) bt
#0  0x000000000149030b in eeq_method::eeq_solver (charges=..., lambda=0, eeq_energy=0,
    particle_set=..., kind_of=..., cell=0x6b3e280, chia=..., gam=..., gab=...,
    para_env=0x68a7a10, blacs_env=0x6ba3570, dft_control=0x6ba5420, eeq_sparam=...,
    totalcharge=0, ewald=.TRUE., ewald_env=0x9446c10, ewald_pw=0x6b7b390, iounit=1)
    at /lus/cflus03/cminear/src/cp2k/src/eeq_method.F:807
#1  0x00000000014994f6 in eeq_method::eeq_charges (qs_env=0x6b9d7f0, charges=...,
    eeq_sparam=..., eeq_model=2, enshift_type=1)
    at /lus/cflus03/cminear/src/cp2k/src/eeq_method.F:296
#2  0x0000000001499db4 in eeq_method::eeq_print (qs_env=0x6b9d7f0, iounit=-1, print_level=1)
    at /lus/cflus03/cminear/src/cp2k/src/eeq_method.F:143
#3  0x000000000243f2da in qs_scf_post_se::qs_scf_post_charges (input=0x6880b60,
    logger=0x68a82d0, qs_env=0x6b9d7f0, rho=0x48d9790, para_env=0x68a7a10)
    at /lus/cflus03/cminear/src/cp2k/src/qs_scf_post_se.F:569
#4  0x0000000002444403 in qs_scf_post_se::scf_post_calculation_se (qs_env=0x6b9d7f0)
    at /lus/cflus03/cminear/src/cp2k/src/qs_scf_post_se.F:129
#5  0x0000000002361aea in qs_scf_post_scf::qs_scf_compute_properties (qs_env=0x6b9d7f0,
    wf_type=<error reading variable: Cannot access memory at address 0x0>,
    do_mp2=<error reading variable: Cannot access memory at address 0x0>, _wf_type=0)
    at /lus/cflus03/cminear/src/cp2k/src/qs_scf_post_scf.F:57
#6  0x0000000001d19205 in qs_scf::scf (qs_env=0x6b9d7f0,
    has_converged=<error reading variable: Cannot access memory at address 0x0>,
    total_scf_steps=<error reading variable: Cannot access memory at address 0x0>)
    at /lus/cflus03/cminear/src/cp2k/src/qs_scf.F:299
#7  0x0000000000a606ab in qs_energy::qs_energies (qs_env=0x6b9d7f0,
    consistent_energies=<error reading variable: Cannot access memory at address 0x0>,
    calc_forces=.TRUE.) at /lus/cflus03/cminear/src/cp2k/src/qs_energy.F:126
#8  0x00000000019efe2f in qs_force::qs_forces (qs_env=0x6b9d7f0)
    at /lus/cflus03/cminear/src/cp2k/src/qs_force.F:200
#9  0x00000000019f3d07 in qs_force::qs_calc_energy_force (qs_env=0x6b9d7f0,
    calc_force=.TRUE., consistent_energies=.FALSE., linres=.FALSE.)
    at /lus/cflus03/cminear/src/cp2k/src/qs_force.F:112
#10 0x00000000013c72c3 in force_env_methods::force_env_calc_energy_force (
    force_env=0x6d26440, calc_force=.TRUE.,
    consistent_energies=<error reading variable: Cannot access memory at address 0x0>,
    skip_external_control=<error reading variable: Cannot access memory at address 0x0>,
    eval_energy_forces=<error reading variable: Cannot access memory at address 0x0>,
--Type <RET> for more, q to quit, c to continue without paging--
    require_consistent_energy_force=<error reading variable: Cannot access memory at address 0x0>, linres=<error reading variable: Cannot access memory at address 0x0>,
    calc_stress_tensor=<error reading variable: Cannot access memory at address 0x0>)
    at /lus/cflus03/cminear/src/cp2k/src/force_env_methods.F:260
#11 0x0000000000436c63 in cp2k_runs::cp2k_run (input_declaration=0x4aafa40,
    input_file_name=..., output_unit=-1, mpi_comm=..., initial_variables=...,
    _input_file_name=1024, _initial_variables=1024)
    at /lus/cflus03/cminear/src/cp2k/src/start/cp2k_runs.F:329
#12 0x00000000004317de in cp2k_runs::run_input (input_declaration=0x4aafa40,
    input_file_path=..., output_file_path=..., initial_variables=...,
    mpi_comm=<error reading variable: Cannot access memory at address 0x0>,
    _input_file_path=1024, _output_file_path=1024, _initial_variables=1024)
    at /lus/cflus03/cminear/src/cp2k/src/start/cp2k_runs.F:935
#13 0x000000000040e0fd in cp2k () at /lus/cflus03/cminear/src/cp2k/src/start/cp2k.F:379
#14 0x000000000040e241 in main (argc=2, argv=0x7ffe3d02d31f)
    at /lus/cflus03/cminear/src/cp2k/src/start/cp2k.F:44
#15 0x00007fb465e03eec in __libc_start_call_main () from /lib64/libc.so.6
#16 0x00007fb465e03fb5 in __libc_start_main_impl () from /lib64/libc.so.6
#17 0x000000000040b9d1 in _start () at ../sysdeps/x86_64/start.S:115

To reiterate, if the same test is run using the 'v2024.3' version, the test works as expected.

cdm-work commented on Jan 16 '25

Out of curiosity, what happens if you do a "CPU-only" run? From the output, it seems the error comes from the PW (FFTW) part, so maybe you can try to disable its offload (-D__NO_OFFLOAD_PW). Is there any reason why you are not using DBCSR acceleration, i.e. -D__DBCSR_ACC? Note that -D__ACC is not used anymore, as far as I know... BTW, are you running on an HPE Cray system? If so, I would suggest not using -mtune=native...
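For clarity, the change I mean is in the DFLAGS block of your arch file, something like this (untested, just to illustrate the flag swap and the PW-offload opt-out):

```make
DFLAGS         = -D__OFFLOAD_HIP -D__GRID_HIP -D__HIP_PLATFORM_AMD__ \
                 -D__FFTW3 -D__parallel -D__SCALAPACK -D__DBCSR_ACC \
                 -D__NO_OFFLOAD_PW
```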

alazzaro commented on Jan 24 '25

The CPU-only versions of the 4 tests identified work with both the 'v2024.3' and 'v2025.1' versions. I will also note that our Nvidia GPU tests also work with both versions.

I can try adding -D__NO_OFFLOAD_PW. But the original case would still seem to be a regression, as it was working in 'v2024.3' without this option (unless the option is new).

I am working with an HPE Cray system. I will investigate dropping the -mtune=native option. (I'm surprised I missed that, since I try to avoid it in other builds.) I will also investigate switching to -D__DBCSR_ACC. I assume that when I first set up CP2K builds for AMD GPUs, I found that -D__ACC worked. Admittedly, I tend not to check whether new build options are available or preferred as long as a new version builds with the existing 'arch' file.

cdm-work commented on Jan 24 '25

The same tests are still failing in the same way with -D__NO_OFFLOAD_PW in place. I also removed -mtune=native and replaced -D__ACC with -D__DBCSR_ACC; I had to add -lhipblas to the 'LIBS += ...' line to get the compilation to complete successfully.
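For reference, the arch-file lines affected by these changes now look roughly like this (reconstructed from the edits just described, so treat the exact layout as approximate):

```make
DFLAGS         = -D__OFFLOAD_HIP -D__GRID_HIP -D__HIP_PLATFORM_AMD__ \
                 -D__FFTW3 -D__parallel -D__SCALAPACK -D__DBCSR_ACC \
                 -D__NO_OFFLOAD_PW

LIBS           = -lfftw3 -lfftw3_threads -lhipfft
LIBS          += -lstdc++ -lamdhip64 -lhipblas
```

The -mtune=native entries were simply deleted from CFLAGS and FCFLAGS.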

cdm-work commented on Jan 24 '25