gslib segfaults on LUMI/Dardel for large cases
As the title says, I have encountered issues running with probes on Dardel (GPU and CPU) and LUMI-G.
Observed behaviour
The simulation freezes and then segfaults at fgslib_findpts_setup in global_interpolator.
The only error message printed is:
srun: error: nid001799: task 175: Segmentation fault (core dumped)
The case is a simple box with constant inflow/outflow. I attach a zip archive with a test case to check reproducibility: case.zip. The case can be run with turboneko.
- The issue appears on release/0.8 with export ATP_ENABLED=true (but from memory this was also happening with develop).
- The issue appears regardless of the number of probes (tested with 1 and 40,000 probes).
- This does not happen on smaller cases like rayleigh-benard-cylinder in our examples.
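For anyone trying to reproduce this, here is a rough sketch of how the run environment could be set up to get more than the one-line segfault message. Only ATP_ENABLED=true comes from my actual runs; the binary path, the case file name, and the core-dump limit are placeholders and assumptions, so adjust them to your setup.

```shell
#!/bin/bash
# Sketch of a debugging setup. Only ATP_ENABLED=true is taken from the report;
# everything else below is an assumption for illustration.

# Ask Cray ATP to report a backtrace when a rank terminates abnormally.
export ATP_ENABLED=true

# Allow core files if the site permits it (ignored if it does not).
ulimit -c unlimited 2>/dev/null || true

# Hypothetical names: replace with the real binary and case file.
NEKO_BIN=${NEKO_BIN:-./neko}
CASE_FILE=${CASE_FILE:-box.case}

if command -v srun >/dev/null 2>&1; then
    # Launch through Slurm, as in the failing runs.
    srun "$NEKO_BIN" "$CASE_FILE"
else
    # Outside a Slurm environment, only show what would be launched.
    echo "srun not found; would run: srun $NEKO_BIN $CASE_FILE"
fi
```

If ATP is active, the backtrace should indicate whether the crash is inside gslib itself or in the arguments passed to fgslib_findpts_setup.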
Config
On Dardel GPU
Modules:
- cray Fortran (PrgEnv-cray/8.4.0)
- cce/16.0.1 (cpe/23.09)
- rocm/5.7.0
- craype-accel-amd-gfx90a
- json-fortran/8.3.0
Configuration: ./configure FC=ftn CC=cc MPIFC=ftn MPICC=cc HIPCC=hipcc --with-hip HIP_HIPCC_FLAGS="-O3 --offload-arch=gfx90a" --enable-device-mpi --with-gslib=$GSLIB --host=x86_64-pc-linux-gnu
LUMI-G
Edit: I can no longer reproduce the segfault on LUMI, but the run still freezes for a long time at fgslib_findpts_setup.
- PrgEnv-cray/8.4.0
- cce/16.0.1
- rocm/5.2.3
- craype-accel-amd-gfx90a
- json-fortran/8.3.0 (compiled separately)
Configuration: ./configure --with-gslib=$GSLIB FC=ftn CC=cc HIPCC=hipcc MPIFC=ftn MPICC=cc --with-hip HIP_HIPCC_FLAGS="-O3 -x hip --offload-arch=gfx90a" --enable-device-mpi