
gslib segfaults on LUMI/Dardel for large cases

Open • vbaconnet opened this issue 7 months ago • 1 comment

As the title says, I have encountered issues running with probes on Dardel (GPU and CPU) and LUMI-G.

Observed behaviour

The simulation freezes and segfaults at fgslib_findpts_setup in global_interpolator.

The only error message dumped is:

srun: error: nid001799: task 175: Segmentation fault (core dumped)

The case is a simple box with constant inflow/outflow. I attach a zip archive with a test case to check reproducibility: case.zip. The case can be run with turboneko.

  • The issue appears on release/0.8 with export ATP_ENABLED=true (from memory, it also happened on develop)
  • The issue appears regardless of the number of probes (tested with 1 and 40,000 probes)
  • This does not happen on smaller cases like rayleigh-benard-cylinder in our examples.
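For anyone trying to reproduce this, a minimal sketch of the environment setup before launching might look as follows. Only `ATP_ENABLED=true` comes from this report; the helper name `setup_debug_env`, the core-dump limit, and the commented launch line are assumptions for illustration and will differ per site and job script.

```shell
#!/bin/sh
# Hypothetical helper (not part of neko): prepare the environment for a debug run.
setup_debug_env() {
    # Setting quoted in this report
    export ATP_ENABLED=true
    # Assumption: try to allow core files; some sites cap this limit
    ulimit -c unlimited 2>/dev/null || echo "warning: could not raise core size limit" >&2
}

setup_debug_env
echo "ATP_ENABLED=$ATP_ENABLED"

# The launch line is an assumption; fill in your own task count, binary and case file:
# srun -n <ntasks> <neko binary> <case file>
```

If the core-size limit cannot be raised, the helper only prints a warning and continues, so the run still goes ahead with ATP enabled.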

Config

On Dardel GPU

Modules:

  • cray Fortran (PrgEnv-cray/8.4.0)
  • cce/16.0.1 (cpe/23.09)
  • rocm/5.7.0
  • craype-accel-amd-gfx90a
  • json-fortran/8.3.0

Configuration: ./configure FC=ftn CC=cc MPIFC=ftn MPICC=cc HIPCC=hipcc --with-hip HIP_HIPCC_FLAGS="-O3 --offload-arch=gfx90a" --enable-device-mpi --with-gslib=$GSLIB --host=x86_64-pc-linux-gnu

LUMI-G

Edit: I can no longer reproduce the segfault on LUMI, but the simulation still freezes for a long time at fgslib_findpts_setup.

  • PrgEnv-cray/8.4.0
  • cce/16.0.1
  • rocm/5.2.3
  • craype-accel-amd-gfx90a
  • json-fortran/8.3.0 (compiled separately)

Configuration: ./configure --with-gslib=$GSLIB FC=ftn CC=cc HIPCC=hipcc MPIFC=ftn MPICC=cc --with-hip HIP_HIPCC_FLAGS="-O3 -x hip --offload-arch=gfx90a" --enable-device-mpi

vbaconnet • Jul 10 '24 15:07