[BUG] Memory allocation error from Kokkos random number generator

Open LudwigBoess opened this issue 5 months ago • 0 comments

describe the bug When running on 16 or more nodes on Aurora/SuperMUC-NG2 (Intel PVC nodes), entity crashes with the error:

FATAL : Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 128 MiB (label="Kokkos::Random_XorShift1024::state").
FATAL : see the `*.err` file for more details

This happens even though there should be plenty of free memory available. @haykh mentioned that this is caused by the Kokkos random number generator. This issue is meant to track that problem. We should either figure this out with the Kokkos developers or replace the random number generator.
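For scale, the 128 MiB in the error message can be translated into a number of generator states. This is only a back-of-the-envelope sketch: it uses the fact that a xorshift1024 generator carries 1024 bits of state (16 words of 64 bits), and it assumes the allocation labelled `Kokkos::Random_XorShift1024::state` holds nothing but those words. The names `kWordsPerState`, `kBytesPerState`, `kMiB` and `states_in_allocation` are made up for this illustration and are not part of Kokkos or entity.

```cpp
#include <cstddef>
#include <cstdint>

// A xorshift1024 generator carries 1024 bits of state: 16 words of 64 bits.
constexpr std::size_t kWordsPerState = 16;
constexpr std::size_t kBytesPerState = kWordsPerState * sizeof(std::uint64_t);

constexpr std::size_t kMiB = 1024 * 1024;

// Number of generator states that fit in an allocation of `bytes` bytes,
// ASSUMING the allocation contains only the state words (no padding,
// no extra bookkeeping). Hypothetical helper, not a Kokkos function.
constexpr std::size_t states_in_allocation(std::size_t bytes) {
  return bytes / kBytesPerState;
}

// Compile-time checks of the arithmetic.
static_assert(kBytesPerState == 128, "16 words x 8 bytes per word");
static_assert(states_in_allocation(128 * kMiB) == 1048576,
              "128 MiB / 128 B = 2^20 states");
```

Under that assumption the failed request corresponds to about one million generator states (2^20), which may be a useful number to quote when raising this with the Kokkos developers.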

code version 1.3.0rc on hash: 4ab9bf3f450f374c67659527bab2ef97d62b73b5

compiler/library versions Intel compiler with MPI: IntelLLVM 2025.2.0 (oneapi_2025.2.0/mpi/2021.16). Kokkos: 4.6.02 and 4.7.00 both show this problem.

cmake configuration command On SuperMUC-NG2: cmake -B build -D pgen=streaming -D precision=single -D mpi=ON -D output=OFF -Dgpu_aware_mpi=OFF -DCMAKE_C_COMPILER=mpiicx -DCMAKE_CXX_COMPILER=mpiicpx

LudwigBoess, Aug 14 '25 20:08