[BUG] Memory allocation error from Kokkos random number generator
describe the bug
When running on 16 or more nodes on Aurora/SuperMUC-NG2 (Intel PVC nodes), entity crashes with the error:
FATAL : Kokkos ERROR: SYCLDeviceUSM memory space failed to allocate 128 MiB (label="Kokkos::Random_XorShift1024::state").
FATAL : see the `*.err` file for more details
This happens even though plenty of free memory should be available.
@haykh mentioned that this is caused by the Kokkos random number generator.
This issue is meant to track that problem. We should either work out the cause with the Kokkos developers or replace the random number generator.
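For context on the size of the failed request, here is a rough back-of-the-envelope sketch of how many generator states a 128 MiB buffer would hold. It assumes each XorShift1024 state is exactly 1024 bits (16 words of 64 bits); the names `kWordsPerState`, `mib_to_bytes`, and `implied_state_count` are hypothetical helpers for this illustration and not part of Kokkos, and the real Kokkos pool may keep additional per-state bookkeeping.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: one XorShift1024 state is 1024 bits = 16 x 64-bit words.
// This is inferred from the generator's name, not read from the Kokkos source.
constexpr std::size_t kWordsPerState = 16;
constexpr std::size_t kBytesPerState = kWordsPerState * sizeof(std::uint64_t);

// Convert a size in MiB to bytes.
constexpr std::size_t mib_to_bytes(std::size_t mib) {
  return mib * 1024 * 1024;
}

// Number of whole generator states that fit into an allocation of the given size.
constexpr std::size_t implied_state_count(std::size_t allocation_bytes) {
  return allocation_bytes / kBytesPerState;
}

// Compile-time checks of the arithmetic for the 128 MiB request in the error message.
static_assert(kBytesPerState == 128, "16 words of 8 bytes each");
static_assert(implied_state_count(mib_to_bytes(128)) == 1048576,
              "128 MiB / 128 B per state = 2^20 states");
```

Under that assumption, the failing allocation corresponds to about one million generator states per rank, which is a modest request and supports the suspicion that the failure is not a plain out-of-memory condition.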
code version
1.3.0rc on hash: 4ab9bf3f450f374c67659527bab2ef97d62b73b5
compiler/library versions
Intel compiler with MPI: IntelLLVM 2025.2.0 (oneapi_2025.2.0/mpi/2021.16)
Kokkos: 4.6.02 and 4.7.00; both show this problem.
cmake configuration command
On SuperMUC-NG2:
cmake -B build -D pgen=streaming -D precision=single -D mpi=ON -D output=OFF -Dgpu_aware_mpi=OFF -DCMAKE_C_COMPILER=mpiicx -DCMAKE_CXX_COMPILER=mpiicpx