[BUG] Memory leaks and crashes on AMD MI300A APU
I could successfully run a reduced version of the magnetised shock problem on CUDA; according to Slurm, it takes about 4 GB of RAM on 2 nodes (8 GPUs) and lasts about 7 minutes. However, running the same problem on MI300A (2 nodes, 8 GPUs), there are severe memory leaks (> 500 GB), on both single and multiple nodes, that crash the run. I attach the err and out files together with shock.txt. I used
cray-hdf5-parallel/1.14.3.1 rocm-6.2.2 modules
with
export HSA_OVERRIDE_GFX_VERSION=9.4.2; export MPICH_GPU_SUPPORT_ENABLED=1
cmake -B build -D pgen=shock -D mpi=ON -D CMAKE_CXX_COMPILER=hipcc -D CMAKE_C_COMPILER=hipcc -D Kokkos_ENABLE_HIP=ON -D Kokkos_ARCH_AMD_GFX942_APU=ON
this might be related to #137, i'm planning to look into this after the 1.3.0 release.
Ok. Will release 1.3.0 include #48? This is absolutely necessary for shock simulations and one of the main reasons I am not yet using Entity for them...
unlikely, implementing #48 is not that hard per se, but i don't know how useful it will be (it might also be quite memory-intensive, depending on the number of bins etc): if you think it's useful for you -- i can look into it (also, feel free to join our slack and/or weekly meetings where we have all these discussions and more!).
the major things planned for 1.3.0 are: #141 (particle tracking), #109 (high-order shape functions, with a method paper coming), and #103 (generalized field stencils for better Cherenkov mitigation). eta is about two weeks from now.
For shock problems, a diagnostic that saves particle momenta as a function of x is extremely useful. If it is not hard for you to implement, I would really like to request it.
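To make the request concrete, here is a minimal sketch of the kind of x vs. momentum binning I have in mind, together with a rough estimate of its memory cost as a function of the number of bins. This is plain Python for illustration only: the function names, arguments, and the 8-bytes-per-bin assumption are hypothetical and are not part of Entity or of #48.

```python
# Hypothetical illustration only: none of these names come from Entity.

def phase_space_histogram(xs, ps, x_range, p_range, nx, n_p):
    """Count particles in an nx-by-n_p grid of (x, momentum) bins."""
    x_min, x_max = x_range
    p_min, p_max = p_range
    dx = (x_max - x_min) / nx
    dp = (p_max - p_min) / n_p
    hist = [[0] * n_p for _ in range(nx)]
    for x, p in zip(xs, ps):
        # Skip particles outside the requested window.
        if not (x_min <= x < x_max and p_min <= p < p_max):
            continue
        # Clamp to the last bin to guard against rounding at the upper edge.
        i = min(int((x - x_min) / dx), nx - 1)
        j = min(int((p - p_min) / dp), n_p - 1)
        hist[i][j] += 1
    return hist


def histogram_bytes(nx, n_p, n_species=1, bytes_per_bin=8):
    """Rough size of one output: one grid per species (assumed 8 bytes per bin)."""
    return nx * n_p * n_species * bytes_per_bin
```

Under these assumptions, 1024 bins in x, 256 bins in momentum, and 2 species would cost 1024 * 256 * 2 * 8 bytes = 4 MiB per output for each momentum component, so the cost scales with the product of the bin counts rather than with the number of particles.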
I would love to join the Slack channel and weekly meeting, if you could send me the links. Do you paste it here or send it by email?
(the instructions are in the README) Basically, @Tissot11 just send me an email, and I'll add you to the slack using your email. It's open for everyone, but since it's paid per-user, we can't post the link online.
Just sent you an email.