
[BUG] Memory leaks and crashes on AMD MI300A APU

Open Tissot11 opened this issue 2 months ago • 6 comments

I could successfully run a reduced version of the magnetised shock problem on CUDA, which takes about 4 GB of RAM on 2 nodes (8 GPUs) (according to Slurm) and lasts about 7 minutes. However, when running the same problem on MI300A (2 nodes, 8 GPUs), there are severe memory leaks (> 500 GB) on both single and multiple nodes, leading to a crash of the run. I attach the err and out files together with shock.txt. I used

cray-hdf5-parallel/1.14.3.1 rocm-6.2.2 modules

with

export HSA_OVERRIDE_GFX_VERSION=9.4.2; export MPICH_GPU_SUPPORT_ENABLED=1

cmake -B build -D pgen=shock -D mpi=ON -D CMAKE_CXX_COMPILER=hipcc -D CMAKE_C_COMPILER=hipcc -D Kokkos_ENABLE_HIP=ON -D Kokkos_ARCH_AMD_GFX942_APU=ON

errEntity.txt outEntity.txt

shock.txt

Tissot11 avatar Oct 27 '25 16:10 Tissot11

this might be related to #137, i'm planning to look into this after the 1.3.0 release.

haykh avatar Oct 29 '25 06:10 haykh

Ok. Will release 1.3.0 include #48? This is absolutely necessary for shock simulations and one of the main reasons I am not using Entity for shock simulations yet...

Tissot11 avatar Oct 29 '25 15:10 Tissot11

unlikely, implementing #48 is not that hard per se, but i don't know how useful it will be (it might also be quite memory-intensive, depending on the number of bins etc): if you think it's useful for you -- i can look into it (also, feel free to join our slack and/or weekly meetings where we have all these discussions and more!).
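To put a rough number on the "memory-intensive, depending on the number of bins" concern, here is a back-of-the-envelope sketch. It assumes #48 would store a dense 2D (position, momentum) histogram per momentum component and per species in double precision; that layout, the function name `phase_space_bytes`, and all default values are illustrative assumptions, not Entity's actual design.

```python
def phase_space_bytes(nx_bins, np_bins, n_components=3, n_species=2, bytes_per_bin=8):
    """Estimate the size in bytes of dense x-p histograms for one output step.

    Hypothetical layout: one (nx_bins x np_bins) array per momentum
    component and per species, each bin stored as a 64-bit float.
    """
    return nx_bins * np_bins * n_components * n_species * bytes_per_bin


# Example: 4096 bins along x, 512 momentum bins, 3 components, 2 species.
size = phase_space_bytes(4096, 512)
print(f"{size / 2**20:.1f} MiB per output step")
```

Under these assumptions the cost scales linearly with each bin count, so doubling the resolution in both x and momentum quadruples the footprint.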

the major things planned for 1.3.0 are: #141 (particle tracking), #109 (high-order shape functions, with a method paper coming), and #103 (generalized field stencils for better Cherenkov mitigation). eta is something like the next two weeks.

haykh avatar Oct 29 '25 21:10 haykh

For shock problems, a diagnostic that saves particle momenta together with x is extremely useful. If this is not hard for you to implement, I would really like to request it.
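As an illustration of the kind of diagnostic meant here, the sketch below bins particles into an x-px phase-space histogram in post-processing with NumPy. It is not Entity's API: the function name `x_px_histogram`, the ranges, the bin counts, and the synthetic particle data are all assumptions made for the example.

```python
import numpy as np


def x_px_histogram(x, px, weights=None, x_range=(0.0, 1.0),
                   p_range=(-1.0, 1.0), bins=(64, 32)):
    """Bin particles into a 2D (x, px) phase-space histogram.

    Returns the counts (or summed weights) per bin plus the bin edges.
    Particles outside x_range or p_range are dropped by np.histogram2d.
    """
    hist, x_edges, p_edges = np.histogram2d(
        x, px, bins=bins, range=[x_range, p_range], weights=weights
    )
    return hist, x_edges, p_edges


# Synthetic stand-in for particle data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n_particles = 10_000
x = rng.uniform(0.0, 1.0, n_particles)
px = rng.normal(0.0, 0.2, n_particles)

hist, x_edges, p_edges = x_px_histogram(x, px)
print(hist.shape)  # one row per x bin, one column per px bin
```

Storing only the binned array, rather than every particle's position and momentum, is what keeps such a diagnostic affordable for long shock runs.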

I would love to join the Slack channel and weekly meetings, if you could send me the links. Would you paste them here or send them by email?

Tissot11 avatar Nov 03 '25 14:11 Tissot11

(the instructions are in the README) Basically, @Tissot11 just send me an email, and I'll add you to the slack using your email. It's open for everyone, but since it's paid per-user, we can't post the link online.

haykh avatar Nov 04 '25 06:11 haykh

Just sent you an email.

Tissot11 avatar Nov 05 '25 11:11 Tissot11