Aditya Nishtala
LAMMPS heavily uses alltoallv during setup: setting up the atoms, creating bonds, and building neighbor lists. Here is the timing info on those alltoallv calls. The first table below is time (in...
I tried out MPIR_CVAR_ALLTOALLV_PAIRWISE_NEW=1 on mpich/opt/5.0.0.aurora_test.06f012a. Even after 40 minutes, the run at 64 nodes with a 1 KB message size did not complete. After disabling nohz_full on 64 specific nodes, osu_alltoallv...
What I found is that on Sunspot there is no issue with alltoallv; it works just fine. The issue pops up on Aurora, and only on Aurora, even though Sunspot uses the...
Forgot to add: currently on Aurora next-eval there are 2 MPICH versions available, mpich/opt/5.0.0.aurora_test.06f012a (the default loaded) and mpich/opt/4.3.1. Right now LAMMPS sow only retains its performance and hits its...
Today we came across 64 "good" nodes that exhibit the alltoallv problem to a significantly lesser degree. The good nodes are `x4006c2s3b0n0 x4006c2s4b0n0 x4006c2s5b0n0 x4006c2s6b0n0 x4006c2s7b0n0 x4006c3s0b0n0 x4006c3s1b0n0 x4006c3s2b0n0 x4006c3s3b0n0...
As per my previous comment, both sets of 64 nodes are from next-eval, and both have nohz_full on all nodes within the set. This means that even though turning...
> MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 alleviates the issue, then it is the CPU/NIC memory contention

I don't believe this is a hardware related issue. On Sunspot and Borealis this problem doesn't exist and...