cuSOLVERMp hangs on mppotrs when running on a subset of nodes

Open s769 opened this issue 1 year ago • 0 comments

I was testing the mp_potrf_potrs example (with fixed SPD matrix generation code) on several configurations on Perlmutter. When I request 1 node (4 GPUs), running

srun -u -n 4 --gpus-per-node 4 ./mp_potrf_potrs -p 4 -q 1 -ia 1 -ja 1 -ib 1 -jb 1 -mbA 2500 -nbA 2500 -mbB 2500 -nbB 2500 -n 10000

works fine. However, if I request 2 nodes and run the same command, it hangs after the potrf step (i.e., potrf completes successfully, but potrs hangs). I also tried running srun -n 8 ... (keeping -p 4 -q 1), but this appears to hang at the scatter from host to device. If I decrease n to 1000 (and the tile sizes to 250), the code runs successfully with srun -n 4 .... I don't think it's an out-of-memory issue, though, since the same code with n=10000 runs when I request only one node.
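As a rough check on the out-of-memory reasoning, the per-rank footprint of the distributed matrix can be estimated from the 2D block-cyclic layout. The sketch below is not taken from the sample: numroc is a Python port of the standard ScaLAPACK NUMROC formula, local_bytes is a hypothetical helper, and the 8-byte element size assumes double precision. It ignores the right-hand side and any cuSOLVERMp workspace, and it does not say how the sample treats ranks that fall outside the p x q grid.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows (or columns) of an n-long dimension, split into
    blocks of size nb, owned by process iproc (ScaLAPACK NUMROC formula)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                    # number of full blocks
    count = (nblocks // nprocs) * nb     # full rounds every process receives
    extra = nblocks % nprocs             # full blocks left after those rounds
    if mydist < extra:
        count += nb                      # one extra full block
    elif mydist == extra:
        count += n % nb                  # the trailing partial block, if any
    return count


def local_bytes(n, mb, nb, p, q, elem_size=8):
    """Largest local footprint (bytes) of an n x n matrix on a p x q grid.
    elem_size=8 assumes double precision (an assumption, not from the issue)."""
    rows = max(numroc(n, mb, r, 0, p) for r in range(p))
    cols = max(numroc(n, nb, c, 0, q) for c in range(q))
    return rows * cols * elem_size


# Values from the command line in this issue: -n 10000, 2500 x 2500 tiles, -p 4 -q 1
n, mb, nb, p, q = 10000, 2500, 2500, 4, 1
a_bytes = local_bytes(n, mb, nb, p, q)
print(f"local A per rank: {a_bytes / 1e6:.0f} MB")

# Compare the number of launched ranks with the size of the process grid
for ranks in (4, 8):
    print(f"srun -n {ranks}: grid uses {p * q} ranks, {ranks - p * q} outside the grid")
```

Under these assumptions each rank holds a 2500 x 10000 slice of A, about 200 MB, which is small next to the memory of a single A100. That is consistent with the n=10000 case running on one node. The second loop only shows that srun -n 8 with -p 4 -q 1 launches four more ranks than the grid contains.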

It's also interesting that the potrf completes but the potrs hangs; not sure what could be causing that.

I'll keep trying different configurations; please let me know if you would like any log output.

Module list:

1) craype-x86-milan
2) libfabric/1.15.2.0
3) craype-network-ofi
4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
5) perftools-base/23.12.0
6) cpe/23.12
7) craype-accel-nvidia80
8) gpu/1.0
9) craype/2.7.30 (c)
10) cray-dsmml/0.2.2
11) cray-libsci/23.12.5 (math)
12) PrgEnv-nvidia/8.5.0 (cpe)
13) cray-hdf5-parallel/1.12.2.3 (io)
14) conda/Miniconda3-py311_23.11.0-2
15) evp-patch
16) python/3.11 (dev)
17) cudatoolkit/12.4 (g)
18) nvidia/24.5 (g,c)
19) cray-mpich/8.1.28 (mpi)

Nov 15 '24 03:11 s769