JDFTx Fails with MPI_Abort under all conditions on Polaris
All of the jobs are failing with the following error:
MPICH ERROR [Rank 1] [job id f1ce81c3-6a27-482f-bc31-942246dcb469] [Wed Jun 5 19:47:53 2024] [x3109c0s19b1n0] - Abort(1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
MPICH ERROR [Rank 3] [job id f1ce81c3-6a27-482f-bc31-942246dcb469] [Wed Jun 5 19:47:53 2024] [x3109c0s19b1n0] - Abort(1) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
x3109c0s19b1n0.hsn.cm.polaris.alcf.anl.gov: rank 3 exited with code 255
x3109c0s19b1n0.hsn.cm.polaris.alcf.anl.gov: rank 1 died from signal 15
Here is the stack trace that is dumped:
cbu@polaris-login-04:~/dft_out/FeN4C10/clean> cat jdftx-stacktrace
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z10printStackb+0x27) [0x147e12f863a7]
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z14stackTraceExiti+0xd) [0x147e12f86acd]
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z8choleskyRK6matrixb+0x372) [0x147e12f94d72]
/home/cbu/jdftx/build/libjdftx_gpu.so(_Z11orthoMatrixRK6matrix+0xdc) [0x147e12f9549c]
/home/cbu/jdftx/build/libjdftx_gpu.so(_ZN8ElecVars14orthonormalizeEiP6matrix+0x141) [0x147e131323c1]
/home/cbu/jdftx/build/libjdftx_gpu.so(_ZN16LatticeMinimizer4stepERK15LatticeGradientd+0xcb6) [0x147e1322d1f6]
/home/cbu/jdftx/build/libjdftx_gpu.so(_ZN16LatticeMinimizer8minimizeERK14MinimizeParams+0x64) [0x147e1322e464]
/home/cbu/jdftx/build/jdftx_gpu() [0x40868e]
/lib64/libc.so.6(__libc_start_main+0xef) [0x147e02a3e24d]
/home/cbu/jdftx/build/jdftx_gpu() [0x407ffa]
cbu@polaris-login-04:~/dft_out/FeN4C10/clean>
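
Reading the trace from the bottom up, the mangled frames demangle to `LatticeMinimizer::minimize(MinimizeParams const&)`, `LatticeMinimizer::step(LatticeGradient const&, double)`, `ElecVars::orthonormalize(int, matrix*)`, `orthoMatrix(matrix const&)` and `cholesky(matrix const&, bool)`, so the abort appears to be raised inside a Cholesky factorization performed while orthonormalizing the wavefunctions during a lattice minimization step. A Cholesky factorization generally fails when its input is not positive definite, for example when it contains NaN values or the vectors are linearly dependent. The sketch below only illustrates that failure mode and is not JDFTx's implementation: it uses real-valued vectors in pure Python, and the helper names `cholesky_lower`, `overlap` and `can_orthonormalize` are hypothetical.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L with L * L^T == a.

    Raises ValueError if `a` is not positive definite (including NaN entries).
    Illustrative only: real symmetric input, no pivoting, no complex support.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(lower[j][k] ** 2 for k in range(j))
        # `not pivot > 0.0` is also true for NaN, so NaN inputs are rejected.
        if not pivot > 0.0:
            raise ValueError(f"non-positive pivot at column {j}: {pivot}")
        lower[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = (a[i][j] - partial) / lower[j][j]
    return lower


def overlap(vectors):
    """Gram (overlap) matrix of a list of real vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]


def can_orthonormalize(vectors):
    """True if the overlap matrix admits a Cholesky factorization."""
    try:
        cholesky_lower(overlap(vectors))
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print(can_orthonormalize([[1.0, 0.0], [1.0, 1.0]]))           # independent vectors
    print(can_orthonormalize([[1.0, 0.0], [2.0, 0.0]]))           # linearly dependent
    print(can_orthonormalize([[float("nan"), 0.0], [0.0, 1.0]]))  # NaN in the input
```

If the same mechanism is at work here, the quantity to inspect would be whatever feeds the orthonormalization at the failing lattice step (for instance NaN values in the wavefunctions), rather than the MPI layer itself, since `MPI_Abort` is only how the failure is reported.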