emit better error messages when disk is full and EAF writes fail

Open yurivict opened this issue 5 years ago • 9 comments

This inp file:

title "x"
echo
geometry units angstroms
C -0.76198 1.17875 -0.00473
C 0.63084 1.25353 -0.00749
C 1.39201 0.08470 -0.00938
C 0.76036 -1.15891 -0.00853
C -0.63246 -1.23369 -0.00578
C -1.39363 -0.06486 -0.00388
H -1.35502 2.08940 -0.00326
H 1.12297 2.22244 -0.00815
H 2.47717 0.14296 -0.01153
H 1.35339 -2.06956 -0.01000
H -1.12459 -2.20260 -0.00511
H -2.47879 -0.12312 -0.00174
C 1.01744 2.82421 2.57039
C 2.26864 3.35365 2.25459
C 3.24586 2.53983 1.68163
C 2.97188 1.19657 1.42447
C 1.72068 0.66713 1.74027
C 0.74346 1.48095 2.31323
H 0.25607 3.45827 3.01679
H 2.48211 4.40020 2.45495
H 4.22069 2.95232 1.43559
H 3.73325 0.56251 0.97807
H 1.50721 -0.37942 1.53991
H -0.23137 1.06846 2.55927
end
basis
 C library 6-311G*
 H library 6-311G*
end
scf
  thresh 0.01
end
task scf optimize

leads to the crash:

----------------------------------------------
         Quadratically convergent ROHF

 Convergence threshold     :          1.000E-02
 Maximum no. of iterations :           30
 Final Fock-matrix accuracy:          1.000E-07
 ----------------------------------------------


 Integral file          = ./inp.aoints.0
 Record size in doubles =  65536        No. of integs per rec  =  32766
 Max. records in memory =      0        Max. records in file   =    793
 No. of bits per label  =     16        No. of bits per value  =     64

eaf_write: rc ne bytes -1999 bytes 524288
 eaf_write: rc ne bytes -1999 bytes 524288
   IO offset    240123904.00000000
  IO error message >Write Failed
  IO offset    188219392.00000000
  IO error message >Write Failed
eaf_write: rc ne bytes -1999 bytes 524288
 eaf_write: rc ne bytes -1999 bytes 524288
   IO offset    360710144.00000000
  IO error message >Write Failed

nwchem-6.8.1.20190222 (rev. d8ac0a182) on FreeBSD 11.2 amd64, ga-5.7_4. Run on 8 CPUs using MPI (mpirun).

yurivict, Mar 01 '19 08:03
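For a rough sense of how much scratch space a run like this can need, the figures printed in the log can be multiplied out. The sketch below is only an estimate under stated assumptions: doubles are 8 bytes (consistent with the 524288-byte writes for 65536-double records), each MPI process writes its own integral file, and every file may grow to the reported maximum record count. The function names are hypothetical and are not part of NWChem.

```c
#include <stdint.h>

/* Hypothetical helper: upper bound on the size of one integral file.
 * Assumes 8-byte doubles, matching 524288 bytes per 65536-double record. */
uint64_t aoints_file_bytes(uint64_t doubles_per_record, uint64_t max_records)
{
    return doubles_per_record * 8u * max_records;
}

/* Hypothetical helper: upper bound across all processes, assuming one
 * integral file per MPI process and every file reaching its maximum. */
uint64_t aoints_total_bytes(uint64_t doubles_per_record, uint64_t max_records,
                            uint64_t nproc)
{
    return aoints_file_bytes(doubles_per_record, max_records) * nproc;
}
```

With the values from the log (65536 doubles per record, 793 records, 8 processes) this gives 415,760,384 bytes per file and 3,326,083,072 bytes in total, roughly 3.3 GB of free disk space as an upper bound.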

Add the direct keyword to the scf input block so that the integrals are recomputed on the fly instead of being written to disk.

jeffhammond, Mar 01 '19 15:03

Adding direct helped. So if integral recomputation isn't forced, some integrals end up being wrong?

yurivict, Mar 01 '19 17:03

What integrals were wrong in your first job? The I/O failed and the job crashed.

jeffhammond, Mar 01 '19 17:03

How can the I/O fail? There needs to be a specific reason with an error code. It runs on one machine so it's unlikely that I/O just fails.

yurivict, Mar 01 '19 17:03

Either there was a hardware I/O error or you ran out of disk space.

edoapra, Mar 01 '19 18:03

I haven't spent a lot of time looking at EAF, but it's pretty clear that EAF is returning the error code -1999 for some reason. As Edo said, inadequate disk space is a likely cause.

eaf_write: rc ne bytes -1999 bytes 524288
 eaf_write: rc ne bytes -1999 bytes 524288
   IO offset    240123904.00000000
  IO error message >Write Failed
  IO offset    188219392.00000000
  IO error message >Write Failed
eaf_write: rc ne bytes -1999 bytes 524288
 eaf_write: rc ne bytes -1999 bytes 524288
   IO offset    360710144.00000000
  IO error message >Write Failed

jeffhammond, Mar 01 '19 18:03

You are right, the system log has 'disk full' errors at around this time. NWChem error messages aren't clear, and this causes confusion.

yurivict, Mar 01 '19 18:03

Thank you for clarifying this!

yurivict, Mar 01 '19 18:03

I will add a better error message for this.

jeffhammond, Mar 01 '19 20:03
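As an illustration of what a clearer message could look like, here is a minimal sketch in C that turns a failed or short POSIX write into a readable reason, naming the disk-full case explicitly. It is not the EAF implementation and not the fix that was eventually made: `describe_write_failure` and `checked_write` are hypothetical names, and the sketch assumes the underlying write goes through a POSIX call that sets `errno` (here `pwrite`).

```c
#define _POSIX_C_SOURCE 200809L  /* for pwrite() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>  /* ssize_t, off_t */
#include <unistd.h>     /* pwrite */

/* Hypothetical helper (not part of EAF): explain a failed or short write.
 * `written` is the return value of the write call and `err` is the errno
 * value saved immediately after it. Returns 0 if the write was complete,
 * 1 if a message was placed in buf. */
int describe_write_failure(char *buf, size_t buflen, const char *path,
                           long long offset, size_t requested,
                           ssize_t written, int err)
{
    assert(buf != NULL && buflen > 0);
    if (written >= 0 && (size_t)written == requested) {
        buf[0] = '\0';
        return 0;  /* complete write: nothing to report */
    }
    if (written < 0) {
        /* The call failed outright: include the system's own reason. */
        const char *hint = "";
        if (err == ENOSPC)
            hint = " (disk full)";
        else if (err == EDQUOT)
            hint = " (disk quota exceeded)";
        snprintf(buf, buflen,
                 "write of %zu bytes to %s at offset %lld failed: %s%s",
                 requested, path, offset, strerror(err), hint);
    } else {
        /* Fewer bytes than requested were written; errno is not set here. */
        snprintf(buf, buflen,
                 "short write to %s at offset %lld: %zd of %zu bytes "
                 "(the disk may be full)",
                 path, offset, written, requested);
    }
    return 1;
}

/* Hypothetical wrapper: write nbytes at offset and report why on failure. */
int checked_write(int fd, const char *path, long long offset,
                  const void *data, size_t nbytes)
{
    ssize_t rc = pwrite(fd, data, nbytes, (off_t)offset);
    int err = (rc < 0) ? errno : 0;  /* save errno before any other call */
    char msg[512];
    if (describe_write_failure(msg, sizeof msg, path, offset, nbytes, rc, err)) {
        fprintf(stderr, "%s\n", msg);
        return -1;
    }
    return 0;
}
```

Saving `errno` right after the failing call matters because later library calls may overwrite it. Keeping the message formatting separate from the I/O call also lets the wording be checked without actually filling a disk.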