RMG-Java
Gaussian freezes job when running in QM mode
A couple of my jobs made no progress over the weekend. On logging in to the compute node where they were running and typing 'top', I discovered that the Gaussian program l103.exe had been running for ~5000 minutes without progress. In this scenario it would be good for RMG to just kill the process and carry on with the next attempt.
top - 15:28:31 up 159 days,  3:50,  1 user,  load average: 2.25, 2.17, 2.06
Tasks: 180 total,   3 running, 177 sleeping,   0 stopped,   0 zombie
Cpu(s): 28.2%us,  0.0%sy,  0.0%ni, 71.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16465776k total, 13075504k used,  3390272k free,   342396k buffers
Swap: 11999224k total,     3140k used, 11996084k free,  9402620k cached
PID to kill: 7279
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7279 rwest     20   0  139m  53m 1044 R  100  0.3   5015:43 l103.exe
17585 rwest     20   0  139m  53m 1044 R  100  0.3   4996:03 l103.exe
 9839 rwest     20   0 2956m 1.3g 8908 S    0  8.1  75:18.92 java
Here are the input files to two jobs that have been frozen in l103.exe for ~15 hours now. Notice they are the same molecule.
rwest@node47:/tmp/60336.1.long2/QMfiles$ cat ASIRPOSRZLOBKS-UHFFFAOYAJ.gjf
%chk=/tmp/60336.1.long2/QMfiles/RMGrunCHKfile.chk
%mem=6MW
%nproc=1
# pm3 opt=(tight,nolinear,calcfc,small,maxcyc=200) freq IOP(2/16=3)

InChI=1/C2O4/c3-1-2(3,4-1)6-5-1

0 1
O -0.91470 -1.09320 0.11030
C -0.28610 -0.01190 -0.75400
O 1.17360 -0.11080 -0.65280
C -0.31180 0.04870 0.62450
O -0.75130 1.21630 0.01190
O 1.09030 -0.04900 0.66010
rwest@node47:/tmp/60367.1.long2/QMfiles$ cat ASIRPOSRZLOBKS-UHFFFAOYAJ.gjf
%chk=/tmp/60367.1.long2/QMfiles/RMGrunCHKfile.chk
%mem=6MW
%nproc=1
# pm3 opt=(tight,nolinear,calcfc,small,maxcyc=200) freq IOP(2/16=3)

InChI=1/C2O4/c3-1-2(3,4-1)6-5-1

0 1
O -0.91470 -1.09320 0.11030
C -0.28610 -0.01190 -0.75400
O 1.17360 -0.11080 -0.65280
C -0.31180 0.04870 0.62450
O -0.75130 1.21630 0.01190
O 1.09030 -0.04900 0.66010
and here's the 'top' output:
top - 12:32:41 up 162 days, 54 min,  1 user,  load average: 2.02, 2.02, 2.00
Tasks: 180 total,   3 running, 177 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy, 27.1%ni, 72.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16465776k total,  9774520k used,  6691256k free,   172912k buffers
Swap: 11999224k total,    21428k used, 11977796k free,  5746328k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14680 rwest     30  10  139m  53m 1044 R  100  0.3 913:58.62 l103.exe
15160 rwest     30  10  139m  53m 1044 R  100  0.3 840:22.49 l103.exe
As for this particular molecule, running on pharos with this in the condition file at 85228e998d95f27d415d9df39b3fc2016b8408cf, I don't encounter the hang on attempt #10, and attempt #14 "succeeds" but has optimized to a different structure. This is a known issue of unknown significance: I have seen it several times with wacky species in MOPAC, but I think this may be the first time I've seen it with Gaussian. I originally had a way of dealing with this, but I'm not sure that it is robust, so it is currently commented out.
In any case, PM3 seems to break this up into two CO2 molecules with other keyword choices, so I suspect that this should be a forbidden structure.
For the more general issue of what to do when Gaussian hangs, consider a timeout option to be on my to-do list.
PS...here is the adjacency list I used:
1 C 0 {2,S} {3,S} {4,S} {5,S}
2 C 0 {1,S} {3,S} {4,S} {6,S}
3 O 0 {1,S} {2,S}
4 O 0 {1,S} {2,S}
5 O 0 {1,S} {6,S}
6 O 0 {2,S} {5,S}
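For reference, here is a minimal sketch of what the timeout option mentioned above might look like. It is not taken from the RMG-Java source: the class name ExternalJobRunner and the method runWithTimeout are hypothetical, and it assumes Java 8 or later for Process.waitFor(long, TimeUnit) and Process.destroyForcibly(). It launches an external command, waits up to a fixed time limit, and kills the process if it has not exited by then, so the caller can move on to the next attempt.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of RMG-Java: runs an external command and
// kills it if it has not exited within the given time limit.
final class ExternalJobRunner {

    /**
     * Runs the command and waits at most the given time for it to exit.
     *
     * @return true if the command exited on its own within the limit,
     *         false if it was still running and had to be killed
     */
    static boolean runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Share the parent's stdin/stdout/stderr so the child can never
        // block on a full pipe that nobody is reading.
        builder.inheritIO();

        Process process = builder.start();

        // Java 8+: returns false if the limit elapses while the process
        // is still running.
        boolean finished = process.waitFor(timeout, unit);
        if (!finished) {
            // Java 8+: forcibly terminate the direct child, then wait for
            // it to actually exit so it does not linger as a zombie.
            process.destroyForcibly();
            process.waitFor();
        }
        return finished;
    }
}
```

A caller could treat a false return value the same way as a failed attempt and continue with the next set of keywords. One caveat: destroyForcibly() only targets the process that was started directly. If that process launches further executables of its own (as appears to be the case here, where the stuck process is l103.exe), those may survive and need separate handling.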