code doesn't always exit cleanly on BG/Q
I don't know what happened just now, but my latest run stopped after the last measurement and then sat idle for an hour... (again, wasted computer time...)
In fact, I think this is why my runs kept hitting the wallclock limit... If so, the actual performance is even better than I thought, maybe 33 minutes per trajectory...
Whenever I've seen something like this happen, it was usually a problem with the filesystem. Was there an I/O call you may have gotten stuck at?
No, all the I/O finished cleanly.
It's weird because in some instances the code exits cleanly, while in others it gets stuck after the last online measurement...
Did it write the last message, too? I mean the acceptance rate?
no
# mu = 0.000900, kappa = 0.137280, csw = 1.575510
# CG: iter: 33742 eps_sq: 1.0000e-20 t/s: 8.3376e+01
# CG: flopcount (for e/o tmWilson only): t/s: 8.3376e+01 mflops_local: 13846.8 mflops: 14179129.7
# Inversion done in 33742 iterations, squared residue = 9.632945e-15!
# Inversion done in 8.34e+01 sec.
ONLINE: measurement done int t/s = 8.4132e+01
2013-03-31 17:21:11.569 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: received signal 24
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: signal sent from USER
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: sent from pid 20353
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: could not read /proc/20353/exe
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: Permission denied
2013-03-31 17:21:11.571 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: sent from uid 0 (root)
2013-03-31 17:21:13.374 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: terminated by signal 9
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 36
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: 7 RAS events
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: most recent RAS event text: L1P Correctable Error Summary : count=3 cores=2,10,13 L1P_ESR : [ERR_RELOAD_ECC_X2] correctable reload data ECC error;
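A side note on reading the log above: on most Linux architectures signal 24 is SIGXCPU, which batch systems commonly send when a time limit is exceeded, and signal 9 is SIGKILL. That would fit the job being killed at the wallclock limit, though the numbering is platform dependent, so this is a guess rather than a diagnosis. A minimal sketch for checking it on a given machine, where `describe_signal` is a hypothetical helper and not part of the code under discussion:

```c
#include <signal.h>
#include <stdio.h>

/* Hypothetical helper: maps a signal number to a short name using the
   platform's own <signal.h> macros, so no numeric values are hard-coded. */
static const char *describe_signal(int signo)
{
  switch (signo) {
    case SIGXCPU: return "SIGXCPU";
    case SIGKILL: return "SIGKILL";
    case SIGTERM: return "SIGTERM";
    default:      return "other";
  }
}

/* Prints what the two signal numbers from the log mean on this platform. */
static void print_log_signals(void)
{
  printf("signal 24: %s\n", describe_signal(24));
  printf("signal 9: %s\n", describe_signal(9));
}
```

Calling `print_log_signals()` on the front-end node would show whether 24 really maps to SIGXCPU there.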
Unfortunately the code produces no timestamps, so I don't know how long it idled before it was killed at 17:21:11, but it should have exited cleanly here since it had completed all 42 trajectories.
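On the missing timestamps: a minimal sketch of how progress messages could be stamped with wall-clock time, assuming plain C with only the standard library. The names `format_timestamp` and `log_with_timestamp` are hypothetical and do not exist in the code; the output format is also just an illustration.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: writes the current local time as
   "YYYY-MM-DD HH:MM:SS" into buf. Returns the number of characters
   written, or 0 on failure. */
static size_t format_timestamp(char *buf, size_t len)
{
  time_t now = time(NULL);
  struct tm *tm_now = localtime(&now);
  if (tm_now == NULL) {
    return 0;
  }
  return strftime(buf, len, "%Y-%m-%d %H:%M:%S", tm_now);
}

/* Hypothetical helper: prints msg prefixed with a timestamp and flushes
   the stream, so the line is not lost in a buffer if the job is killed.
   Returns the number of characters printed, or -1 on failure. */
static int log_with_timestamp(FILE *out, const char *msg)
{
  char stamp[32];
  int n;
  if (format_timestamp(stamp, sizeof stamp) == 0) {
    return -1;
  }
  n = fprintf(out, "# [%s] %s\n", stamp, msg);
  fflush(out);
  return n;
}
```

With something like `log_with_timestamp(stdout, "ONLINE: measurement done")` after each measurement, the gap between the last message and the kill time in the runjob log could be read off directly.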
Maybe it got into a deadlock because I chose to manually edit .nstore_counter (due to the corrupt conf.0228) and I may have messed up some convention?