tmLQCD code doesn't always exit cleanly on BG/Q

I don't know what happened just now, but my latest run stopped after the last measurement and then sat idle for an hour... (again, wasted computer time...)

Apr 04 '13 14:04 kostrzewa

In fact, I think this is why I observed the problem of running into the wallclock limit... I guess we thus have even better performance, maybe 33 minutes per trajectory...

Apr 04 '13 14:04 kostrzewa

Whenever I've seen something like this happen, it was usually a problem with the filesystem. Was there an I/O call you may have gotten stuck at?

Apr 05 '13 11:04 deuzeman

No, all the I/O finished cleanly.

Apr 05 '13 12:04 kostrzewa

It's weird because in some instances the code exits cleanly, while in others it gets stuck after the last online measurement...

Apr 05 '13 12:04 kostrzewa

Did it write the last message, too? I mean the acceptance rate?

Apr 05 '13 12:04 urbach

No.

Apr 05 '13 12:04 kostrzewa

# mu = 0.000900, kappa = 0.137280, csw = 1.575510
# CG: iter: 33742 eps_sq: 1.0000e-20 t/s: 8.3376e+01
# CG: flopcount (for e/o tmWilson only): t/s: 8.3376e+01 mflops_local: 13846.8 mflops: 14179129.7
# Inversion done in 33742 iterations, squared residue = 9.632945e-15!
# Inversion done in 8.34e+01 sec. 
ONLINE: measurement done int t/s = 8.4132e+01
2013-03-31 17:21:11.569 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: received signal 24
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: signal sent from USER
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: sent from pid 20353
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: could not read /proc/20353/exe
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: Permission denied
2013-03-31 17:21:11.571 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: sent from uid 0 (root)
2013-03-31 17:21:13.374 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: terminated by signal 9
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 36
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: 7 RAS events
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: most recent RAS event text: L1P Correctable Error Summary : count=3 cores=2,10,13 L1P_ESR : [ERR_RELOAD_ECC_X2] correctable reload data ECC error;

There are unfortunately no timestamps produced in the code, so I don't know how long it idled before it was killed at 17:21:11, but it should have exited cleanly here, as it had completed all 42 trajectories.
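One way to get at the idle time without touching the code would be to stamp the output from the outside. The following is only a rough sketch in Python, not something that exists in tmLQCD: the names stamp, stamp_stream and idle_seconds are made up for illustration, and it assumes that the job's stdout can be passed through a filter at all, which I have not checked on BG/Q. The timestamp layout simply mimics the one runjob uses in the log above.

```python
from datetime import datetime

# Same layout as the runjob lines, e.g. "2013-03-31 17:21:11.569"
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def stamp(line, now=None):
    """Prefix one line of output with a wall-clock timestamp (millisecond precision)."""
    if now is None:
        now = datetime.now()
    text = line.rstrip("\n")
    # strftime's %f yields microseconds; drop the last three digits
    prefix = now.strftime(STAMP_FORMAT)[:-3]
    return prefix + " " + text


def stamp_stream(src, dst):
    """Copy src to dst line by line, stamping each line as it arrives."""
    for line in src:
        dst.write(stamp(line) + "\n")
        dst.flush()  # flush so the last line is on disk if the job is killed


def idle_seconds(earlier, later):
    """Seconds elapsed between two timestamps given in STAMP_FORMAT."""
    t0 = datetime.strptime(earlier, STAMP_FORMAT)
    t1 = datetime.strptime(later, STAMP_FORMAT)
    return (t1 - t0).total_seconds()
```

A small wrapper calling stamp_stream(sys.stdin, sys.stdout) would act as the filter. With a stamped "ONLINE: measurement done" line, idle_seconds applied to that stamp and the first runjob warning would give the idle time directly. Applied to the two runjob events in the log above, idle_seconds("2013-03-31 17:21:11.569", "2013-03-31 17:21:13.374") gives the roughly 1.8 seconds between signal 24 and the kill.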

Maybe it got into a deadlock because I manually edited .nstore_counter (due to the corrupt conf.0228) and may have messed up some convention?

Apr 05 '13 12:04 kostrzewa