tmLQCD

code doesn't always exit cleanly on BG/Q

Open kostrzewa opened this issue 12 years ago • 7 comments

I don't know what happened just now, but my latest run stopped after the last measurement and just hung there for an hour... (again, wasted computer time...)

kostrzewa avatar Apr 04 '13 14:04 kostrzewa

In fact, I think this is why I observed the problem of running into the wallclock limit... I guess we thus have even better performance, maybe 33 minutes per trajectory...

kostrzewa avatar Apr 04 '13 14:04 kostrzewa

Whenever I've seen something like this happen, it was usually a problem with the filesystem. Was there an I/O call you may have gotten stuck at?

deuzeman avatar Apr 05 '13 11:04 deuzeman

No, all the I/O finished cleanly

kostrzewa avatar Apr 05 '13 12:04 kostrzewa

It's weird because in some instances the code exits cleanly, while in others it gets stuck after the last online measurement...

kostrzewa avatar Apr 05 '13 12:04 kostrzewa

Did it write the last message, too? I mean the acceptance rate?

urbach avatar Apr 05 '13 12:04 urbach

no

kostrzewa avatar Apr 05 '13 12:04 kostrzewa

```
# mu = 0.000900, kappa = 0.137280, csw = 1.575510
# CG: iter: 33742 eps_sq: 1.0000e-20 t/s: 8.3376e+01
# CG: flopcount (for e/o tmWilson only): t/s: 8.3376e+01 mflops_local: 13846.8 mflops: 14179129.7
# Inversion done in 33742 iterations, squared residue = 9.632945e-15!
# Inversion done in 8.34e+01 sec.
ONLINE: measurement done int t/s = 8.4132e+01
2013-03-31 17:21:11.569 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: received signal 24
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: signal sent from USER
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: sent from pid 20353
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: could not read /proc/20353/exe
2013-03-31 17:21:11.570 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: Permission denied
2013-03-31 17:21:11.571 (WARN ) [0x40001448c80] :ibm.runjob.LogSignalInfo: sent from uid 0 (root)
2013-03-31 17:21:13.374 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: terminated by signal 9
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 36
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: 7 RAS events
2013-03-31 17:21:13.375 (WARN ) [0x40001448c80] :360729:ibm.runjob.client.Job: most recent RAS event text: L1P Correctable Error Summary : count=3 cores=2,10,13 L1P_ESR : [ERR_RELOAD_ECC_X2] correctable reload data ECC error;
```

Unfortunately the code itself produces no timestamps, so I don't know how long it idled before it was killed at 17:21:11, but it should have exited cleanly here as it had completed all 42 trajectories.
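To narrow down how long the code idles next time, something along these lines could stamp the last few messages. This is only a sketch: `format_timestamp` and `log_stamped` are hypothetical helpers that do not exist in tmLQCD, and the `# [...]` prefix is just an assumed format chosen to resemble the existing output.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper, not part of tmLQCD: write the current local time
   as "YYYY-MM-DD HH:MM:SS" into buf. Returns the number of characters
   written (excluding the terminating NUL), or 0 on failure, e.g. if buf
   is too small. */
static size_t format_timestamp(char *buf, size_t len) {
  time_t now = time(NULL);
  struct tm *tm_now = localtime(&now);
  if (tm_now == NULL) {
    return 0;
  }
  return strftime(buf, len, "%Y-%m-%d %H:%M:%S", tm_now);
}

/* Hypothetical helper: print msg with a timestamp prefix and flush
   immediately, so the line reaches the output even if the job is killed
   shortly afterwards. */
static void log_stamped(FILE *out, const char *msg) {
  char stamp[32];
  if (format_timestamp(stamp, sizeof stamp) == 0) {
    stamp[0] = '\0';
  }
  fprintf(out, "# [%s] %s\n", stamp, msg);
  fflush(out);
}
```

Calling `log_stamped(stdout, "online measurement done")` after the last measurement, and again just before finalisation, would bracket the idle period against the runjob timestamps. In a parallel run it would presumably be called on one rank only to avoid duplicated lines.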

Maybe it got into a deadlock because I chose to manually edit .nstore_counter (due to the corrupt conf.0228) and I may have messed up some convention?

kostrzewa avatar Apr 05 '13 12:04 kostrzewa