Alloc crash + new allocs have (deleted) in jobscript
Hi,
I've been running hq on a Slurm-based cluster with autoalloc. Sometimes the server becomes inactive or unreachable for no apparent reason, and afterwards I usually find that all my allocations have crashed. If I then restart an allocation to continue with the hq queue, no submission succeeds, because each autoalloc hq-submit.sh script now has hq (deleted) worker add... as the submission command.
Any ideas what might be going on? If logs or anything else would help, please let me know. Is it possible to change the hq worker command that the server uses when submitting?
Cheers, Louis
Hi, that looks really weird. The autoallocator first discovers the path to the hq binary from /proc/self/exe so that it knows how to invoke hq itself. It then uses this executable path to spawn new workers in Slurm/PBS scripts.
It seems that if the hq binary gets deleted in the meantime (which really shouldn't happen!), the discovered path contains the (deleted) part, and that part ends up in the generated script.
We will fix HQ so that it doesn't print the (deleted) part. However, if the hq binary really has been deleted (or is inaccessible), this won't help you. Do you have any idea why the hq binary might get deleted? Is the path where hq is located accessible from the compute nodes? From which node do you run hq alloc, the login node?
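To make the mechanism concrete, here is a minimal sketch in Python (not HQ's actual code) of reading the executable path from /proc/self/exe and pasting it into a submission command. The function names, the example path and the argument list are hypothetical, chosen only for illustration.

```python
import os
import shlex


def current_exe_path():
    """Return the path of the running executable as Linux reports it.

    Linux only. In this sketch the result is the Python interpreter;
    inside hq it would be the hq binary.
    """
    return os.readlink("/proc/self/exe")


def naive_submit_line(exe_path, args):
    # Joins the pieces with spaces and no quoting, so anything appended
    # to the path lands in the command verbatim.
    return " ".join([exe_path, *args])


def quoted_submit_line(exe_path, args):
    # Quoting keeps the path as one shell word, but a path ending in
    # " (deleted)" still does not name an existing file.
    return " ".join(shlex.quote(part) for part in [exe_path, *args])


# Hypothetical path and arguments, for illustration only.
print(naive_submit_line("/home/user/bin/hq (deleted)", ["worker", "..."]))
```

With the unquoted variant, the shell sees (deleted) as a separate token right after the binary name, which matches the broken command seen in the generated script.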
Edit: I found that it's not strictly HQ's fault, it's just Linux returning <path> (deleted) from /proc/PID/exe if the binary gets deleted during the lifetime of the program. Peculiar.
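The Linux behaviour can be reproduced without HQ. The sketch below assumes a Linux system with a sleep binary on PATH and a temporary directory that allows executing files; the helper name exe_link_after_delete is hypothetical.

```python
import os
import shutil
import subprocess
import tempfile


def exe_link_after_delete():
    """Start a copied binary, delete the copy, and return /proc/<pid>/exe."""
    src = shutil.which("sleep")
    if src is None:
        raise FileNotFoundError("no 'sleep' binary found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        copy = os.path.join(tmp, "sleep")
        shutil.copy(src, copy)  # copies file contents and permission bits
        proc = subprocess.Popen([copy, "30"])
        try:
            os.remove(copy)  # unlink the binary while the process still runs
            return os.readlink(f"/proc/{proc.pid}/exe")
        finally:
            proc.terminate()
            proc.wait()


if __name__ == "__main__":
    print(exe_link_after_delete())
```

On Linux the returned string is the original path of the copy followed by the (deleted) marker, even though the process itself keeps running normally.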
Hi, thanks for looking into this. It is very weird indeed, as the executable is certainly still there after this happens. I honestly think it has something to do with how the shared filesystem is set up on this particular cluster, which very often behaves strangely. The problem has also just magically stopped happening, so I think this can be closed.
Thanks again, Louis
We will try to detect (deleted) in the executable path and provide a better warning to the user. I'll close this issue once it's implemented.
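One possible shape for such a check, sketched in Python rather than taken from HQ: strip the marker, see whether the original path still exists, and warn either way. The function name and warning texts are assumptions made for this example.

```python
import os
import warnings

DELETED_SUFFIX = " (deleted)"


def strip_deleted_marker(path):
    """Return a usable executable path, warning if Linux marked it as deleted."""
    if not path.endswith(DELETED_SUFFIX) or os.path.exists(path):
        # No marker, or a file whose name really ends in " (deleted)".
        return path
    candidate = path[: -len(DELETED_SUFFIX)]
    if os.path.exists(candidate):
        warnings.warn(
            f"executable was reported as deleted, but {candidate} exists; "
            "using that path"
        )
    else:
        warnings.warn(
            f"executable {candidate} was deleted; "
            "workers started from it will fail"
        )
    return candidate
```

The first branch covers the situation described above, where the executable is still present at the same path (for example after being replaced, or on an unreliable shared filesystem), so falling back to the stripped path is reasonable there.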