QCFractal
Clean up submitted LSF jobs on graceful termination
Describe the bug
In brief testing earlier today, I found that when qcfractal-manager is terminated with ctrl-c, it shuts down gracefully but does not clean up the jobs it submitted to the LSF queue.
To Reproduce
- Start qcfractal-manager on an LSF system
- Terminate qcfractal-manager with ctrl-c
- Observe with bjobs that jobs remain in the LSF queue
Expected behavior
If qcfractal-manager has the opportunity to terminate gracefully, it should bkill the jobs it submitted.
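A minimal sketch of the requested behavior, assuming the manager could record the LSF job IDs it submits. This is not how qcfractal-manager or dask-jobqueue is implemented; `LSFJobTracker` and `install_cleanup` are hypothetical names used only for illustration. The only LSF feature relied on is that `bkill` accepts one or more job IDs as arguments.

```python
import atexit
import signal
import subprocess
import sys


class LSFJobTracker:
    """Hypothetical helper: remember submitted LSF job IDs and bkill them on exit."""

    def __init__(self, run=subprocess.run):
        # `run` is injectable so the cleanup logic can be exercised without LSF.
        self._run = run
        self._job_ids = []

    def register(self, job_id):
        """Record a job ID returned by a submission (assumed to be known to the caller)."""
        self._job_ids.append(str(job_id))

    def kill_command(self):
        """Build the bkill invocation for all tracked jobs."""
        return ["bkill", *self._job_ids]

    def cleanup(self):
        """Kill every tracked job once; later calls are no-ops."""
        if not self._job_ids:
            return None
        cmd = self.kill_command()
        # check=False: a job that already finished should not abort shutdown.
        self._run(cmd, check=False)
        self._job_ids.clear()
        return cmd


def install_cleanup(tracker):
    """Run tracker.cleanup() on normal exit, ctrl-c (SIGINT), and SIGTERM."""
    atexit.register(tracker.cleanup)

    def handler(signum, frame):
        tracker.cleanup()
        sys.exit(0)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

In this sketch the manager would call `tracker.register(job_id)` after each submission and `install_cleanup(tracker)` once at startup. Jobs that are still pending in the queue would be removed along with running ones, since `bkill` acts on the job ID regardless of state.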
Additional context
This was observed on the MSK lilac cluster.
By and large these jobs are killed when the manager shuts down, but some do slip through for unknown reasons. When one of these jobs lands it will die in a few seconds as it is unable to find its scheduler so it should not be considered too harmful.
As we depend on other distributed queue managers for this capability, we are unlikely to fix this ourselves. We can, however, pass the bug report along to dask-jobqueue.
> By and large these jobs are killed when the manager shuts down, but some do slip through for unknown reasons. When one of these jobs lands it will die in a few seconds as it is unable to find its scheduler so it should not be considered too harmful.
This is not the behavior I have observed. Queued jobs sit in the queue for some time. If a misconfiguration (such as the GB/MB/KB dask issue) led to the job requesting an unsatisfiable amount of resources, the job will sit in the queue indefinitely.
Ok, good to know. This is likely specific to the combination of dask and LSF, as the behavior described above works for SLURM/PBS/etc. We will be pushing some changes upstream to dask-jobqueue and will note this issue there.
The new manager is Parsl-based, which I think does a better job with LSF, but this would need additional testing.