QCFractal

Clean up submitted LSF jobs on graceful termination

Open jchodera opened this issue 6 years ago • 3 comments

Describe the bug

During my brief testing earlier today, I found that when qcfractal-manager is terminated with ctrl-c, it shuts down gracefully but does not clean up the jobs it submitted to the LSF queue.

To Reproduce

  1. Start qcfractal-manager on an LSF system
  2. Terminate qcfractal-manager with ctrl-c
  3. Observe that jobs remain in the LSF queue with bjobs
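Step 3 can also be checked from a script. The following is a minimal sketch, assuming the default `bjobs` column layout (JOBID, USER, STAT, ...); the function name `parse_bjobs` and the sample text are made up for illustration and are not part of QCFractal.

```python
from typing import List, Tuple


def parse_bjobs(output: str) -> List[Tuple[str, str]]:
    """Return (job_id, status) pairs from default `bjobs` text output.

    Assumes the first column is a numeric job ID and the third is the
    job status (e.g. PEND, RUN). The header and other lines are skipped.
    """
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header row, blank lines, and messages such as "No unfinished job found".
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        jobs.append((fields[0], fields[2]))
    return jobs


# Illustrative sample only; real output comes from running `bjobs` on the cluster.
SAMPLE = """\
JOBID   USER    STAT  QUEUE   FROM_HOST   EXEC_HOST   JOB_NAME     SUBMIT_TIME
12345   alice   PEND  normal  login01                 dask-worker  May 15 00:01
12346   alice   RUN   normal  login01     node07      dask-worker  May 15 00:01
"""

leftover = parse_bjobs(SAMPLE)
```

On a real system the input would come from something like `subprocess.run(["bjobs"], capture_output=True, text=True).stdout`, run after the manager has exited.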

Expected behavior

If qcfractal-manager has the opportunity to terminate gracefully, it should bkill the jobs it submitted.
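The requested behavior could look roughly like the sketch below. This is not how qcfractal-manager or dask-jobqueue is implemented; `LSFJobTracker` and `run_manager` are hypothetical names, and the sketch assumes the manager knows the LSF job IDs it submitted and that `bkill` is on the PATH.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


class LSFJobTracker:
    """Hypothetical helper: remembers submitted LSF job IDs and kills them on exit."""

    def __init__(self, runner: Optional[Callable[[Sequence[str]], object]] = None):
        self._job_ids: List[str] = []
        # The runner is injectable so the logic can be exercised without LSF installed.
        self._runner = runner or (lambda cmd: subprocess.run(cmd, check=False))

    def register(self, job_id) -> None:
        """Record a job ID returned by the submission step."""
        self._job_ids.append(str(job_id))

    def cleanup(self) -> List[str]:
        """Issue a single `bkill` for all tracked jobs and return the command used."""
        if not self._job_ids:
            return []
        cmd = ["bkill", *self._job_ids]
        self._runner(cmd)
        self._job_ids.clear()
        return cmd


def run_manager(tracker: LSFJobTracker, main_loop: Callable[[], None]) -> None:
    """Run the manager loop and clean up queued jobs on any exit, including ctrl-c."""
    try:
        main_loop()
    except KeyboardInterrupt:
        pass  # ctrl-c is treated as a request for graceful shutdown
    finally:
        tracker.cleanup()
```

Putting the cleanup in a `finally` block means it also runs when the loop exits normally or raises an error, not only on ctrl-c. It cannot help if the process is killed with SIGKILL.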

Additional context

This was observed on the MSK lilac cluster.

jchodera, May 15 '19 00:05

By and large these jobs are killed when the manager shuts down, but some do slip through for unknown reasons. When one of these jobs lands it will die in a few seconds as it is unable to find its scheduler so it should not be considered too harmful.

As we depend on other distributed queue managers for this capability, we are unlikely to fix this ourselves. We can, however, pass the bug report along to dask-jobqueue.

dgasmith, May 15 '19 13:05

> By and large these jobs are killed when the manager shuts down, but some do slip through for unknown reasons. When one of these jobs lands it will die in a few seconds as it is unable to find its scheduler so it should not be considered too harmful.

This is not the behavior I have observed. Queued jobs sit in the queue for some time. If a misconfiguration (such as the GB/MB/KB dask issue) led to the job requesting an unsatisfiable amount of resources, the job will sit in the queue indefinitely.

jchodera, May 15 '19 14:05

Ok, good to know. This is likely specific to the combination of dask and LSF, as the above behavior works for SLURM, PBS, etc. We will be pushing some changes upstream to dask-jobqueue and will note this issue there.

dgasmith, May 15 '19 14:05

The new manager is Parsl-based, which I think does a better job with LSF, but this would need additional testing.

bennybp, Sep 14 '23 16:09