`drmaa.JobState.DONE` with non-zero exit_status

Open HalfPhoton opened this issue 3 years ago • 1 comments

Snakemake version 7.7.0 -- Cannot see any items in Changelog or edits to the source that would change the behaviour of this bug.

Summary Using UGE cluster with --drmaa, I observe MissingOutputException for rules which have non-zero exit_status. I suspect this is due to the interpretation of drmaa.JobState.DONE.

Observation When running jobs on the cluster I frequently see the following error. I increased the --latency-wait significantly (10 mins) but this did not resolve the issue so I decided to go probing.

MissingOutputException in line 88 of /path/to/rule.snake:

Job Missing files after 600 seconds. 

This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait; 
<missing_paths_here>, but some output files are missing. 30

From logs, I saw that my jobs were failing for legitimate reasons but they weren't being interpreted as failures by Snakemake. Curious.

Inspecting the UGE qstat / qacct job output, I observe the following:

...
qsub_time    07/06/2022 18:42:44.521
start_time   07/06/2022 18:42:45.496
end_time     07/06/2022 18:43:57.083
granted_pe   mt
slots        1
failed       0        <<<<
deleted_by   NONE
exit_status  1        <<<<
ru_wallclock 71.587
ru_utime     65.676
ru_stime     3.748
ru_maxrss    1437448
....

This told me that Snakemake isn't picking up the exit_status=1 and is instead waiting for the files to appear. I then noted the failed=0 line and suspected that maybe Snakemake is using this flag to determine success.

To check this I injected some logging statements in DRMAAExecutor._wait_for_jobs():

https://github.com/snakemake/snakemake/blob/main/snakemake/executors/init.py#L1545

for active_job in active_jobs:
    with self.status_rate_limiter:
        try:
           retval = self.session.jobStatus(active_job.jobid)
>          logger.info(f"Job {active_job.job.jobid}, DRMAA id: {active_job.jobid,}  status: {retval} - {active_job!r}")
>          logger.info(f"{self.session.wait(active_job.jobid)}")
        except drmaa.ExitTimeoutException as e:

Here we see that retval = "done"

Job 34, DRMAA id: ('8649716',)  status: done
DRMAAClusterJob(
job=JOB_NAME, jobid='8649716', 
callback=<bound method JobScheduler._proceed of <snakemake.scheduler.JobScheduler object at 0x7f4f17116b80>>, 
error_callback=<bound method JobScheduler._error of <snakemake.scheduler.JobScheduler object at 0x7f4f17116b80>>, 
jobscript='/path/to/script')

The self.session.wait call returns a JobInfo instance which shows more detailed information from the DRMAA library.

JobInfo(jobId='8649716', 
hasExited=True, 
hasSignal=False, 
terminatedSignal='', 
hasCoreDump=False,
wasAborted=False,
exitStatus=1,          <<<<
resourceUsage={...}
)

Here we can see that the DRMAA library detects exitStatus=1 while giving a drmaa.JobState.DONE which is the flag interpreted by Snakemake for success.

This could be a bug in the DRMAA library but I believe it could be the interpretation of the drmaa.JobState.Done flag. The drmaa.JobState.Failed flag description is job exited abnormally before finishing where I guess "abnormally" is up for interpretation and does not necessarily mean a job which has a non-zero exitStatus.

I'm not sure what information I could provide to make this bug reproducible, but if there's any more information that I could give please let me know.

Edit: This behaviour is not observed when I run locally and Snakemake correctly detects the non-zero exit status

Jul 07 '22 15:07 HalfPhoton

Any thoughts on this issue from anyone?

Sep 05 '22 08:09 HalfPhoton