`drmaa.JobState.DONE` with non-zero exit_status
Snakemake version 7.7.0 -- I cannot see any items in the Changelog, or edits to the source, that would change the behaviour described in this bug.
Summary
Using a UGE cluster with `--drmaa`, I observe a `MissingOutputException` for rules whose jobs finish with a non-zero `exit_status`. I suspect this is due to the interpretation of `drmaa.JobState.DONE`.
Observation
When running jobs on the cluster I frequently see the following error. I increased `--latency-wait` significantly (10 mins) but this did not resolve the issue, so I decided to go probing.
```
MissingOutputException in line 88 of /path/to/rule.snake:
Job Missing files after 600 seconds:
<missing_paths_here>
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 30 completed successfully, but some output files are missing. 30
```
From the logs, I saw that my jobs were failing for legitimate reasons, but they weren't being interpreted as failures by Snakemake. Curious.
Inspecting the UGE `qstat`/`qacct` output for the job, I observe the following:
```
...
qsub_time     07/06/2022 18:42:44.521
start_time    07/06/2022 18:42:45.496
end_time      07/06/2022 18:43:57.083
granted_pe    mt
slots         1
failed        0            <<<<
deleted_by    NONE
exit_status   1            <<<<
ru_wallclock  71.587
ru_utime      65.676
ru_stime      3.748
ru_maxrss     1437448
...
```
This told me that Snakemake isn't picking up the `exit_status=1` and is instead waiting for the output files to appear. I then noted the `failed=0` line and suspected that maybe Snakemake is using this flag to determine success.
To check this I injected some logging statements into `DRMAAExecutor._wait_for_jobs()`:
https://github.com/snakemake/snakemake/blob/main/snakemake/executors/__init__.py#L1545
```python
for active_job in active_jobs:
    with self.status_rate_limiter:
        try:
            retval = self.session.jobStatus(active_job.jobid)
            # --- injected logging statements ---
            logger.info(
                f"Job {active_job.job.jobid}, DRMAA id: {active_job.jobid}, "
                f"status: {retval} - {active_job!r}"
            )
            logger.info(f"{self.session.wait(active_job.jobid)}")
            # -----------------------------------
        except drmaa.ExitTimeoutException as e:
```
Here we see that `retval` comes back as `"done"`:
```
Job 34, DRMAA id: 8649716, status: done - DRMAAClusterJob(
    job=JOB_NAME, jobid='8649716',
    callback=<bound method JobScheduler._proceed of <snakemake.scheduler.JobScheduler object at 0x7f4f17116b80>>,
    error_callback=<bound method JobScheduler._error of <snakemake.scheduler.JobScheduler object at 0x7f4f17116b80>>,
    jobscript='/path/to/script')
```
The `self.session.wait` call returns a `JobInfo` instance, which exposes more detailed information from the DRMAA library:
```
JobInfo(jobId='8649716',
        hasExited=True,
        hasSignal=False,
        terminatedSignal='',
        hasCoreDump=False,
        wasAborted=False,
        exitStatus=1,        <<<<
        resourceUsage={...})
```
Here we can see that the DRMAA library reports `exitStatus=1` while still returning `drmaa.JobState.DONE`, which is the state Snakemake interprets as success.

This could be a bug in the DRMAA library, but I believe it is the interpretation of the `drmaa.JobState.DONE` state. The `drmaa.JobState.FAILED` description is "job exited abnormally before finishing", where I guess "abnormally" is up for interpretation and does not necessarily cover a job which has a non-zero `exitStatus`.
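For illustration, here is a minimal sketch of the kind of check I mean (the helper name is hypothetical and this is not a patch against the executor; it assumes an initialised drmaa-python session): after `jobStatus()` reports `DONE`, also reap the job with `session.wait()` and inspect the `JobInfo` before declaring success.

```python
import drmaa

def job_succeeded(session, jobid):
    """Hypothetical helper: True only if the job is DONE *and* exited cleanly."""
    state = session.jobStatus(jobid)
    if state == drmaa.JobState.FAILED:
        return False
    if state != drmaa.JobState.DONE:
        raise RuntimeError(f"job {jobid} is still active: {state}")
    # Reap the finished job; as it is already DONE this returns immediately.
    info = session.wait(jobid, drmaa.Session.TIMEOUT_NO_WAIT)
    # DONE alone is not enough: also require a clean, zero exit status.
    return info.hasExited and info.exitStatus == 0 and not info.wasAborted
```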
I'm not sure what other information I could provide to make this bug reproducible, but a minimal Snakefile along the lines of the sketch below should trigger it; if there's any more information that I could give please let me know.
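As a sketch (rule name and output path are made up), the shell command below exits non-zero without creating its declared output, so the executor should report a failed job rather than waiting out `--latency-wait`:

```python
# Hypothetical minimal Snakefile.
rule always_fails:
    output:
        "results/never_created.txt"
    shell:
        # Exits non-zero without producing the output; the DRMAA executor
        # should flag this as a failure, not a missing-output timeout.
        "exit 1"
```

Running it with something like `snakemake --drmaa --jobs 1 results/never_created.txt` on the cluster should reproduce the `MissingOutputException` above.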
Edit: This behaviour is not observed when I run locally; Snakemake correctly detects the non-zero exit status.
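To rule Snakemake out entirely, a standalone check with drmaa-python could look like the following (a sketch, assuming a working DRMAA session on the cluster; the polling interval is arbitrary):

```python
import time
import drmaa

# Submit a job that exits 1, then compare the terminal jobStatus() against
# the exit status reported by wait()'s JobInfo.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/sh"
    jt.args = ["-c", "exit 1"]
    jobid = session.runJob(jt)

    # Poll the state the same way the executor sees it.
    state = session.jobStatus(jobid)
    while state not in (drmaa.JobState.DONE, drmaa.JobState.FAILED):
        time.sleep(5)
        state = session.jobStatus(jobid)
    print("terminal jobStatus:", state)  # per the observations above, UGE reports 'done'

    # Reap the job and show the exit status that the state alone hides.
    info = session.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("exitStatus:", info.exitStatus)  # expect 1
    session.deleteJobTemplate(jt)
```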
Any thoughts on this issue from anyone?