When the jobstore isn't on a path available on all nodes, give a better error message
Hello,
I ran pip install toil[cwl] (v3.8) and created the toy example files echo-job.cwl and echo-job.yml:
# echo-job.cwl
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout

# echo-job.yml
message: Hello world!
I then submitted this CWL workflow to:
- single machine
cwltoil echo-job.cwl echo-job.yml --workDir $(pwd) # OK, written output.txt file
- SGE cluster
cwltoil --batchSystem sge --maxCores 2 --maxMemory 1G --disableCaching --workDir $(pwd) echo-job.cwl echo-job.yml
The jobs fail and no output.txt file is written:
Traceback (most recent call last):
  File "/home/arnikz/miniconda2/envs/toil/bin/_toil_worker", line 11, in <module>
    sys.exit(main())
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/worker.py", line 103, in main
    jobStore = Toil.resumeJobStore(jobStoreLocator)
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/common.py", line 772, in resumeJobStore
    jobStore.resume()
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 77, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/tmp/tmpyxJBKg' does not exist, so there is nothing to restart
- SLURM cluster
cwltoil --batchSystem slurm --maxCores 2 --maxMemory 1G --disableCaching --workDir $(pwd) echo-job.cwl echo-job.yml
The jobs fail, but a (non-empty) output.txt file is written:
Traceback (most recent call last):
  File "/home/arnikz/miniconda2/envs/toil/bin/_toil_worker", line 11, in <module>
    sys.exit(main())
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/worker.py", line 103, in main
    jobStore = Toil.resumeJobStore(jobStoreLocator)
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/common.py", line 772, in resumeJobStore
    jobStore.resume()
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 77, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/tmp/tmp5kPlBc' does not exist, so there is nothing to restart
Could you please help?
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-177
Hello @arnikz
My guess is that the job store directory created under /tmp is not visible on all nodes of your cluster, since /tmp is usually local to each node.
Here's how we run Toil @ EBI: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/workflows/run-toil-v4.sh
Note the use of --jobStore to specify a path known to be present on all nodes.
The error message should be updated to give this hint.
Thanks @mr-c for the suggestions! It worked after setting the workDir to /tmp and adding a --jobStore path.
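For readers landing here with the same error, the working setup described above can be sketched as a small Python helper that assembles the command line. This is only an illustration: the helper name `build_cwltoil_command`, the shared root `/shared/scratch`, and the run name are hypothetical, and the only flags used are ones that already appear in this thread.

```python
import os


def build_cwltoil_command(shared_root, run_name, cwl_file, job_file,
                          batch_system="slurm"):
    """Assemble a cwltoil argument list whose job store is on shared storage.

    shared_root is assumed to be a directory visible from every node
    (hypothetical here); the job store is placed in a sub-directory of it.
    """
    job_store = os.path.join(shared_root, run_name)
    return [
        "cwltoil",
        "--batchSystem", batch_system,
        "--jobStore", job_store,   # must be reachable from all worker nodes
        "--workDir", "/tmp",       # node-local scratch, as in the comment above
        cwl_file,
        job_file,
    ]


if __name__ == "__main__":
    # Hypothetical shared path; replace with a file system shared by all nodes.
    command = build_cwltoil_command(
        "/shared/scratch", "echo-job-store", "echo-job.cwl", "echo-job.yml"
    )
    print(" ".join(command))
```

The point of the sketch is the split between the two locations: the job store has to sit on storage shared by the submitting host and the workers, while the work directory can stay node-local.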
➤ Adam Novak commented:
I think we probably still produce the same “nothing to restart” error from the worker, when a different phrasing would make more sense. We could catch the exception and throw a different one or log a critical message and quit.
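A minimal sketch of the "catch the exception and throw a different one" option follows. It does not use Toil's real code: `NoSuchJobStoreException` is redefined locally as a stand-in so the example runs on its own, and `JobStoreNotVisibleOnWorkerError`, `resume_file_job_store`, and `resume_on_worker` are hypothetical names chosen for illustration. Where such a handler would actually live in the worker is an assumption.

```python
import os
import socket


class NoSuchJobStoreException(Exception):
    """Local stand-in for Toil's exception of the same name."""


class JobStoreNotVisibleOnWorkerError(Exception):
    """Hypothetical exception that carries a hint about shared storage."""


def resume_file_job_store(job_store_dir):
    """Stand-in for the resume step seen in the traceback.

    Fails with the original 'nothing to restart' wording when the
    directory does not exist on this host.
    """
    if not os.path.isdir(job_store_dir):
        raise NoSuchJobStoreException(
            "The job store '%s' does not exist, so there is nothing to restart"
            % job_store_dir
        )
    return job_store_dir


def resume_on_worker(job_store_dir):
    """Resume the job store on a worker, rewording the error if it is missing."""
    try:
        return resume_file_job_store(job_store_dir)
    except NoSuchJobStoreException:
        # On a worker, a missing job store most likely means the path is
        # not shared between nodes, so say that instead.
        raise JobStoreNotVisibleOnWorkerError(
            "The job store '%s' is not visible from worker node '%s'. "
            "When running on a cluster, pass --jobStore with a path on a "
            "file system shared by all nodes."
            % (job_store_dir, socket.gethostname())
        )
```

The alternative mentioned above, logging a critical message and quitting, would replace the `raise` in the handler with a logging call followed by an exit; which of the two fits the worker better is left open here.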