
When the jobstore isn't on a path available on all nodes, give a better error message

Open arnikz opened this issue 8 years ago • 3 comments

Hello,

I installed toil[cwl] (v3.8) with pip and created the toy example files echo-job.cwl and echo-job.yml:

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
stdout: output.txt
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  output:
    type: stdout

message: Hello world!

I then submitted this CWL workflow to:

  1. single machine

cwltoil echo-job.cwl echo-job.yml --workDir $(pwd) # OK, written output.txt file

  2. SGE cluster

cwltoil --batchSystem sge --maxCores 2 --maxMemory 1G --disableCaching --workDir $(pwd) echo-job.cwl echo-job.yml

The jobs fail and no output.txt file is written:

Traceback (most recent call last):
  File "/home/arnikz/miniconda2/envs/toil/bin/_toil_worker", line 11, in <module>
    sys.exit(main())
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/worker.py", line 103, in main
    jobStore = Toil.resumeJobStore(jobStoreLocator)
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/common.py", line 772, in resumeJobStore
    jobStore.resume()
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 77, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/tmp/tmpyxJBKg' does not exist, so there is nothing to restart

  3. SLURM cluster

cwltoil --batchSystem slurm --maxCores 2 --maxMemory 1G --disableCaching --workDir $(pwd) echo-job.cwl echo-job.yml

The jobs fail, but a (non-empty) output.txt file is written:

Traceback (most recent call last):
  File "/home/arnikz/miniconda2/envs/toil/bin/_toil_worker", line 11, in <module>
    sys.exit(main())
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/worker.py", line 103, in main
    jobStore = Toil.resumeJobStore(jobStoreLocator)
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/common.py", line 772, in resumeJobStore
    jobStore.resume()
  File "/home/arnikz/miniconda2/envs/toil/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 77, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/tmp/tmp5kPlBc' does not exist, so there is nothing to restart

Could you please help?

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-177

arnikz, Jul 13 '17 14:07

Hello @arnikz, my guess is that the job store path under /tmp was not present on all nodes of your cluster.

Here's how we run Toil @ EBI: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/workflows/run-toil-v4.sh

Note the use of --jobStore to specify a path known to be present on all nodes.

The error message should be updated to give this hint.
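
As a rough illustration of the --jobStore suggestion, here is a minimal Python sketch that assembles the command line from the original report with an explicit job store added. The helper name build_cwltoil_command and the path /shared/scratch/toil-jobstore are hypothetical; substitute a directory that every node in your cluster can actually see.

```python
import shlex


def build_cwltoil_command(job_store, work_dir, cwl_file, job_file, batch_system="slurm"):
    """Return the cwltoil argument list with an explicit job store path.

    Hypothetical helper for illustration only. The flags are the ones used
    in this thread; the resource limits mirror the original report.
    """
    return [
        "cwltoil",
        "--jobStore", job_store,        # must be visible from all nodes
        "--batchSystem", batch_system,
        "--maxCores", "2",
        "--maxMemory", "1G",
        "--disableCaching",
        "--workDir", work_dir,
        cwl_file,
        job_file,
    ]


# Assumed path: replace with a shared filesystem location on your cluster.
command = build_cwltoil_command(
    job_store="/shared/scratch/toil-jobstore",
    work_dir="/tmp",
    cwl_file="echo-job.cwl",
    job_file="echo-job.yml",
)
print(" ".join(shlex.quote(arg) for arg in command))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run directly.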

mr-c, Oct 31 '17 09:10

Thanks @mr-c for the suggestions! It worked after setting --workDir to /tmp and adding a --jobStore path.

arnikz, Nov 02 '17 17:11

➤ Adam Novak commented:

I think we probably still produce the same “nothing to restart” error from the worker, when a different phrasing would make more sense. We could catch the exception and throw a different one or log a critical message and quit.
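
A minimal sketch of the "catch the exception, log a critical message and quit" option might look like the following. It is not Toil's actual worker code: NoSuchJobStoreException here is a local stand-in for toil.jobStores.abstractJobStore.NoSuchJobStoreException, resume_on_worker is a hypothetical wrapper, and the code is written for Python 3 rather than the Python 2.7 shown in the tracebacks.

```python
import logging
import sys

logger = logging.getLogger(__name__)


class NoSuchJobStoreException(Exception):
    """Local stand-in for Toil's exception of the same name (assumption)."""

    def __init__(self, locator):
        super().__init__(
            "The job store '%s' does not exist, so there is nothing to restart" % locator
        )
        self.locator = locator


def resume_on_worker(resume, locator):
    """Call resume(locator); if the job store is missing, log a hint and exit.

    `resume` stands in for whatever the worker uses to open the job store
    (Toil.resumeJobStore in the tracebacks above).
    """
    try:
        return resume(locator)
    except NoSuchJobStoreException:
        logger.critical(
            "This worker cannot find the job store '%s'. If you are running on "
            "a cluster, make sure --jobStore points to a path that is "
            "available on all nodes.",
            locator,
        )
        sys.exit(1)
```

Raising a differently worded exception instead of exiting would work similarly; the point is only that the worker-side message mentions shared storage rather than "nothing to restart".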

unito-bot, May 07 '25 19:05