Error/logging messages that may need work
Dumping my list of bad error/logging messages. Some are more egregious than others.
- [ ] Jobs "end successfully" even when they fail:
```
Job ended successfully: 'CactusBarWrapperWithPrecomputedEndAlignments' 3c056a1e-db38-4707-9fd2-c2b6885ea12e
Job 'CactusBarWrapperWithPrecomputedEndAlignments' 3c056a1e-db38-4707-9fd2-c2b6885ea12e continues to exist (i.e. has more to do)
Loaded job 3c056a1e-db38-4707-9fd2-c2b6885ea12e
Reading FileInfo(fileID='b214910d-f7c4-45da-9f94-045c49ead0b6', ownerID='3c056a1e-db38-4707-9fd2-c2b6885ea12e', encrypted=False, version='', previousVersion='', content='ing \nRunning co...b6885ea12e to 1\n', _numContentChunks=10) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f8c84c3f890> into stream.
The job seems to have left a log file, indicating failure: 'CactusBarWrapperWithPrecomputedEndAlignments' 3c056a1e-db38-4707-9fd2-c2b6885ea12e
```
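One possible shape for the fix, as a minimal sketch: check for a leftover log file before declaring success. `logFileExists` is a hypothetical accessor (not Toil's actual API), and `jobStore`, `jobGraph`, and `logger` are assumed to be in scope:

```python
# Hypothetical sketch: only report success when the job left no failure log.
if jobStore.logFileExists(jobGraph):  # hypothetical accessor
    logger.warning('The job seems to have left a log file, indicating '
                   'failure: %s', jobGraph)
else:
    logger.info('Job ended successfully: %s', jobGraph)
```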
- [x] Single machine resource exception gives absolutely no information about which job requested too many resources (a sketch of a fix follows the traceback below):
File "/opt/python/2.7.13/lib/python2.7/site-packages/toil/batchSystems/singleMachine.py", line 201, in issueBatchJob
self.checkResourceRequest(jobNode.memory, cores, jobNode.disk)
File "/opt/python/2.7.13/lib/python2.7/site-packages/toil/batchSystems/abstractBatchSystem.py", line 250, in checkResourceRequest
raise InsufficientSystemResources('disk', disk, self.maxDisk)
InsufficientSystemResources: Requesting more disk than either physically available, or enforced by --maxDisk. Requested: 10066329600, Available: 6589394944
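A hedged sketch of how `checkResourceRequest` could name the offending job; the extra `jobName` parameter is an assumption, not the method's current signature, and only the disk branch is shown:

```python
# Sketch: thread the job's name through to the exception so the message
# says which job asked for too much. The jobName parameter is hypothetical.
def checkResourceRequest(self, memory, cores, disk, jobName=None):
    if disk > self.maxDisk:
        raise InsufficientSystemResources(
            "Job %s requested %i bytes of disk, but only %i bytes are "
            "physically available or allowed by --maxDisk."
            % (jobName or '(unknown)', disk, self.maxDisk))
```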
- [ ] Worker log output during failure is formatted differently from normal worker log output: normally each line is prefixed with both the job name and the job ID, but after a failure it is prefixed with only the job ID.
- [x] toil ssh-cluster on nonexistent cluster raises IndexError instead of a useful error message:
```
(venv) ~/cactus > toil ssh-cluster -z us-west-2b non-existent
raven 2017-03-30 17:41:33,710 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
raven 2017-03-30 17:41:33,710 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
Traceback (most recent call last):
  File "/home/joel/cactus/venv/bin/toil", line 11, in <module>
    load_entry_point('toil==3.7.0a1', 'console_scripts', 'toil')()
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/utils/toilMain.py", line 33, in main
    module.main()
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/utils/toilSSHCluster.py", line 34, in main
    cluster.sshCluster(args=config.args)
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/provisioners/__init__.py", line 68, in sshCluster
    self.provisioner.sshLeader(self.clusterName, args, self.zone)
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 139, in sshLeader
    leader = cls._getLeader(clusterName, zone=zone)
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 274, in _getLeader
    leader = instances[0] # assume leader was launched first
IndexError: list index out of range
```
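A sketch of the guard `_getLeader` could apply instead of indexing blindly; the lookup helper is hypothetical, inferred from the traceback rather than taken from Toil's verified API:

```python
# Sketch: fail with an actionable message when no instances match.
instances = cls._getClusterInstances(clusterName, zone)  # hypothetical lookup
if not instances:
    raise RuntimeError("No cluster named %r found in zone %s. Check the "
                       "cluster name and the -z/--zone option."
                       % (clusterName, zone))
leader = instances[0]  # assume leader was launched first
```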
- [x] EOFError when unpickling a job if promises aren't resolved properly? Should have an error message that tells the user to add another level of indirection (#1607). A hedged sketch of that workaround follows:
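In this sketch the job classes `B`, `C`, and `MakeC` are hypothetical: instead of constructing the consumer with a promise directly, the consumer is constructed inside a follow-on that runs only after the promise can be resolved.

```python
from toil.job import Job

class MakeC(Job):
    """Hypothetical indirection job: by the time run() executes, the
    promise stored on self has been resolved to a concrete value."""
    def __init__(self, value):
        Job.__init__(self)
        self.value = value

    def run(self, fileStore):
        self.addChild(C(self.value))  # C now receives a plain value

# In the parent's run(), instead of self.addChild(C(b.rv())):
#   b = self.addChild(B())
#   self.addFollowOn(MakeC(b.rv()))
```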
- [ ] If a job subclass forgets to call Job.__init__, all you get is this:
```
j/o/job4x2SHB INFO:toil.fileStore:Starting job ('A' j/o/job4x2SHB) with ID (ba43a76e76cf14d8100401016cab96e549ad94ca).
j/o/job4x2SHB Traceback (most recent call last):
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
j/o/job4x2SHB     job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1286, in _runner
j/o/job4x2SHB     returnValues = self._run(jobGraph, fileStore)
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1231, in _run
j/o/job4x2SHB     return self.run(fileStore)
j/o/job4x2SHB   File "/home/joel/toil-fresh/testPromiseWiringError.py", line 9, in run
j/o/job4x2SHB     self.addChild(C(promise))
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 320, in addChild
j/o/job4x2SHB     childJob._addPredecessor(self)
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 824, in _addPredecessor
j/o/job4x2SHB     if predecessorJob in self._directPredecessors:
j/o/job4x2SHB AttributeError: 'C' object has no attribute '_directPredecessors'
j/o/job4x2SHB ERROR:toil.worker:Exiting the worker because of a failed job on host jabberwocky
```
Should just display a useful hint if the job doesn't have that attribute; a sketch of such a guard follows:
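In this sketch only the guard is the point; the body after the check is simplified from the real `_addPredecessor`:

```python
def _addPredecessor(self, predecessorJob):
    # Sketch: fail early with a hint instead of a bare AttributeError.
    if not hasattr(self, '_directPredecessors'):
        raise RuntimeError(
            "%s has no '_directPredecessors' attribute; its __init__ "
            "probably forgot to call Job.__init__(self)."
            % self.__class__.__name__)
    # Simplified stand-in for the real method's bookkeeping:
    assert predecessorJob not in self._directPredecessors
    self._directPredecessors.add(predecessorJob)
```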
- [ ] If the user accidentally hands a job class instead of a job instance:
```
Traceback (most recent call last):
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1286, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1231, in _run
    return self.run(fileStore)
  File "/home/joel/toil-fresh/testPromiseWiringError.py", line 8, in run
    promise = self.addChild(B).rv()
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 320, in addChild
    childJob._addPredecessor(self)
TypeError: unbound method _addPredecessor() must be called with B instance as first argument (got A instance instead)
```
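A sketch of an up-front check in `addChild`; the body below the check is simplified from the real method:

```python
def addChild(self, childJob):
    # Sketch: catch the class-instead-of-instance mistake with a clear error.
    if isinstance(childJob, type):
        raise TypeError(
            "addChild expects a Job instance but got the class %s itself; "
            "did you mean %s(...)?" % (childJob.__name__, childJob.__name__))
    childJob._addPredecessor(self)
    return childJob
```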
- [x] Mesos resource requirements aren't logged properly:
```
Offer f9c1e779-505b-4fdd-8cd2-3bb0c225f0fa-O19950 not suitable to run the tasks with requirements <toil.batchSystems.mesos.ResourceRequirement object at 0x7f8fc4b23a90>. Mesos offered 2581594112.0 memory, 22.0 cores and 5.26931460096e+11 of disk on a preemptable slave.
```
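A sketch of the kind of fix: give `ResourceRequirement` a `__repr__` so the offer log shows numbers instead of a bare object address. The attribute names are assumptions, not the class's verified fields:

```python
class ResourceRequirement(object):
    # Sketch: attribute names assumed; only __repr__ is the point here.
    def __init__(self, memory, cores, disk):
        self.memory, self.cores, self.disk = memory, cores, disk

    def __repr__(self):
        return ('ResourceRequirement(memory=%s, cores=%s, disk=%s)'
                % (self.memory, self.cores, self.disk))
```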
- [ ] The debug log does not indicate which job is affected when singleMachine fails to run a job:
```
Could not acquire enough (memory) to run job. Requested: (7759462400), Avaliable: 7629611854. Sleeping for 10s.
```
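A sketch of a more actionable message; `jobNode`, `log`, and `availableMemory` are assumed to be in scope at the call site:

```python
# Sketch: name the job so the operator knows what is stuck.
log.debug('Could not acquire enough memory to run job %s. '
          'Requested: %i, available: %i. Sleeping for 10s.',
          jobNode.jobName, jobNode.memory, availableMemory)
```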
- [ ] (nitpick) JobStoreExistsException doesn't prefix the job store locator with 'aws:', 'file:', etc., for easy copy-and-paste into toil clean.
```
toil.jobStores.abstractJobStore.JobStoreExistsException: The job store 'us-west-2:jcarmstr-mesos-partition' already exists. Use --restart to resume the workflow, or remove the job store with 'toil clean' to start the workflow from scratch
```
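A sketch of the nitpick's fix; `locator` is assumed to be the full prefixed locator where the exception is raised, and the message-style constructor here is illustrative rather than the exception's real signature:

```python
# Sketch: 'locator' (e.g. 'aws:us-west-2:jcarmstr-mesos-partition') is
# assumed to be available; embedding it makes the message pasteable.
raise JobStoreExistsException(
    "The job store '%s' already exists. Use --restart to resume the "
    "workflow, or remove the job store with 'toil clean %s' to start "
    "the workflow from scratch." % (locator, locator))
```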
- [ ] If you forget to add .rv() when passing a promise, you just get this:
```
Y/k/jobdWlgUP Traceback (most recent call last):
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/worker.py", line 336, in main
Y/k/jobdWlgUP     job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1292, in _runner
Y/k/jobdWlgUP     self._serialiseExistingJob(jobGraph, jobStore, returnValues)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1214, in _serialiseExistingJob
Y/k/jobdWlgUP     self._serialiseJobGraph(jobGraph, jobStore, returnValues, False)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1183, in _serialiseJobGraph
Y/k/jobdWlgUP     job._serialiseJob(jobStore, jobsToJobGraphs, jobGraph)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1066, in _serialiseJob
Y/k/jobdWlgUP     cPickle.dump(self, fileHandle, cPickle.HIGHEST_PROTOCOL)
Y/k/jobdWlgUP PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
```
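A minimal reproduction sketch with hypothetical jobs B and C: passing the bound method instead of calling it stores an unpicklable function on the child job, which surfaces as the PicklingError above.

```python
# Inside a parent job's run() (hypothetical jobs B and C):
b = self.addChild(B())
self.addChild(C(b.rv))    # wrong: stores the bound method itself
self.addChild(C(b.rv()))  # right: stores a Promise, which pickles cleanly
```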
- [x] The file store's NoSuchFileException repeats the message twice, because the full message gets substituted where the file name should go:
```
kolossus 2017-05-10 22:14:59,551 MainThread WARNING toil.leader: 6/A/jobvU55HL   File "/cluster/home/jcarmstr/venv/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 378, in _checkJobStoreFileID
kolossus 2017-05-10 22:14:59,551 MainThread WARNING toil.leader: 6/A/jobvU55HL     raise NoSuchFileException("File %s does not exist in jobStore" % jobStoreFileID)
kolossus 2017-05-10 22:14:59,551 MainThread WARNING toil.leader: 6/A/jobvU55HL NoSuchFileException: File 'File Anc2.fa does not exist in jobStore' does not exist
```
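A sketch of the fix: pass the bare file ID to the exception and build the sentence in exactly one place. The existence check is hypothetical:

```python
class NoSuchFileException(Exception):
    # Sketch: format the message in exactly one place.
    def __init__(self, jobStoreFileID):
        super(NoSuchFileException, self).__init__(
            "File %r does not exist in the job store." % jobStoreFileID)

def _checkJobStoreFileID(self, jobStoreFileID):
    if not self._fileExists(jobStoreFileID):  # hypothetical existence check
        raise NoSuchFileException(jobStoreFileID)
```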
Issue is synchronized with this Jira Story. Epic: Improve debugging experience. Issue Number: TOIL-170.
> toil ssh-cluster on nonexistent cluster raises IndexError instead of a useful error message:
It looks like this behavior is observed with almost all, if not all, of the Toil utilities. destroy-cluster and clean handle a nonexistent cluster silently:
```
$ toil destroy-cluster does-not-exist
UCSC-amerob-lt.local 2017-06-09 15:06:44,663 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
UCSC-amerob-lt.local 2017-06-09 15:06:50,258 MainThread INFO toil.provisioners.aws.awsProvisioner: Deleting security group...
```
ssh-cluster and rsync-cluster are uninformative:
```
$ toil ssh-cluster does-not-exist
UCSC-amerob-lt.local 2017-06-09 15:07:40,381 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
Traceback (most recent call last):
  File "/Users/natan/Projects/toil/venv/bin/toil", line 11, in <module>
    load_entry_point('toil', 'console_scripts', 'toil')()
  File "/Users/natan/Projects/toil/src/toil/utils/toilMain.py", line 33, in main
    module.main()
  File "/Users/natan/Projects/toil/src/toil/utils/toilSSHCluster.py", line 33, in main
    cluster.sshCluster(args=config.args)
  File "/Users/natan/Projects/toil/src/toil/provisioners/__init__.py", line 74, in sshCluster
    self.provisioner.sshLeader(self.clusterName, args, self.zone)
  File "/Users/natan/Projects/toil/src/toil/provisioners/aws/awsProvisioner.py", line 217, in sshLeader
    leader = cls._getLeader(clusterName, zone=zone)
  File "/Users/natan/Projects/toil/src/toil/provisioners/aws/awsProvisioner.py", line 495, in _getLeader
    leader = instances[0] # assume leader was launched first
IndexError: list index out of range
```
status, stats, and kill all handle this correctly:
```
$ toil kill does-not-exist
UCSC-amerob-lt.local 2017-06-09 15:09:07,310 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
Traceback (most recent call last):
  File "/Users/natan/Projects/toil/venv/bin/toil", line 11, in <module>
    load_entry_point('toil', 'console_scripts', 'toil')()
  File "/Users/natan/Projects/toil/src/toil/utils/toilMain.py", line 33, in main
    module.main()
  File "/Users/natan/Projects/toil/src/toil/utils/toilKill.py", line 37, in main
    jobStore = Toil.resumeJobStore(config.jobStore)
  File "/Users/natan/Projects/toil/src/toil/common.py", line 772, in resumeJobStore
    jobStore.resume()
  File "/Users/natan/Projects/toil/src/toil/jobStores/fileJobStore.py", line 77, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/Users/natan/Projects/toil/does-not-exist' does not exist, so there is nothing to restart
```
➤ Melaina Legaspi commented:
Mark Diekhans, how do you feel about taking this task on?
➤ Adam Novak commented:
I think we should go through this, see which are not yet fixed/plausibly still broken, and turn those into their own issues. Then we can close this.