Error/logging messages that may need work
Dumping my list of bad error/logging messages. Some are more egregious than others.
- [ ] Jobs "end successfully" even when they fail:
```
Job ended successfully: 'CactusBarWrapperWithPrecomputedEndAlignments' 3c056a1e-db38-4707-9fd2-c2b6885ea12e
Job 'CactusBarWrapperWithPrecomputedEndAlignments' 3c056a1e-db38-4707-9fd2-c2b6885ea12e continues to exist (i.e. has more to do)
Loaded job 3c056a1e-db38-4707-9fd2-c2b6885ea12e
Reading FileInfo(fileID='b214910d-f7c4-45da-9f94-045c49ead0b6', ownerID='3c056a1e-db38-4707-9fd2-c2b6885ea12e', encrypted=False, version='', previousVersion='', content='ing \nRunning co...b6885ea12e to 1\n', _numContentChunks=10) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f8c84c3f890> into stream.
The job seems to have left a log file, indicating failure: 'CactusBarWrapperWithPrecomputedEndAlignments' 3c056a1e-db38-4707-9fd2-c2b6885ea12e
```
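One possible shape for the fix, as a minimal sketch: check for a leftover log file before declaring success. `logFileExists` is a hypothetical accessor (not Toil's actual API), and `jobStore`, `jobGraph`, and `logger` are assumed to be in scope:

```python
# Hypothetical sketch: only report success when the job left no failure log.
if jobStore.logFileExists(jobGraph):  # hypothetical accessor
    logger.warning('The job seems to have left a log file, indicating '
                   'failure: %s', jobGraph)
else:
    logger.info('Job ended successfully: %s', jobGraph)
```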
- [x] Single machine resource exception gives absolutely no information about which job requested too many resources (a sketch of a fix follows the traceback below):
File "/opt/python/2.7.13/lib/python2.7/site-packages/toil/batchSystems/singleMachine.py", line 201, in issueBatchJob
self.checkResourceRequest(jobNode.memory, cores, jobNode.disk)
File "/opt/python/2.7.13/lib/python2.7/site-packages/toil/batchSystems/abstractBatchSystem.py", line 250, in checkResourceRequest
raise InsufficientSystemResources('disk', disk, self.maxDisk)
InsufficientSystemResources: Requesting more disk than either physically available, or enforced by --maxDisk. Requested: 10066329600, Available: 6589394944
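A hedged sketch of how `checkResourceRequest` could name the offending job; the extra `jobName` parameter is an assumption, not the method's current signature, and only the disk branch is shown:

```python
# Sketch: thread the job's name through to the exception so the message
# says which job asked for too much. The jobName parameter is hypothetical.
def checkResourceRequest(self, memory, cores, disk, jobName=None):
    if disk > self.maxDisk:
        raise InsufficientSystemResources(
            "Job %s requested %i bytes of disk, but only %i bytes are "
            "physically available or allowed by --maxDisk."
            % (jobName or '(unknown)', disk, self.maxDisk))
```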
- [ ] Worker log output during failure is formatted differently from normal worker log output: normally each line is prefixed with both the job name and the job ID, but after a failure it is prefixed with only the job ID.
- [x] toil ssh-cluster on nonexistent cluster raises IndexError instead of a useful error message:
```
(venv) ~/cactus > toil ssh-cluster -z us-west-2b non-existent
raven 2017-03-30 17:41:33,710 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
raven 2017-03-30 17:41:33,710 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
Traceback (most recent call last):
  File "/home/joel/cactus/venv/bin/toil", line 11, in <module>
    load_entry_point('toil==3.7.0a1', 'console_scripts', 'toil')()
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/utils/toilMain.py", line 33, in main
    module.main()
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/utils/toilSSHCluster.py", line 34, in main
    cluster.sshCluster(args=config.args)
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/provisioners/__init__.py", line 68, in sshCluster
    self.provisioner.sshLeader(self.clusterName, args, self.zone)
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 139, in sshLeader
    leader = cls._getLeader(clusterName, zone=zone)
  File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 274, in _getLeader
    leader = instances[0] # assume leader was launched first
IndexError: list index out of range
```
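A sketch of the guard `_getLeader` could apply instead of indexing blindly; the lookup helper is hypothetical, inferred from the traceback rather than taken from Toil's verified API:

```python
# Sketch: fail with an actionable message when no instances match.
instances = cls._getClusterInstances(clusterName, zone)  # hypothetical lookup
if not instances:
    raise RuntimeError("No cluster named %r found in zone %s. Check the "
                       "cluster name and the -z/--zone option."
                       % (clusterName, zone))
leader = instances[0]  # assume leader was launched first
```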
- [x] EOFError when unpickling a job if promises aren't resolved properly? Should have an error message that tells the user to add another level of indirection (#1607). A hedged sketch of that workaround follows:
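In this sketch the job classes `B`, `C`, and `MakeC` are hypothetical: instead of constructing the consumer with a promise directly, the consumer is constructed inside a follow-on that runs only after the promise can be resolved.

```python
from toil.job import Job

class MakeC(Job):
    """Hypothetical indirection job: by the time run() executes, the
    promise stored on self has been resolved to a concrete value."""
    def __init__(self, value):
        Job.__init__(self)
        self.value = value

    def run(self, fileStore):
        self.addChild(C(self.value))  # C now receives a plain value

# In the parent's run(), instead of self.addChild(C(b.rv())):
#   b = self.addChild(B())
#   self.addFollowOn(MakeC(b.rv()))
```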
- [ ] If a job subclass forgets to call Job.__init__, all you get is this:
```
j/o/job4x2SHB INFO:toil.fileStore:Starting job ('A' j/o/job4x2SHB) with ID (ba43a76e76cf14d8100401016cab96e549ad94ca).
j/o/job4x2SHB Traceback (most recent call last):
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
j/o/job4x2SHB     job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1286, in _runner
j/o/job4x2SHB     returnValues = self._run(jobGraph, fileStore)
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1231, in _run
j/o/job4x2SHB     return self.run(fileStore)
j/o/job4x2SHB   File "/home/joel/toil-fresh/testPromiseWiringError.py", line 9, in run
j/o/job4x2SHB     self.addChild(C(promise))
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 320, in addChild
j/o/job4x2SHB     childJob._addPredecessor(self)
j/o/job4x2SHB   File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 824, in _addPredecessor
j/o/job4x2SHB     if predecessorJob in self._directPredecessors:
j/o/job4x2SHB AttributeError: 'C' object has no attribute '_directPredecessors'
j/o/job4x2SHB ERROR:toil.worker:Exiting the worker because of a failed job on host jabberwocky
```
Should just display a useful hint if the job doesn't have that attribute; a sketch of such a guard follows:
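In this sketch only the guard is the point; the body after the check is simplified from the real `_addPredecessor`:

```python
def _addPredecessor(self, predecessorJob):
    # Sketch: fail early with a hint instead of a bare AttributeError.
    if not hasattr(self, '_directPredecessors'):
        raise RuntimeError(
            "%s has no '_directPredecessors' attribute; its __init__ "
            "probably forgot to call Job.__init__(self)."
            % self.__class__.__name__)
    # Simplified stand-in for the real method's bookkeeping:
    assert predecessorJob not in self._directPredecessors
    self._directPredecessors.add(predecessorJob)
```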
- [ ] If the user accidentally hands a job class instead of a job instance:
```
Traceback (most recent call last):
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1286, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 1231, in _run
    return self.run(fileStore)
  File "/home/joel/toil-fresh/testPromiseWiringError.py", line 8, in run
    promise = self.addChild(B).rv()
  File "/home/joel/toil-fresh/venv/local/lib/python2.7/site-packages/toil/job.py", line 320, in addChild
    childJob._addPredecessor(self)
TypeError: unbound method _addPredecessor() must be called with B instance as first argument (got A instance instead)
```
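A sketch of an up-front check in `addChild`; the body below the check is simplified from the real method:

```python
def addChild(self, childJob):
    # Sketch: catch the class-instead-of-instance mistake with a clear error.
    if isinstance(childJob, type):
        raise TypeError(
            "addChild expects a Job instance but got the class %s itself; "
            "did you mean %s(...)?" % (childJob.__name__, childJob.__name__))
    childJob._addPredecessor(self)
    return childJob
```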
- [x] Mesos resource requirements aren't logged properly:
```
Offer f9c1e779-505b-4fdd-8cd2-3bb0c225f0fa-O19950 not suitable to run the tasks with requirements <toil.batchSystems.mesos.ResourceRequirement object at 0x7f8fc4b23a90>. Mesos offered 2581594112.0 memory, 22.0 cores and 5.26931460096e+11 of disk on a preemptable slave.
```
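A sketch of the kind of fix: give `ResourceRequirement` a `__repr__` so the offer log shows numbers instead of a bare object address. The attribute names are assumptions, not the class's verified fields:

```python
class ResourceRequirement(object):
    # Sketch: attribute names assumed; only __repr__ is the point here.
    def __init__(self, memory, cores, disk):
        self.memory, self.cores, self.disk = memory, cores, disk

    def __repr__(self):
        return ('ResourceRequirement(memory=%s, cores=%s, disk=%s)'
                % (self.memory, self.cores, self.disk))
```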
- [ ] The debug log does not indicate which job is affected when singleMachine fails to run a job:
```
Could not acquire enough (memory) to run job. Requested: (7759462400), Avaliable: 7629611854. Sleeping for 10s.
```
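A sketch of a more actionable message; `jobNode`, `log`, and `availableMemory` are assumed to be in scope at the call site:

```python
# Sketch: name the job so the operator knows what is stuck.
log.debug('Could not acquire enough memory to run job %s. '
          'Requested: %i, available: %i. Sleeping for 10s.',
          jobNode.jobName, jobNode.memory, availableMemory)
```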
- [ ] (nitpick) JobStoreExistsException doesn't prefix the job store locator with 'aws:', 'file:', etc., for easy copy-and-paste into toil clean.
```
toil.jobStores.abstractJobStore.JobStoreExistsException: The job store 'us-west-2:jcarmstr-mesos-partition' already exists. Use --restart to resume the workflow, or remove the job store with 'toil clean' to start the workflow from scratch
```
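A sketch of the nitpick's fix; `locator` is assumed to be the full prefixed locator where the exception is raised, and the message-style constructor here is illustrative rather than the exception's real signature:

```python
# Sketch: 'locator' (e.g. 'aws:us-west-2:jcarmstr-mesos-partition') is
# assumed to be available; embedding it makes the message pasteable.
raise JobStoreExistsException(
    "The job store '%s' already exists. Use --restart to resume the "
    "workflow, or remove the job store with 'toil clean %s' to start "
    "the workflow from scratch." % (locator, locator))
```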
- [ ] If you forget to add .rv() when passing a promise, you just get this:
```
Y/k/jobdWlgUP Traceback (most recent call last):
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/worker.py", line 336, in main
Y/k/jobdWlgUP     job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1292, in _runner
Y/k/jobdWlgUP     self._serialiseExistingJob(jobGraph, jobStore, returnValues)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1214, in _serialiseExistingJob
Y/k/jobdWlgUP     self._serialiseJobGraph(jobGraph, jobStore, returnValues, False)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1183, in _serialiseJobGraph
Y/k/jobdWlgUP     job._serialiseJob(jobStore, jobsToJobGraphs, jobGraph)
Y/k/jobdWlgUP   File "/home/joel/cactus/venv/local/lib/python2.7/site-packages/toil/job.py", line 1066, in _serialiseJob
Y/k/jobdWlgUP     cPickle.dump(self, fileHandle, cPickle.HIGHEST_PROTOCOL)
Y/k/jobdWlgUP PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
```
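A minimal reproduction sketch with hypothetical jobs B and C: passing the bound method instead of calling it stores an unpicklable function on the child job, which surfaces as the PicklingError above.

```python
# Inside a parent job's run() (hypothetical jobs B and C):
b = self.addChild(B())
self.addChild(C(b.rv))    # wrong: stores the bound method itself
self.addChild(C(b.rv()))  # right: stores a Promise, which pickles cleanly
```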
- [x] The file store's NoSuchFileException repeats the message twice, because the full message gets substituted where the file name should go:
```
kolossus 2017-05-10 22:14:59,551 MainThread WARNING toil.leader: 6/A/jobvU55HL   File "/cluster/home/jcarmstr/venv/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 378, in _checkJobStoreFileID
kolossus 2017-05-10 22:14:59,551 MainThread WARNING toil.leader: 6/A/jobvU55HL     raise NoSuchFileException("File %s does not exist in jobStore" % jobStoreFileID)
kolossus 2017-05-10 22:14:59,551 MainThread WARNING toil.leader: 6/A/jobvU55HL NoSuchFileException: File 'File Anc2.fa does not exist in jobStore' does not exist
```
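A sketch of the fix: pass the bare file ID to the exception and build the sentence in exactly one place. The existence check is hypothetical:

```python
class NoSuchFileException(Exception):
    # Sketch: format the message in exactly one place.
    def __init__(self, jobStoreFileID):
        super(NoSuchFileException, self).__init__(
            "File %r does not exist in the job store." % jobStoreFileID)

def _checkJobStoreFileID(self, jobStoreFileID):
    if not self._fileExists(jobStoreFileID):  # hypothetical existence check
        raise NoSuchFileException(jobStoreFileID)
```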
Issue is synchronized with this Jira Story. Epic: Improve debugging experience. Issue Number: TOIL-170.
> toil ssh-cluster on nonexistent cluster raises IndexError instead of a useful error message:
It looks like this behavior is observed with almost all, if not all, of the Toil utilities. destroy-cluster and clean handle a nonexistent cluster silently:
```
$ toil destroy-cluster does-not-exist
UCSC-amerob-lt.local 2017-06-09 15:06:44,663 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
UCSC-amerob-lt.local 2017-06-09 15:06:50,258 MainThread INFO toil.provisioners.aws.awsProvisioner: Deleting security group...
```
ssh-cluster and rsync-cluster are uninformative:
```
$ toil ssh-cluster does-not-exist
UCSC-amerob-lt.local 2017-06-09 15:07:40,381 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
Traceback (most recent call last):
  File "/Users/natan/Projects/toil/venv/bin/toil", line 11, in <module>
    load_entry_point('toil', 'console_scripts', 'toil')()
  File "/Users/natan/Projects/toil/src/toil/utils/toilMain.py", line 33, in main
    module.main()
  File "/Users/natan/Projects/toil/src/toil/utils/toilSSHCluster.py", line 33, in main
    cluster.sshCluster(args=config.args)
  File "/Users/natan/Projects/toil/src/toil/provisioners/__init__.py", line 74, in sshCluster
    self.provisioner.sshLeader(self.clusterName, args, self.zone)
  File "/Users/natan/Projects/toil/src/toil/provisioners/aws/awsProvisioner.py", line 217, in sshLeader
    leader = cls._getLeader(clusterName, zone=zone)
  File "/Users/natan/Projects/toil/src/toil/provisioners/aws/awsProvisioner.py", line 495, in _getLeader
    leader = instances[0] # assume leader was launched first
IndexError: list index out of range
```
status, stats, and kill all handle this correctly:
```
$ toil kill does-not-exist
UCSC-amerob-lt.local 2017-06-09 15:09:07,310 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
Traceback (most recent call last):
  File "/Users/natan/Projects/toil/venv/bin/toil", line 11, in <module>
    load_entry_point('toil', 'console_scripts', 'toil')()
  File "/Users/natan/Projects/toil/src/toil/utils/toilMain.py", line 33, in main
    module.main()
  File "/Users/natan/Projects/toil/src/toil/utils/toilKill.py", line 37, in main
    jobStore = Toil.resumeJobStore(config.jobStore)
  File "/Users/natan/Projects/toil/src/toil/common.py", line 772, in resumeJobStore
    jobStore.resume()
  File "/Users/natan/Projects/toil/src/toil/jobStores/fileJobStore.py", line 77, in resume
    raise NoSuchJobStoreException(self.jobStoreDir)
toil.jobStores.abstractJobStore.NoSuchJobStoreException: The job store '/Users/natan/Projects/toil/does-not-exist' does not exist, so there is nothing to restart
```
➤ Melaina Legaspi commented:
Mark Diekhans, how do you feel about taking this task on?
➤ Adam Novak commented:
I think we should go through this, see which are not yet fixed/plausibly still broken, and turn those into their own issues. Then we can close this.