ebi-metagenomics-cwl
toil with CWL on LSF status
- [x] ~~Must keep `--workdir` on a non-shared filesystem like `/tmp` until https://github.com/BD2KGenomics/toil/pull/1573 is merged (might be better from a performance perspective anyway)~~
- [x] ~~Make sure to specify `--retries 1` or higher so that killed jobs get retried with at least the default memory (from `--defaultMemory 10Gi` or similar) automatically~~ Nope, hand-specify minimum memory and update those values as jobs fail.
- [x] Speaking of memory, add `ResourceRequirement`s with fixed ~~or dynamic~~ `ramMin` to all tools.
- [x] test specifying `ResourceRequirement`s at the Workflow and WorkflowStep levels
- [x] toil is experiencing a serialization bug, so don't use `format` with multiple options for inputs (for now) https://github.com/BD2KGenomics/toil/issues/1692
- [x] ~~`--preserve-environment` takes a space-separated list of environment variables to preserve, not a comma-separated list as the docs previously reported https://github.com/BD2KGenomics/toil/pull/1689~~
- [x] Use the `TOIL_LSF_ARGS` environment variable to specify the queue in your runscript: `export TOIL_LSF_ARGS="-q production-rh7"` ~~https://github.com/BD2KGenomics/toil/pull/1640~~
- [x] ~~There's an error in enumerating jobs in Toil 3.7.0, fix is at https://github.com/BD2KGenomics/toil/pull/1690~~
- [x] Toil doesn't have an override for `cwltool`'s strict filename check, so be sure to strip out offending characters such as `:`, example at https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/commit/767cc8f54cb26ad2b53c544c2e2054d8e7116a26 https://github.com/BD2KGenomics/toil/issues/1782
- [x] like most `cwltool`-based CWL executors, you'll be happier if you set a dedicated output directory via `--outdir`
- [x] the CWL output object is written to `stdout`, so redirect that to a file for posterity (example: `cwltoil […] | tee output`)
- [x] `--restart` is handy for resuming a previous run, but (due to the lack of cache support while using the LSF batch system) any changes to the CWL descriptions will require a clean start
- [x] apparently Toil will "make up" resource requirements on its own (randomly?) for tools without those annotations, so better be safe lest `cat` get assigned 16 cores and 100GiB of memory :-)
- [x] Toil runs testing using many batch systems (SLURM, Yarn, parasol, mesos, spark, GridEngine), but not LSF -- need to add setup instructions for spinning up an LSF cluster to https://github.com/BD2KGenomics/cgcloud/blob/master/jenkins/src/cgcloud/jenkins/toil_jenkins_slave.py
- [ ] Review Globus toolkit's LSF code for inspiration: https://github.com/globus/globus-toolkit/blob/globus_6_branch/gram/jobmanager/lrms/lsf/source/lsf.pm
- [ ] how to capture timestamps? they are output to stderr, but not in the on disk log
- [ ] how to capture output from LSF?
- [x] Is it possible to run the housekeeping jobs on the launcher node and not via cluster submission? (`CWLWorkflow`, `ResolveIndirect`, `CWLGather`, `CWLScatter`, etc.) https://github.com/BD2KGenomics/toil/issues/1783
- [x] Restore usage of `InitialWorkDirRequirement` and confirm
- [ ] write up the above lessons learned
- [ ] we don't request space in /tmp even though Toil does write there
- [x] Migrate Toil's LSF code to use AbstractGridEngineBatchSystem https://github.com/BD2KGenomics/toil/pull/2043
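
As a rough illustration of the fixed `ramMin` annotation recommended in the checklist, the following sketch writes a minimal CWL v1.0 tool description for `cat` with an explicit `ResourceRequirement`. The file name `cat-tool.cwl`, the input and output names, and the values (1 core, 1024 MiB) are placeholders chosen for this example and are not taken from the ebi-metagenomics-cwl tool descriptions.

```shell
#!/usr/bin/env sh
set -eu

# Write a minimal CWL tool with fixed resource requests so the executor
# does not have to guess. File name and numbers are illustrative only.
cat > cat-tool.cwl <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: cat
requirements:
  ResourceRequirement:
    coresMin: 1
    ramMin: 1024  # mebibytes
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
outputs:
  concatenated:
    type: stdout
stdout: cat-output.txt
EOF

# Confirm the annotation is present before submitting anything.
grep -n 'ramMin' cat-tool.cwl
```

When a job is killed for exceeding its memory, raising `ramMin` in the tool description and rerunning follows the "hand-specify and update as jobs fail" approach from the list.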
~~Current working branch with the bulk of the above fixes merged: https://github.com/mr-c/toil/tree/issues/1666-fail-not-on-unsubmitted-jobs~~ Latest Toil release has all of the above-mentioned fixes merged.
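
Several of the checklist items (queue selection via `TOIL_LSF_ARGS`, a dedicated `--outdir`, and saving the output object from `stdout`) can be combined in one runscript. The sketch below is only one possible arrangement: the paths `results`, `workflow.cwl`, and `job.yml`, the `RUN` switch, and the `command.txt` and `output.json` file names are hypothetical. It prints the command by default rather than submitting anything.

```shell
#!/usr/bin/env sh
set -eu

# Queue name is the one quoted in the checklist; change it for your site.
export TOIL_LSF_ARGS="-q production-rh7"

# Hypothetical paths, used only for this sketch.
OUTDIR="$PWD/results"
WORKFLOW="workflow.cwl"
JOB="job.yml"
mkdir -p "$OUTDIR"

# Hold the command in the positional parameters so each argument stays intact.
set -- cwltoil --batchSystem LSF --outdir "$OUTDIR" "$WORKFLOW" "$JOB"

# Record the exact invocation next to the results.
echo "TOIL_LSF_ARGS='$TOIL_LSF_ARGS' $*" > "$OUTDIR/command.txt"

if [ "${RUN:-0}" = "1" ]; then
  # The CWL output object arrives on stdout; tee keeps a copy on disk.
  # Note: in plain sh the pipeline reports tee's exit status, not cwltoil's.
  "$@" | tee "$OUTDIR/output.json"
else
  # Dry run: show what would be executed.
  cat "$OUTDIR/command.txt"
fi
```

Invoking the script with `RUN=1` in the environment would perform the actual submission. Adding `--restart` to the argument list is left out here because, as noted in the list, it only helps when the CWL descriptions have not changed.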
Note: In `cwltoil`, sub-workflows must fully complete before any of their outputs are available for use by any other step/job. For example, the `go_summary` in the functional analysis (IPS) workflow isn't subject to further processing, but its production holds up the availability of the `functional_annotations` for further processing by the parent workflow.
To run the CWL conformance tests using `cwltoil` on LSF:
virtualenv env
source env/bin/activate
pip install -U pip
pip install -U setuptools wheel
pip install .[cwl]
git clone https://github.com/common-workflow-language/common-workflow-language.git
cd common-workflow-language
pip install cwltest
TMP=$PWD ./run_test.sh RUNNER=toil-cwl-runner EXTRA="--batchSystem LSF --logDebug --logFile ${PWD}/log --disableCaching --user-space-docker-cmd=udocker" -j8
(edited to use `"` double quotes instead of single quotes with `EXTRA`)
(edited to set TMP to a path on the shared filesystem, needed for cwltest)
Note for @mr-c: here's what I got from toil[cwl] when running a workflow on a single machine. At some point it complains about disk usage, although I never specified any requirements for that:
ripley 2017-06-08 18:50:26,180 Thread-82 WARNING toil.statsAndLogging: Got message from job at time 06-08-2017 18:50:26: Job used more disk than requested. Please reconsider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job 'file:///home/hmenager/ReproHackathon/reprohackathon1/cwl/tools/fastq-dump.cwl' fastq-dump 8/A/job6HSIGE used 128.93% (2.6 GB [2768723968B] used, 2.0 GB [2147483648B] requested) at the end of its run.
The tool itself is defined there: https://github.com/IFB-ElixirFr/ReproHackathon/blob/cwl/reprohackathon1/cwl/tools/fastq-dump.cwl
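
To make the numbers in that warning concrete, the sketch below recomputes the reported percentage from the two byte counts and rounds the observed usage up to whole mebibytes, the unit CWL uses for `outdirMin` and `tmpdirMin`. Whether the space was consumed in the output directory or the temporary directory is not stated in the log, so the choice of `outdirMin` here is an assumption, as is the `disk-requirement.yml` file name. It also assumes the executor maps these fields onto its disk request.

```shell
#!/usr/bin/env sh
set -eu

# Byte counts copied from the warning message.
used_bytes=2768723968
requested_bytes=2147483648

# Percentage of the request that was actually used.
pct=$(awk -v u="$used_bytes" -v r="$requested_bytes" \
  'BEGIN { printf "%.2f", u / r * 100 }')

# Round the observed usage up to whole mebibytes (2^20 bytes).
mib=$(( (used_bytes + 1048575) / 1048576 ))

echo "used ${pct}% of the request; observed usage is about ${mib} MiB"

# Hypothetical fragment for the tool's requirements section.
# outdirMin is an assumption; tmpdirMin may be the right field instead.
cat > disk-requirement.yml <<EOF
ResourceRequirement:
  outdirMin: ${mib}
EOF
```

In practice one would request some headroom above the observed figure rather than the exact value.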
Priorities:
- [x] correct unit detection https://github.com/BD2KGenomics/toil/issues/1691, see https://github.com/BD2KGenomics/toil/pull/1762
- [x] move LSF to leverage the abstractGridEngineBatchSystem
- [x] fix dynamic resource requirements: https://github.com/BD2KGenomics/toil/issues/1647