ebi-metagenomics-cwl

toil with CWL on LSF status

Open mr-c opened this issue 7 years ago • 4 comments

  • [x] ~~Must keep --workdir on a non shared filesystem like /tmp until https://github.com/BD2KGenomics/toil/pull/1573 is merged (Might be better from a performance perspective anyway)~~
  • [x] ~~Make sure to specify --retries 1 or higher so that killed jobs get retried with at least the default memory (from --defaultMemory 10Gi or similar) automatically~~ Nope, specify minimum memory by hand and update the values as jobs fail.
  • [x] Speaking of memory, add ResourceRequirements with fixed ~~or dynamic~~ ramMin to all tools.
  • [x] test specifying ResourceRequirements at the Workflow and WorkflowStep levels
  • [x] toil is experiencing a serialization bug, so don't use format with multiple options for inputs (for now) https://github.com/BD2KGenomics/toil/issues/1692
  • [x] ~~--preserve-environment takes a space separated list of environment variables to preserve, not a comma separated list as the docs previously reported https://github.com/BD2KGenomics/toil/pull/1689~~
  • [x] Use the TOIL_LSF_ARGS to specify the queue in your runscript: export TOIL_LSF_ARGS="-q production-rh7" ~~https://github.com/BD2KGenomics/toil/pull/1640~~
  • [x] ~~There's an error in enumerating jobs in Toil 3.7.0, fix is at https://github.com/BD2KGenomics/toil/pull/1690~~
  • [x] Toil doesn't have an override for cwltool's strict filename check, so be sure to strip out offending characters such as :, example at https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/commit/767cc8f54cb26ad2b53c544c2e2054d8e7116a26 https://github.com/BD2KGenomics/toil/issues/1782
  • [x] like most cwltool based CWL executors, you'll be happier if you set a dedicated output directory via --outdir
  • [x] the CWL output object is written to stdout, so redirect that to a file for posterity (example: cwltoil […] | tee output)
  • [x] --restart is handy for resuming a previous run, but (due to the lack of cache support while using the LSF batch system) any changes to the CWL descriptions will require a clean start
  • [x] apparently Toil will "make up" resource requirements on its own (randomly?) for tools without those annotations, so better to be safe, lest cat gets assigned 16 cores and 100GiB of memory :-)
  • [x] Toil runs its tests against many batch systems (SLURM, Yarn, parasol, mesos, spark, GridEngine), but not LSF -- need to add setup instructions for spinning up an LSF cluster to https://github.com/BD2KGenomics/cgcloud/blob/master/jenkins/src/cgcloud/jenkins/toil_jenkins_slave.py
  • [ ] Review Globus toolkit's LSF code for inspiration: https://github.com/globus/globus-toolkit/blob/globus_6_branch/gram/jobmanager/lrms/lsf/source/lsf.pm
  • [ ] how to capture timestamps? they are output to stderr, but not in the on-disk log
  • [ ] how to capture output from LSF?
  • [x] Is it possible to run the housekeeping jobs on the launcher node and not via cluster submission? (CWLWorkflow, ResolveIndirect, CWLGather, CWLScatter, etc.. ) https://github.com/BD2KGenomics/toil/issues/1783
  • [x] Restore usage of InitialWorkDirRequirement and confirm
  • [ ] write up the above lessons learned
  • [ ] we don't request space in /tmp even though Toil does write there
  • [x] Migrate Toil's LSF code to use AbstractGridEngineBatchSystem https://github.com/BD2KGenomics/toil/pull/2043
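
For the strict filename check mentioned in the list above, here is a minimal sketch of stripping the offending character from a name before it reaches the workflow. The helper name `sanitize_name` and the choice of `_` as the replacement are assumptions for illustration, not necessarily what the linked commit does. Only `:` is named in this issue, so that is the only character handled.

```shell
# Hypothetical helper: replace ':' (the one offending character named in
# this issue) with '_' so the name passes cwltool's strict filename check.
sanitize_name() {
    printf '%s\n' "$1" | tr ':' '_'
}

# Example: compute a safe name for one file before staging it.
safe=$(sanitize_name "sample:1.fastq")
echo "$safe"
```

Extend the `tr` character set if other characters turn out to be rejected as well.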

~~Current working branch with the bulk of the above fixes merged: https://github.com/mr-c/toil/tree/issues/1666-fail-not-on-unsubmitted-jobs~~ Latest Toil release has all of the above mentioned fixes merged
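
A rough sketch of a runscript that combines the points from the checklist: queue selection via `TOIL_LSF_ARGS`, a dedicated `--outdir`, and saving the CWL output object from stdout. The file names (`workflow.cwl`, `job.yml`, `output.json`), the `results` directory and the `DRY_RUN` switch are placeholders invented for this sketch. Only the environment variable and the `--batchSystem` and `--outdir` options come from this thread. By default it only prints the command so it can be checked before submitting anything.

```shell
#!/bin/sh
# Queue selection for LSF, as in the checklist.
export TOIL_LSF_ARGS="-q production-rh7"

# Placeholder paths (assumptions): adjust to the real workflow and inputs.
OUTDIR="${OUTDIR:-$PWD/results}"      # dedicated output directory
WORKFLOW="${WORKFLOW:-workflow.cwl}"  # hypothetical workflow description
JOB="${JOB:-job.yml}"                 # hypothetical input object

# Build the command as a string so it can be inspected first.
# Note: this simple form assumes the paths contain no spaces.
CMD="cwltoil --batchSystem LSF --outdir $OUTDIR $WORKFLOW $JOB"

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: show what would be run.
    echo "$CMD"
else
    mkdir -p "$OUTDIR"
    # The CWL output object is written to stdout; keep a copy.
    $CMD | tee "$OUTDIR/output.json"
fi
```

Run it with `DRY_RUN=0` to actually submit. Adding `--restart` to resume a previous run is left out here because any change to the CWL descriptions needs a clean start anyway.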

mr-c avatar May 25 '17 09:05 mr-c

Note: In cwltoil, sub-workflows must fully complete before any of their outputs are available for use by any other step/job. For example, the go_summary in the functional analysis (IPS) workflow isn't subject to further processing, but its production holds up the availability of the functional_annotations for further processing by the parent workflow.

mr-c avatar May 27 '17 10:05 mr-c

To run the CWL conformance tests using cwltoil on LSF (starting from a Toil source checkout):

virtualenv env
source env/bin/activate
pip install -U pip
pip install -U setuptools wheel
pip install .[cwl]
git clone https://github.com/common-workflow-language/common-workflow-language.git
cd common-workflow-language
pip install cwltest
TMP=$PWD ./run_test.sh RUNNER=toil-cwl-runner EXTRA="--batchSystem LSF --logDebug --logFile ${PWD}/log --disableCaching --user-space-docker-cmd=udocker" -j8

(edited to use double quotes instead of single quotes around EXTRA) (edited to set TMP to a path on the shared filesystem, needed for cwltest)

mr-c avatar May 27 '17 11:05 mr-c

Note for @mr-c: here's what I got from toil[cwl] running a workflow on a single machine. At some point I get warnings about disk usage, although I never specified any requirements for it:

ripley 2017-06-08 18:50:26,180 Thread-82 WARNING toil.statsAndLogging: Got message from job at time 06-08-2017 18:50:26: Job used more disk than requested. Please reconsider modifying the user script to avoid the chance  of failure due to incorrectly requested resources. Job 'file:///home/hmenager/ReproHackathon/reprohackathon1/cwl/tools/fastq-dump.cwl' fastq-dump 8/A/job6HSIGE used 128.93% (2.6 GB [2768723968B] used, 2.0 GB [2147483648B] requested) at the end of its run.

The tool itself is defined there: https://github.com/IFB-ElixirFr/ReproHackathon/blob/cwl/reprohackathon1/cwl/tools/fastq-dump.cwl
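
As a quick check of the numbers in that warning, the arithmetic can be reproduced from the two byte counts in the log line. The request of 2147483648 B is exactly 2 GiB, which looks like a default since no requirement was specified (an inference from the log, not confirmed here). The second value computed below is the used space rounded up to whole MiB, the unit CWL's `ResourceRequirement` uses for `tmpdirMin` and `outdirMin`. How Toil maps those fields onto its disk request is an assumption to verify.

```shell
# Byte counts copied from the warning above.
used=2768723968
requested=2147483648

# Usage as a percentage of the request (awk for floating point).
pct=$(awk -v u="$used" -v r="$requested" 'BEGIN { printf "%.2f", u / r * 100 }')

# Used space rounded up to whole MiB (2**20 bytes).
mib=$(( (used + 1048575) / 1048576 ))

echo "used ${pct}% of request; needs at least ${mib} MiB"
```

This prints 128.93%, matching the log, and 2641 MiB as the smallest whole-MiB figure that would have covered this run.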

hmenager avatar Jun 08 '17 16:06 hmenager

Priorities:

  • [x] correct unit detection https://github.com/BD2KGenomics/toil/issues/1691, see https://github.com/BD2KGenomics/toil/pull/1762
  • [x] move LSF to leverage the abstractGridEngineBatchSystem
  • [x] fix dynamic resource requirements: https://github.com/BD2KGenomics/toil/issues/1647

mr-c avatar Jul 18 '17 15:07 mr-c