bsub template many jobs in single command.

We often have a list of files and want to do the same thing to all of them. If we know the length, we can use an array, but what if it could be like:


    inputs = "/path/to/*.sam"
    command = "samtools view -bS {input} > results/{basename}.bam"

where we automatically define things like basename, dirname, date, wd, input etc. based on each item in input.
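A rough sketch of how those per-input variables might be derived. `template_vars` and `render` are hypothetical helper names, not part of bsub, and treating `basename` as the file name with its extension stripped is an assumption made so that `results/{basename}.bam` reads sensibly:

```python
import datetime
import glob
import os


def template_vars(path):
    """Build the dict of variables available to a command template."""
    fname = os.path.basename(path)
    return {
        "input": path,
        "dirname": os.path.dirname(path),
        # assumption: basename drops the extension, so that
        # results/{basename}.bam does not become results/x.sam.bam
        "basename": os.path.splitext(fname)[0],
        "date": datetime.date.today().isoformat(),
        "wd": os.getcwd(),
    }


def render(command, inputs):
    """Expand a glob pattern and fill the template once per matching file."""
    paths = sorted(glob.glob(inputs))
    return [command.format(**template_vars(p)) for p in paths]
```

Each rendered string is what would then be handed to a single job submission.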

Then we could specify a name that is either a function that is called on each input and returns a name (the default is os.path.basename):

    def name(input):
        return input.split("/")[-1].split(".")[1]

or a string that's a regular expression applied to the input that captures the name:

    name = r"info\.(sample_\d+)\.sam"

Where the capture from (sample_\d+) would become available as name. Then users could use "{name}" in their template. They could also use named captures in the regexp, which could then be used in their format string (http://docs.python.org/2/library/re.html#re.MatchObject.groupdict)
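A small sketch of the regexp case, using only the standard `re` module. `name_vars` is a hypothetical helper, and falling back to the first capture group for `{name}` when there is no group called `name` is an assumption about how this could behave:

```python
import re


def name_vars(pattern, path):
    """Return template variables captured from path by a regexp."""
    match = re.search(pattern, path)
    if match is None:
        raise ValueError("pattern %r did not match %r" % (pattern, path))
    info = match.groupdict()  # named captures, e.g. (?P<sample>...)
    if "name" not in info and match.groups():
        # assumption: with no group called "name", the first capture is {name}
        info["name"] = match.group(1)
    return info


info = name_vars(r"info\.(sample_\d+)\.sam", "/data/info.sample_12.sam")
print("results/{name}.bam".format(**info))
```

Any named capture ends up as a key in the dict, so it can be used directly in the format string.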

I think this would clean up a lot of my pipelines. So then it would be like:

    from bsub import bsub
    job_ids, names = bsub.template(command, inputs, name_getter=name)

If command is actually a list:

    commands = ["samtools view -bS {input} > results/{basename}.bam",
                "samtools index results/{basename}.bam"]

    bsub.template(commands, inputs)

then, for each input, it runs job(commands[0]).then(commands[1]) ... and so on for as many commands as needed.
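The chaining logic could be sketched roughly like this. `submit_chain` and the `submit` callback are hypothetical names used for illustration; the callback stands in for whatever actually submits a job, so nothing here touches LSF:

```python
def submit_chain(commands, variables, submit):
    """Render and submit commands in order, each waiting on the previous one.

    `submit(command, after)` is a caller-supplied function that submits one
    job (to run after job id `after`, or immediately if `after` is None)
    and returns the new job id.
    """
    if isinstance(commands, str):
        commands = [commands]  # a single command is a chain of length one
    job_id = None
    job_ids = []
    for command in commands:
        job_id = submit(command.format(**variables), job_id)
        job_ids.append(job_id)
    return job_ids
```

A real `submit` could express the dependency with LSF's `-w "done(<job_id>)"` option; in a test it can simply record the calls it receives.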

This would be nicer if we had better support for checking successful completion of a job inside of .then() by polling and checking the log file for successful termination instead of relying on LSF's wait().
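A possible shape for that polling, as a sketch only. `wait_for_success` is a hypothetical helper, and it assumes the job's output log contains the usual LSF report lines ("Successfully completed." on success, "Exited with exit code N." on failure); the exact wording may vary between LSF versions and site configurations:

```python
import time


def wait_for_success(log_path, poll=30.0, timeout=3600.0):
    """Poll a job's output log until it reports success or failure.

    Returns True on success and False on failure; raises RuntimeError
    if neither marker appears before the timeout.
    """
    deadline = time.time() + timeout
    while True:
        try:
            with open(log_path) as fh:
                text = fh.read()
        except OSError:  # log file not written yet
            text = ""
        # assumption: these are the marker strings in the LSF job report
        if "Successfully completed." in text:
            return True
        if "Exited with exit code" in text:
            return False
        if time.time() >= deadline:
            raise RuntimeError("timed out waiting for %s" % log_path)
        time.sleep(poll)
```

.then() could call something like this on the previous job's log and only submit the next command when it returns True.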

As written, template would be a classmethod. It could also be called "multisub".

Mar 21 '14 17:03 brentp

A complete example (working in the template branch) would look like:

    import re
    from bsub import bsub

    commands = 'bwameth.py --reference {ref} --read-group {sample} -p {results}/{name} {r1} {r2}'
    inputs = "/proj/brentp/2014/ken-target-bis-seq/*/*_R1_*.fastq.gz"

    def namer(fq):
        info = re.match(r".+/(?P<name>(?P<sample>.+)_[ATCG]+_L\d+)_R1_[01]{3}", fq).groupdict()
        info['r1'] = fq
        info['r2'] = info['r1'].replace('_R1_', '_R2_')
        return info

    bsub.template(commands, inputs, verbose=666, name_getter=namer,
            info_dict=dict(ref="/path/to/ref.fa", results="/results/"), n=12, R="rusage[mem=100]")

Here I have to use name_getter as a function because I need to derive the R2 path from R1.

The extra kwargs like n and R get sent straight to bsub.

Mar 21 '14 23:03 brentp

As of e438378e8b7b3ac0c78783f1ece0f73ecd the above can be shortened to:

    import re
    from bsub import bsub

    commands = 'bwameth.py --reference {ref} --read-group {sample} -p {results}/{name} {r1} {r2}'
    inputs = "/proj/brentp/2014/ken-target-bis-seq/*/*_R1_*.fastq.gz"

    namer = r".+/(?P<name>(?P<sample>.+)_[ATCG]+_L\d+)_R1_[01]{3}"

    bsub.template(commands, inputs, verbose=666, name_getter=namer,
            info_dict=dict(ref="/path/to/ref.fa", results="/results/"), n=12, R="rusage[mem=100]")

This works by checking whether the first file is a fastq and whether the file given by s/R1/R2/ on its name exists. If so, r1, r2, fq1 and fq2 are autofilled and available as template variables.
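The detection could look something like the sketch below. This is not the code from the commit; `paired_fastq_vars` is a hypothetical helper, the list of FASTQ extensions is a guess, and it substitutes `_R1_` with `_R2_` (as in the namer function earlier in this thread) rather than a bare R1/R2 replacement:

```python
import os


def paired_fastq_vars(path):
    """Return r1/r2/fq1/fq2 if path is a FASTQ whose R2 mate exists, else {}."""
    # assumption: these suffixes are what counts as "a fastq"
    fastq_exts = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
    if not path.endswith(fastq_exts):
        return {}
    mate = path.replace("_R1_", "_R2_")
    if mate == path or not os.path.exists(mate):
        return {}  # no R1 marker in the name, or no mate file on disk
    return {"r1": path, "r2": mate, "fq1": path, "fq2": mate}
```

The returned dict would be merged into the other template variables before the command is formatted.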

Mar 23 '14 13:03 brentp
