Multi-threading

Open baraaorabi opened this issue 5 years ago • 4 comments

Is your feature request related to a problem? Please describe.

The simulator is very slow when it comes to

  • Adjusting the lengths of reference contigs. This might not be an issue for human genomes, but it is a big issue for transcriptomes (>180K contigs vs. 23).
  • Generating reads.

Both of these steps should have straightforward data parallelism.

Describe the solution you'd like

Multithreading of the two steps (and possibly others?)

Describe alternatives you've considered

Adding a program command to prepare the reference contigs and pickle the results so rerunning won't be slow. That won't really resolve the read generation speed, though.
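The caching alternative could look roughly like the sketch below. It is not part of Badread: `load_or_build` and `build_fn` are hypothetical names, and `build_fn` stands in for whatever slow step prepares the reference contigs.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return the cached object if cache_path exists, else build and cache it.

    build_fn is a hypothetical zero-argument callable standing in for the
    slow reference-preparation step; it is not a Badread function.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    result = build_fn()

    # Write to a temporary file first so an interrupted run does not leave
    # a truncated cache file behind.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    with tmp_path.open("wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    tmp_path.replace(cache_path)
    return result
```

A cache like this has to be deleted or rebuilt whenever the reference or the preparation parameters change, since the sketch only checks whether the file exists.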

Additional context

I am building a wrapper around Badread for transcriptomic reads. It's still in the design stage. I plan to code the multithreading described above on a separate branch and make a PR.

baraaorabi avatar Sep 26 '19 23:09 baraaorabi

For others stumbling across this issue, here's a little Snakemake template that mimics multi-threading by running Badread multiple times and concatenating the FASTQ files at the end.

threads = list(range(10))
genome = "genome.fa"

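# example to pass through parameters
rlen_mean = 15000
rlen_sd = 13000
sim_params = {
    "rlen": f"{rlen_mean},{rlen_sd}",
    # badread simulate requires --quantity; "5x" per run is a placeholder
    # value, so the combined depth is 5x times the number of runs
    "quantity": "5x",
}

rule all:
    input:
        expand("reads_{t}.fq", t=threads),
        "sim_reads.fq"

# run badread simulate multiple times on the same input genome
rule badread_sim:
    input: genome
    output: "reads_{t}.fq"
    params:
        rlen = lambda wildcards: sim_params["rlen"],
        quantity = lambda wildcards: sim_params["quantity"]
    shell:
        "badread simulate --reference {input} --quantity {params.quantity} "
        "--length {params.rlen} > {output}"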

# afterwards simply concatenate all output read files
rule concat_sim:
    input: expand("reads_{t}.fq",t=threads)
    output: "sim_reads.fq"
    shell:
        "cat {input} > {output}"

Just saw that there is already a wiki entry for doing exactly the same thing in bash. Anyway, maybe this is still useful for someone.

W-L avatar Jul 02 '21 13:07 W-L

Before anyone does what I did and blindly follows W-L's answer, note that doing so will on some occasions generate the same read name multiple times. This might affect your pipeline, especially if you clean your reads later, since minimap2 does not care whether multiple reads share a name and will just map them individually, leading to secondary / chimeric alignments.
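One way to check a concatenated file for this problem is sketched below. It assumes a plain-text FASTQ with exactly four lines per record; `duplicate_read_names` is a name made up for this example and is not part of Badread or minimap2.

```python
from collections import Counter


def duplicate_read_names(fastq_path):
    """Return {read_name: count} for names that occur more than once.

    Assumes an uncompressed FASTQ with four lines per record, where the
    read name is the first whitespace-delimited token after '@' on the
    header line.
    """
    counts = Counter()
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only every fourth line is a header
            fields = line[1:].split()
            if fields:
                counts[fields[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicate_read_names("sim_reads.fq")` on the merged output returns an empty dict when every read name is unique.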

jsgounot avatar Mar 25 '22 06:03 jsgounot

@jsgounot did you get the same read name multiple times? If so, you should buy a lottery ticket as the read names are generated with uuid https://github.com/rrwick/Badread/blob/09fb3082e5b2530c4e17e20e262ff227eb28ff13/badread/simulate.py#L77
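To put a rough number on the lottery comparison, here is a small sketch of the birthday-bound estimate. It assumes names are drawn independently and uniformly at random; the default of 122 random bits is the figure for a version-4 UUID and is an assumption about how the names are generated, not something checked against the Badread source.

```python
def collision_probability(n_names, random_bits=122):
    """Approximate chance that at least two of n_names random IDs coincide.

    Uses the birthday-bound approximation p ≈ n(n-1) / 2^(bits+1), which
    is accurate while p is small. random_bits=122 corresponds to a
    version-4 UUID (an assumption, see the text above).
    """
    return n_names * (n_names - 1) / 2 ** (random_bits + 1)


# Example: ten runs of one million reads each, pooled together
p = collision_probability(10 * 1_000_000)
```

The estimate only covers independent random draws; it says nothing about runs that were started with the same random seed.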

mbhall88 avatar Jun 14 '23 05:06 mbhall88

I know but I'm not as lucky with the lottery sadly ...

jsgounot avatar Jun 14 '23 09:06 jsgounot