Badread
Multi-threading
Is your feature request related to a problem? Please describe.
The simulator is very slow when it comes to:
- Adjusting the lengths of reference contigs. This might not be an issue for human genomes, but it is a big issue for the transcriptome (>180K contigs vs. 23).
- Generating reads.
Both of these steps should have straightforward data parallelism.
Describe the solution you'd like Multithreading of the two steps (and possibly others?)
Describe alternatives you've considered Adding a program command to prepare the reference contigs and pickle the results so rerunning won't be slow. That won't really resolve the read generation speed thu
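A minimal sketch of the "prepare once and pickle" alternative described above. `prepare_contigs` is a hypothetical stand-in for the slow preparation step (here it only parses FASTA records), and `load_prepared_contigs` and the cache file naming are assumptions for illustration, not part of Badread:

```python
import hashlib
import pickle
from pathlib import Path


def prepare_contigs(fasta_path):
    """Hypothetical stand-in for the slow reference-preparation step."""
    contigs = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                contigs[name] = []
            elif name is not None and line:
                contigs[name].append(line)
    return {n: "".join(parts) for n, parts in contigs.items()}


def load_prepared_contigs(fasta_path, cache_dir="."):
    """Return prepared contigs, reusing a pickle if the FASTA is unchanged."""
    fasta = Path(fasta_path)
    stat = fasta.stat()
    # Key the cache on path, size and modification time so edits invalidate it.
    key = f"{fasta.resolve()}:{stat.st_size}:{stat.st_mtime_ns}"
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    cache = Path(cache_dir) / f"{fasta.name}.{digest}.pkl"
    if cache.exists():
        with open(cache, "rb") as handle:
            return pickle.load(handle)
    prepared = prepare_contigs(fasta)
    with open(cache, "wb") as handle:
        pickle.dump(prepared, handle)
    return prepared
```

The first call pays the full preparation cost; later calls on an unchanged reference only unpickle. As noted above, this does nothing for read generation speed.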
Additional context I am building a wrapper around Badread for transcriptomic reads. It's still in the design stage. I plan to code the multithreading described above on a separate branch and make PR
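The data parallelism mentioned above could look roughly like the sketch below. It assumes the per-contig work is independent. `adjust_contig` is a hypothetical placeholder (it only measures length), not Badread's actual function. In CPython, CPU-bound work generally needs processes rather than threads to run in parallel, hence the default `ProcessPoolExecutor`:

```python
from concurrent.futures import ProcessPoolExecutor


def adjust_contig(item):
    """Hypothetical per-contig work: here it only returns the name and length."""
    name, seq = item
    return name, len(seq)


def adjust_all(contigs, workers=4, executor_cls=ProcessPoolExecutor, chunksize=1000):
    """Apply adjust_contig to every contig using a pool of workers."""
    items = list(contigs.items())
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order; chunksize batches items per task and
        # only has an effect with ProcessPoolExecutor.
        return dict(pool.map(adjust_contig, items, chunksize=chunksize))
```

With >180K small contigs, a large `chunksize` matters: sending one contig per task would likely spend more time on inter-process communication than on the work itself. When using processes, call `adjust_all` from under an `if __name__ == "__main__":` guard.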
For others stumbling across this issue, here's a little Snakemake template that mimics multi-threading by running badread multiple times and concatenating the fastq files at the end.
threads = list(range(10))
genome = "genome.fa"

# example to pass through parameters
rlen_mean = 15000
rlen_sd = 13000
sim_params = {"rlen": f"{rlen_mean},{rlen_sd}"}

rule all:
    input:
        expand("reads_{t}.fq", t=threads),
        "sim_reads.fq"

# run badread simulate multiple times on the same input genome
rule badread_sim:
    input: genome
    output: "reads_{t}.fq"
    params:
        rlen = lambda wildcards: sim_params['rlen']
    shell:
        "badread simulate --reference {input} --length {params.rlen} >{output}"

# afterwards simply concatenate all output read files
rule concat_sim:
    input: expand("reads_{t}.fq", t=threads)
    output: "sim_reads.fq"
    shell:
        "cat {input} > {output}"
Just saw that there is already a wiki entry for doing exactly the same thing in bash. Anyway, maybe this is still useful for someone.
Before anyone does the same thing I did and blindly follows W-L's answer, note that doing so will on some occasions generate the same read name multiple times. This might affect your pipeline, especially if you clean your reads later: minimap2 does not care if multiple reads with the same name appear and will just map them individually, leading to secondary / chimeric alignments.
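One way to check a concatenated file for this before mapping is sketched below. It assumes strict 4-line FASTQ records and takes the read name to be the first whitespace-delimited token after `@`. `duplicated_read_names` is a hypothetical helper, not part of Badread or minimap2:

```python
from collections import Counter
from itertools import islice


def duplicated_read_names(fastq_path):
    """Return {read_name: count} for names that occur more than once."""
    counts = Counter()
    with open(fastq_path) as handle:
        # Assumes strict 4-line FASTQ records, so headers are lines 0, 4, 8, ...
        for header in islice(handle, 0, None, 4):
            fields = header.strip()[1:].split()
            if fields:
                counts[fields[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

An empty result for `duplicated_read_names("sim_reads.fq")` means every read name in the concatenated file is unique.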
@jsgounot did you get the same read name multiple times? If so, you should buy a lottery ticket, as the read names are generated with uuid:
https://github.com/rrwick/Badread/blob/09fb3082e5b2530c4e17e20e262ff227eb28ff13/badread/simulate.py#L77
I know but I'm not as lucky with the lottery sadly ...
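To put a number on the lottery comparison, here is a rough birthday-bound estimate. It assumes the IDs are independent, uniformly random values with 122 random bits, as in a version-4 UUID. Whether that matches the names Badread produces is an assumption, and `uuid_collision_probability` is a hypothetical helper:

```python
import math


def uuid_collision_probability(n_reads, random_bits=122):
    """Approximate chance that at least two of n_reads random IDs coincide."""
    pairs = n_reads * (n_reads - 1) / 2
    # Birthday approximation: p is about 1 - exp(-pairs / 2**random_bits).
    return -math.expm1(-pairs / 2.0 ** random_bits)
```

Even for a billion reads the estimate stays below 1e-18, so under these assumptions chance collisions are not a plausible explanation for repeated names. The estimate says nothing about cases where the IDs are not independent, for example if separate runs were to start from the same random state.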