parallel-fastq-dump
Benchmark comparison
Hi,
This is more for your information and not an issue.
I wanted to let you know about a comparison that I ran between parallel-fastq-dump and sra-tools prefetch + fasterq-dump. You can find the code and results in this repo.
This is the way that I invoke parallel-fastq-dump, so if you see some problem, have a tweak, or think that it is an unfair comparison, please let me know.
hello, thanks for letting me know.
I don't use Nextflow, so excuse me if I misunderstood, but did you make sure that prefetch is also run before parallel-fastq-dump?
also, if you are running multiple prefetch/parallel-fastq-dump/fasterq-dump concurrently this could be affecting the results.
Happy to answer your questions:
- I ran every process sequentially in order to not affect bandwidth.
- The plot in the readme compares the recorded duration of parallel-fastq-dump on one axis against the total duration (sum) of running prefetch + fasterq-dump on the other. Nextflow creates separate workspaces for each process, so prefetch has no influence on parallel-fastq-dump.
fastq-dump (which actually does the work internally in parallel-fastq-dump) is not very good at downloading, so it would be faster to run prefetch first and use parallel-fastq-dump just to dump the .sra file. But I guess it depends on how you want to compare; you could:
- run both parallel-fastq-dump and fasterq-dump without prefetch, to compare download+dumping times
- run both parallel-fastq-dump and fasterq-dump with prefetch first, to compare just the dumping times
it's already expected that downloading with fastq-dump is worse than downloading with prefetch.
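As a sketch, the two comparison setups could look like this. The accession SRR000001, the thread count, and the output directories are placeholders, and the flags assume reasonably recent sra-tools and parallel-fastq-dump releases, so double-check them against your versions:

```shell
#!/bin/sh
# Dry-run sketch of the two comparison modes; `run` only prints the commands.
# Swap `echo "+ $*"` for `"$@"` to actually execute them.
ACC=SRR000001   # placeholder accession
THREADS=4
run() { echo "+ $*"; }

# Mode 1: download + dump combined (no prefetch)
run parallel-fastq-dump --sra-id "$ACC" --threads "$THREADS" --split-files --outdir pfd_nopf/
run fasterq-dump "$ACC" --threads "$THREADS" --split-files --outdir fqd_nopf/

# Mode 2: prefetch first, then time only the dumping step
run prefetch "$ACC" --output-directory sra/
run parallel-fastq-dump --sra-id "sra/$ACC/$ACC.sra" --threads "$THREADS" --split-files --outdir pfd_pf/
run fasterq-dump "sra/$ACC/$ACC.sra" --threads "$THREADS" --split-files --outdir fqd_pf/
```

Timing each `run` line separately (e.g. with `time`) then gives the download+dump comparison from mode 1 and the dump-only comparison from mode 2.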
I see, I had understood parallel-fastq-dump as a "complete package" so I hadn't considered using prefetch first.
yes, and I bet many people run parallel-fastq-dump without realizing this (despite it being explained in the readme), so I wouldn't say it's an "unfair" comparison.
another interesting point is using --gzip or --bzip2 on parallel-fastq-dump. For big SRAs the size difference between compressed and uncompressed fastq can be very big, so writing compressed files could finish faster (maybe? I don't know). The problem is that fasterq-dump doesn't support writing compressed files, so you can't compare 1:1.
I think this will be a next step. My plan is to compare the speed of compressed output from parallel-fastq-dump with running fasterq-dump + pigz for parallel compression. Do you know what the compression level is in fastq-dump --gzip?
> Do you know what the compression level is in fastq-dump --gzip?
no idea, but I guess leaving pigz on the default would be reasonable.
Alright, I've made a new benchmark, first using prefetch for all tools and also including compression. I have summarized the results in the readme: https://github.com/UnseenBio/sra-demo-benchmark/tree/benchmark-prefetch
very interesting, a few comments in no particular order:
- maybe the variance of cpu/mem can be explained by the sra size
- maybe you can repeat each sra 3 times and take the mean/median to account for weird stuff
- does this trend hold if you use 8 threads? 16 threads?
- would parallel-fastq-dump+pigz be faster ?
don't take this as criticism; it's just things that I thought of while looking at the plots.
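The repeat-each-SRA-three-times suggestion amounts to a small timing loop plus a median. A minimal sketch, where `sleep 1` stands in for the real pipeline (an assumption for illustration):

```shell
#!/bin/sh
# Run the pipeline 3 times and report the median wall-clock time in seconds.
times=""
for i in 1 2 3; do
    start=$(date +%s)
    sleep 1              # placeholder for e.g. `fasterq-dump "$ACC" ...`
    end=$(date +%s)
    times="$times $((end - start))"
done
# Median of 3 values = 2nd value after numeric sort.
median=$(printf '%s\n' $times | sort -n | sed -n '2p')
echo "median: ${median}s"
```

The median is preferable to the mean here because a single slow run (cache miss, network hiccup) won't drag the summary statistic with it.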
I think you make very valid points. I'm currently limited to one desktop computer, though 🙂 So the combinatorial increase in jobs and threads is a bit much for that. If you have more resources available, I'm happy to adjust the pipeline accordingly so that you can run it with one command.
> maybe the variance of cpu/mem can be explained by the sra size
This is actually a very minor factor to me. I only made those plots since the data was there 😉 Yes, the distributions are almost bimodal because of the small and large sequences. A different set of input IDs with better representation over a large range would indeed be nice.
> would parallel-fastq-dump+pigz be faster ?
Possibly. I guess the ideal scenario would be for each process/thread to write compressed output directly in a compatible way so that everything can be stitched together. Or maybe that is how it currently happens? I also still don't know the compression level of fastq-dump --gzip, so it might be higher than pigz's default.
@wraetz (I assume from NCBI) also talked about possible differences depending on the alignment of the SRA files. Although I don't have a relevant set of IDs to test that yet.
yeah, I understand that adding more comparisons increases the load significantly; I was just wondering, really.
> Possibly. I guess the ideal scenario would be for each process/thread to write compressed output directly in a compatible way so that everything can be stitched together. Or maybe that is how it currently happens?
I'm not sure what you're getting at, but each fastq-dump --gzip process writes a separate gzip file concurrently; all parallel-fastq-dump does is cat the compressed files in the correct order at the end. Yes, it works as expected.
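This works because of a property of the gzip format itself: concatenating complete gzip streams yields a valid gzip file, so independently compressed chunks can be stitched with plain cat. A minimal self-contained demonstration (toy data, not real fastq):

```shell
#!/bin/sh
# Concatenated gzip members decompress back-to-back as one stream.
set -e
tmp="$(mktemp -d)"
printf 'READ1\n' | gzip > "$tmp/part1.fastq.gz"
printf 'READ2\n' | gzip > "$tmp/part2.fastq.gz"

# "Stitching" the pieces is just cat, in the right order.
cat "$tmp/part1.fastq.gz" "$tmp/part2.fastq.gz" > "$tmp/combined.fastq.gz"

zcat "$tmp/combined.fastq.gz"   # prints READ1 then READ2
```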
What I was suggesting, though, is to test your hypothesis that fasterq-dump + pigz is faster because of pigz: you could run parallel-fastq-dump uncompressed + pigz.
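That control experiment could be sketched like this; again a dry run, with a placeholder accession and output file names that assume --split-files naming:

```shell
#!/bin/sh
# Dry-run sketch: isolate pigz's contribution by dumping uncompressed with
# parallel-fastq-dump and compressing afterwards, mirroring fasterq-dump+pigz.
ACC=SRR000001   # placeholder accession
THREADS=4
run() { echo "+ $*"; }   # swap `echo "+ $*"` for `"$@"` to execute

run parallel-fastq-dump --sra-id "$ACC" --threads "$THREADS" --split-files --outdir raw/
# pigz compresses in parallel; -p sets its thread count (default level is 6).
run pigz -p "$THREADS" "raw/${ACC}_1.fastq" "raw/${ACC}_2.fastq"
```

If this pipeline matches fasterq-dump + pigz but beats parallel-fastq-dump --gzip, the difference is in the compression step rather than the dumping.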