dorado correct produces empty file using PAF file
I ran dorado correct using a PAF file. I created the PAF file with the following command, based on the parameters described here.
/mnt/bin/minimap2/minimap2-2.28/bin/minimap2 -x ava-ont -k25 -w17 -I80G -r150,2000 --min-chain-score 4000 -z200,200 --cs=short --dual=yes --eqx -e 200 Reads.fq.gz Reads.fq.gz > overlaps.80G.paf
Then I ran the dorado correct command:
/mnt/bin/dorado/dorado-0.8.3/bin/dorado correct -i 80G -m herro-v1 Reads.fq.gz --from-paf overlaps.80G.paf > corrected_reads.fasta
The output FASTA file is empty. Here is the log:
[2025-03-27 11:08:45.992] [info] Running: "correct" "-i" "80G" "-m" "herro-v1" "Reads.fq.gz" "--from-paf" "overlaps.80G.paf"
[2025-03-27 11:08:46.369] [info] Using batch size 28 on device cuda:0 in inference thread 0.
[2025-03-27 11:08:46.370] [info] Using batch size 28 on device cuda:0 in inference thread 1.
[2025-03-27 11:08:46.372] [info] Starting
[2025-03-27 16:04:37.990] [info] Finished
Hi @arslan9732,
How big is the PAF file you generated?
One thing I noticed in your workflow above is that you're not sorting the PAF file by target name. Dorado Correct expects the overlaps to be grouped by target (it loads all overlaps for a target read in order to correct it), while minimap2 outputs them grouped by query name. There is a more detailed list of requirements for the PAF file, where this is mentioned: https://github.com/nanoporetech/dorado/issues/851#issuecomment-2397935548
Try sorting the PAF file by target name and then rerunning dorado correct.
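In a PAF file the target name is column 6, so a plain GNU sort on that column is enough to group the overlaps. Something along these lines should work (the -S, --parallel and -T values are placeholders to adjust to your machine and available scratch space):
sort -k6,6 -S64G --parallel=16 -T /path/to/scratch overlaps.80G.paf > overlaps.80G.sorted.paf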
Additionally, the -i 80G option will not be used when the overlap step is skipped in correct, so you can omit it.
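With the sorted PAF, the corrected invocation would then look roughly like this (reusing your paths from above):
/mnt/bin/dorado/dorado-0.8.3/bin/dorado correct -m herro-v1 Reads.fq.gz --from-paf overlaps.80G.sorted.paf > corrected_reads.fasta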
I see that the -c option was missing from my minimap2 command. After adding -c, it is working, but very slowly.
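For reference, the regenerated PAF came from roughly the original command with -c added, so that minimap2 writes CIGAR strings into the PAF output:
/mnt/bin/minimap2/minimap2-2.28/bin/minimap2 -x ava-ont -k25 -w17 -I80G -r150,2000 --min-chain-score 4000 -z200,200 -c --cs=short --dual=yes --eqx -e 200 Reads.fq.gz Reads.fq.gz > overlaps.80G.paf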
The gzipped FASTQ file is 64 GB (containing around 2.88 M reads), and the PAF file created from it is 5.9 TB. The inference step has been running for 3 days, but it has only produced 1.6 GB of corrected reads.
How can I run it faster?
Glad to hear you figured it out!
You can try dorado v0.9.5, which was just released; it has significant performance improvements for both the inference and overlapping steps.
If you run inference only from your PAF file, you should hopefully see around a 30% faster runtime.
The overlap stage itself is now 2x faster, but it is still a bottleneck, so if you run the entire workflow in one go the GPU will be underutilized (you may want to just stick with your precomputed PAF file).
Other than that, you can only add more GPU devices, up to the point where you hit the I/O bottleneck 🙁 (All remaining time is spent in inference, so not much can be done about it at the moment.)
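If you do add GPUs, and assuming your dorado build exposes the device-selection option for correct (check dorado correct --help), the invocation would be along these lines:
dorado correct -m herro-v1 Reads.fq.gz --from-paf overlaps.80G.sorted.paf --device cuda:0,1 > corrected_reads.fasta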
I'm closing this issue because it appears to be resolved and is becoming stale. Please feel free to reopen it if the issue persists.