dorado correct produces an empty file when using a PAF file

Open arslan9732 opened this issue 8 months ago • 3 comments

I ran dorado correct using a PAF file. I created the PAF file with the following command, based on the parameters described here.

/mnt/bin/minimap2/minimap2-2.28/bin/minimap2 -x ava-ont -k25 -w17 -I80G -r150,2000 --min-chain-score 4000 -z200,200 --cs=short --dual=yes --eqx -e 200 Reads.fq.gz Reads.fq.gz > overlaps.80G.paf

Then I ran the dorado correct command:

/mnt/bin/dorado/dorado-0.8.3/bin/dorado correct -i 80G -m herro-v1 Reads.fq.gz --from-paf overlaps.80G.paf > corrected_reads.fasta

The output fasta file is empty. Here is the log file:

[2025-03-27 11:08:45.992] [info] Running: "correct" "-i" "80G" "-m" "herro-v1" "Reads.fq.gz" "--from-paf" "overlaps.80G.paf"
[2025-03-27 11:08:46.369] [info] Using batch size 28 on device cuda:0 in inference thread 0.
[2025-03-27 11:08:46.370] [info] Using batch size 28 on device cuda:0 in inference thread 1.
[2025-03-27 11:08:46.372] [info] Starting
[2025-03-27 16:04:37.990] [info] Finished

arslan9732 avatar Mar 27 '25 15:03 arslan9732

Hi @arslan9732,

How big is the PAF file you generated?

One thing I noticed in your workflow from above is that you're not sorting the PAF file by target name. Dorado Correct expects the overlaps to be grouped by the target for correction (as it loads all overlaps per target read to correct it), while Minimap2 outputs them grouped by query name. Here is a more detailed list of requirements for the PAF file where this is mentioned. https://github.com/nanoporetech/dorado/issues/851#issuecomment-2397935548
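
To check whether an existing PAF file already meets the grouping requirement, a rough sketch like the following could be used. It assumes a standard tab-separated PAF where column 6 is the target name; the function name `paf_is_grouped_by_target` is made up for illustration and is not part of dorado or minimap2. It only tests that each target name appears in one contiguous run, not any of the other requirements in the linked list.

```shell
# Hypothetical helper: succeeds (exit 0) if every target name in column 6
# appears in a single contiguous run of lines, fails (exit 1) otherwise.
paf_is_grouped_by_target() {
    awk -F '\t' '
        $6 != prev {
            # A target seen earlier is starting a second run: not grouped.
            if ($6 in seen) { bad = 1; exit }
            seen[$6] = 1
            prev = $6
        }
        END { exit bad }
    ' "$1"
}

# Example usage:
# paf_is_grouped_by_target overlaps.80G.paf && echo "grouped by target"
```

This reads the whole file once and keeps one entry per distinct target name in memory, so on a very large PAF it will take a while.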

Try sorting the PAF file by target name and then rerunning dorado correct.
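
A minimal sketch of the sorting step, assuming a standard tab-separated PAF where column 6 is the target name. The function name `sort_paf_by_target` and the output file name in the example are placeholders, not anything dorado requires.

```shell
# Hypothetical helper: sort a PAF file by target name (column 6) so that
# all overlaps for a given target end up on consecutive lines.
sort_paf_by_target() {
    in_paf="$1"
    out_paf="$2"
    # LC_ALL=C: plain byte-wise comparison, independent of the locale.
    # -t: use a tab as the field separator.
    # -k6,6: sort on the target name column only.
    # -s: stable sort, keeps the original order of lines within each target.
    LC_ALL=C sort -t "$(printf '\t')" -k6,6 -s "$in_paf" > "$out_paf"
}

# Example usage (output name is a placeholder):
# sort_paf_by_target overlaps.80G.paf overlaps.80G.sorted.paf
```

For a very large PAF file, GNU sort spills to temporary files, so it may be worth pointing it at a disk with enough free space using `-T` and giving it a larger buffer using `-S`.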

Additionally, the -i 80G option is not used when the overlap step is skipped in correct, so you can omit it.

svc-jstone avatar Apr 03 '25 07:04 svc-jstone

I see that -c was missing from my command. After adding -c, it works, but very slowly. The gzipped FASTQ file is 64 GB (around 2.88 M reads), and the PAF file it produced is 5.9 TB. The inference step has been running for 3 days but has produced only 1.6 GB of corrected reads. How can I make it run faster?

arslan9732 avatar Apr 03 '25 08:04 arslan9732

Glad to hear you figured it out!

You can try dorado v0.9.5, which was just released. It has significant performance improvements for both the inference and overlap steps. If you run inference only from your PAF file, you should (hopefully) see a 30% faster runtime. The overlap stage itself is now 2x faster, but it is still a bottleneck, so if you run the entire workflow in one go the GPU will be underutilized (so you may want to stick with your precomputed PAF file).

Other than this, the only option is to add more GPU devices, until you hit the IO bottleneck 🙁 (All other time is spent in inference, so not much can be done about it at the moment.)

svc-jstone avatar Apr 03 '25 10:04 svc-jstone

I'm closing this issue because it appears to be resolved and is becoming stale. Please feel free to reopen it if the issue persists.

svc-jstone avatar Jun 09 '25 15:06 svc-jstone