memory (RAM) usage for dorado correct
Issue Report
Please describe the issue:
I am trying to use dorado correct, but I never get a successful run due to RAM shortage. The process starts and, with the available resources (see below), delivers 436 MB of output before being OOM-killed:
289G Nov 14 14:03 CornBorer_reads.fastq
684M Nov 15 04:05 CornBorer_reads.fastq.fai
436M Nov 18 13:49 corrected_CornBorer_reads.fastq
This (289 GB) is genomic data.
Steps to reproduce the issue:
#SBATCH -c 24
#SBATCH --gres=gpu:2
#SBATCH --time=72:0:0
#SBATCH --mem=420G
$bin/dorado-0.8.3-linux-x64/bin/dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ > corrected_${FQ}
Run environment:
- Dorado version: 0.8.3
- Dorado command: dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ
- Operating system: RHEL 8
- Hardware (CPUs, Memory, GPUs): 2x 24-core AMD EPYC 7413 (Milan @ 2.2 GHz); 500 GB RAM; 4x NVIDIA Ampere A100 GPUs (80 GB GPU memory)
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): fastq
- Source data location (on device or networked drive - NFS, etc.): Network share (HDR InfiniBand)
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Logs
[vsc40014@gligar08 CORN]$ head -n50 dorado_correct_15414825.err
[2024-11-18 09:43:11.218] [info] Running: "correct" "--verbose" "--threads" "24" "--index-size" "4G" "--batch-size" "16" "--device" "cuda:all" "CornBorer_reads.fastq"
[2024-11-18 09:43:11.550] [debug] Aligner threads 24, corrector threads 6, writer threads 1
[2024-11-18 09:43:11.561] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-11-18 09:43:11.564] [info] - downloading herro-v1 with httplib
[2024-11-18 09:43:11.640] [error] Failed to download herro-v1: SSL server verification failed
[2024-11-18 09:43:11.640] [info] - downloading herro-v1 with curl
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 22.3M 100 22.3M 0 0 54.8M 0 --:--:-- --:--:-- --:--:-- 54.8M
[2024-11-18 09:43:12.217] [debug] furthest_skip_header = '', furthest_skip_id = -1
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:0 in inference thread 0.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:0 in inference thread 1.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:1 in inference thread 0.
[2024-11-18 09:43:12.348] [info] Using batch size 16 on device cuda:1 in inference thread 1.
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:0!
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:0!
[2024-11-18 09:43:12.349] [debug] Starting process thread for cuda:1!
[2024-11-18 09:43:12.350] [debug] Starting process thread for cuda:1!
[2024-11-18 09:43:12.350] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Looking for idx CornBorer_reads.fastq.fai
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.351] [debug] Starting decode thread!
[2024-11-18 09:43:12.352] [debug] Initialized index options.
[2024-11-18 09:43:12.352] [debug] Loading index...
[2024-11-18 09:43:12.733] [debug] Loading model on cuda:1...
[2024-11-18 09:43:12.733] [debug] Loading model on cuda:1...
[2024-11-18 09:43:12.744] [debug] Loading model on cuda:0...
[2024-11-18 09:43:12.744] [debug] Loading model on cuda:0...
[2024-11-18 09:43:12.996] [debug] Loaded model on cuda:0!
[2024-11-18 09:43:12.996] [debug] Loaded model on cuda:1!
[2024-11-18 09:43:12.997] [debug] Loaded model on cuda:1!
[2024-11-18 09:43:12.997] [debug] Loaded model on cuda:0!
[2024-11-18 09:43:54.665] [debug] Loaded index with 240571 target seqs
[2024-11-18 09:43:56.613] [debug] Loaded mm2 index.
[2024-11-18 09:43:56.614] [info] Starting
[2024-11-18 09:43:56.614] [debug] Align with index 0
[2024-11-18 09:43:57.829] [debug] Read 10000 reads
[2024-11-18 09:44:01.887] [debug] Alignments processed 10000, total m_corrected_records size 130.63971 MB
[2024-11-18 09:44:05.846] [debug] Read 20000 reads
[2024-11-18 09:44:10.193] [debug] Alignments processed 20000, total m_corrected_records size 353.4832 MB
[2024-11-18 09:44:14.394] [debug] Read 30000 reads
[2024-11-18 09:44:18.944] [debug] Alignments processed 30001, total m_corrected_records size 577.7794 MB
[2024-11-18 09:44:22.934] [debug] Read 40000 reads
[2024-11-18 09:44:27.507] [debug] Alignments processed 40001, total m_corrected_records size 796.8967 MB
[2024-11-18 09:44:31.691] [debug] Read 50000 reads
[2024-11-18 09:44:36.586] [debug] Alignments processed 50000, total m_corrected_records size 1026.6725 MB
[2024-11-18 09:44:41.385] [debug] Read 60000 reads
[2024-11-18 09:44:44.616] [debug] Alignments processed 60007, total m_corrected_records size 1217.6389 MB
[2024-11-18 09:44:47.573] [debug] Read 70000 reads
...
[2024-11-18 13:49:08.992] [debug] Alignments processed 7920000, total m_corrected_records size 165349.5 MB
[2024-11-18 13:49:14.332] [debug] Read 7930000 reads
[2024-11-18 13:49:19.343] [debug] Alignments processed 7930001, total m_corrected_records size 165589.53 MB
[2024-11-18 13:49:25.549] [debug] Read 7940000 reads
[2024-11-18 13:49:33.678] [debug] Alignments processed 7940000, total m_corrected_records size 165829.38 MB
[2024-11-18 13:49:39.586] [debug] Read 7950000 reads
[2024-11-18 13:49:45.336] [debug] Alignments processed 7950002, total m_corrected_records size 166068.16 MB
[2024-11-18 13:49:49.432] [debug] Read 7960000 reads
[2024-11-18 13:49:54.683] [debug] Alignments processed 7960003, total m_corrected_records size 166287.48 MB
[2024-11-18 13:49:59.829] [debug] Read 7970000 reads
[2024-11-18 13:50:05.495] [debug] Alignments processed 7970000, total m_corrected_records size 166535.27 MB
[2024-11-18 13:50:10.317] [debug] Read 7980000 reads
[2024-11-18 13:50:15.363] [debug] Alignments processed 7980000, total m_corrected_records size 166763.31 MB
[2024-11-18 13:50:19.987] [debug] Read 7990000 reads
[2024-11-18 13:50:25.591] [debug] Alignments processed 7990020, total m_corrected_records size 166999.25 MB
[2024-11-18 13:50:30.512] [debug] Read 8000000 reads
[2024-11-18 13:50:35.650] [debug] Alignments processed 8000000, total m_corrected_records size 167237.64 MB
[2024-11-18 13:50:40.581] [debug] Read 8010000 reads
/var/spool/slurm/slurmd/job15414825/slurm_script: line 27: 15407 Killed $bin/dorado-0.8.3-linux-x64/bin/dorado correct --verbose --threads 24 --index-size 4G --batch-size 16 --device cuda:all $FQ > corrected_${FQ}
slurmstepd: error: Detected 1 oom_kill event in StepId=15414825.batch. Some of the step tasks have been OOM Killed.
Hi @stephrom, is your input dataset of very high depth?
Best regards, Rich
Hi,
not particularly,
it is genomic data from a corn borer.
best Stephane
Hi @stephrom,
Apologies for the late response - have you managed to get around your issue in the meantime?
If the Corn Borer genome is roughly 500 Mbp in size, then your input coverage is about ~289x on average (based on the stats you provided: bases typically make up roughly half of an uncompressed fastq's bytes, so a 289 GB fastq holds ~145 Gbp, and 145 Gbp / 0.5 Gbp ≈ 289x). Even though you have a high-mem machine, it is possible that the high coverage plus repetitive regions blow up memory in some cases during the overlap step.
Here are some suggestions:
- Can you try out the new Dorado v0.9.5? It has new optimizations in both the overlap and the correction steps, so it should be less memory-hungry.
- If the issue still persists, try reducing --index-size further and see whether that limits the number of overlaps held in memory at any one time.
- Try running the overlap step (--to-paf) and the correction step (--from-paf) separately; see the sketch after this list.
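A minimal sketch of the two-stage run, using the input file from this thread; the reduced index size and the output file names are placeholder values to adapt to your machine:

```bash
# Stage 1: compute overlaps only and stream them to a PAF file, so
# alignments are not held in memory together with corrections.
dorado correct CornBorer_reads.fastq \
    --threads 24 --index-size 2G \
    --to-paf > overlaps.paf

# Stage 2: run the correction inference from the precomputed overlaps.
dorado correct CornBorer_reads.fastq \
    --threads 24 --device cuda:all \
    --from-paf overlaps.paf > corrected_CornBorer_reads.fastq
```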
In case you resolved the issue, would you mind describing how and closing this ticket?
Hi @svc-jstone,
I gave up on correcting the data with Dorado; I assembled the uncorrected reads with FLYE and corrected the assembly a posteriori with Medaka/RACON.
This genome is ~1.5 Gbp.
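For reference, a rough sketch of that alternative pipeline; the exact flags, the single Racon round, and the file names (asm/assembly.fasta, aln.paf, etc.) are illustrative assumptions, not commands taken from this thread:

```bash
# Assemble the uncorrected ONT reads with Flye.
flye --nano-raw CornBorer_reads.fastq --out-dir asm --threads 24

# One Racon polishing round: map reads back to the draft, then polish.
minimap2 -x map-ont asm/assembly.fasta CornBorer_reads.fastq > aln.paf
racon CornBorer_reads.fastq aln.paf asm/assembly.fasta > racon.fasta

# Final polishing pass with Medaka.
medaka_consensus -i CornBorer_reads.fastq -d racon.fasta -o medaka_out -t 24
```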
Just an idea: to avoid this, would it not be possible to run the correction incrementally, adding chunks of uncorrected reads to already-corrected reads to improve them further, or adding new reads coming from the added chunk, starting with the longest reads in the set?
best Stephane
> Just an idea: to avoid this, would it not be possible to run the correction incrementally, adding chunks of uncorrected reads to already-corrected reads to improve them further, or adding new reads coming from the added chunk, starting with the longest reads in the set?
Not sure if I'm following correctly based on the context of this thread, but to produce the corrections we currently need all possible overlaps with a target read. Ultimately, only the best ~30x of alignments are chosen for each window, but the 30x best reads in window W1 might not be the best ones in window W2, so we need to keep the entire pile of overlaps for an entire target read. We could potentially introduce a heuristic to cap the total number of loaded overlaps to prevent excessive memory usage in extremely repetitive regions. Once all windows for a target read are corrected, they are concatenated together. If there is a gap in coverage, the target read is broken up into two (or more) pieces.
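As a toy illustration of that selection step (not Dorado's internals): given a hypothetical tab-separated file of window_id, score, overlap_id rows for one target read, picking the best 30 per window needs the full pile of rows before any window's winners are known:

```bash
# Sort all rows by window, then by score descending, and keep the first
# 30 rows per window. Nothing can be emitted until every overlap has
# been collected -- the same reason the overlaps must stay in memory.
sort -k1,1 -k2,2nr overlaps_per_window.tsv |
    awk '++seen[$1] <= 30' > best_30x_per_window.tsv
```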
In what you described, it sounds like you're proposing an incremental (or iterative) correction where you would pick N first alignments and produce a corrected target read, then take the next N alignments and produce the correction, etc.
This could work, but the impact of this approach on accuracy can't be determined without an implementation, and implementing it would be a big effort that would make the dorado correct codebase much more complicated. At the moment we won't pursue it without data showing that it would be beneficial and would not reduce accuracy.
Closing as stale. We believe memory usage has been reduced since 0.9.5 - if you continue to see issues please reopen with additional information.