Flye
How to solve the high memory usage of minimap2
Hello, author, thank you for developing this useful tool. I have successfully run it on a small genome of about 230MB and got a very good assembly. However, when I ran it on a larger genome, an error occurred.
The estimated size of the larger genome is about 5G, and I have 87GB of HiFi data to assemble it. The first step, 00-assembly/, produced a 7GB draft_assembly.fasta. The error occurred in the second step, 10-consensus.
nohup flye --pacbio-hifi ../../../00.data/hifi.fasta -o ./ --threads 20 --genome-size 5G --asm-coverage 40 --resume &>flye.log&
flye.log
[2022-07-14 05:16:37] INFO: Contained seqs: 14575
[2022-07-14 05:16:42] DEBUG: Writing FASTA
[2022-07-14 05:17:33] DEBUG: Peak RAM usage: 398 Gb ####first step
-----------End assembly log------------
[2022-07-14 05:19:09] root: DEBUG: Disjointigs length: 8299546082, N50: 1166749
[2022-07-14 05:19:09] root: INFO: >>>STAGE: consensus
[2022-07-14 05:19:09] root: INFO: Running Minimap2
ERROR: Error running minimap2, terminating. See the alignment error log for details:flye/10-consensus/minimap.stderr
[2022-07-14 05:47:01] root: ERROR: Command '['/bin/bash', '-c', "set -eo pipefail; flye-minimap2 '/flye/00-assembly/draft_assembly.fasta' 'hifi.fasta' -x map-pb -t 60 -a -p 0.5 -N 10 --sam-hit-only -L -K 5G -z 1000 -Q --secondary-seq -I 64G | flye-samtools view -T 'flye/00-assembly/draft_assembly.fasta' -u - | flye-samtools sort -T 'flye/10-consensus/sort_220714_051909' -O bam -@ 4 -l 1 -m 4G -o 'flye/10-consensus/minimap.bam'"]' returned non-zero exit status 1.
[2022-07-14 05:47:01] root: ERROR: Pipeline aborted
minimap.stderr
[samfaipath] build FASTA index...
[M::mm_idx_gen::122.049*1.42] collected minimizers
[M::mm_idx_gen::127.624*2.63] sorted minimizers
[M::main::127.630*2.63] loaded/built the index for 42078 target sequence(s)
[M::mm_mapopt_update::130.362*2.59] mid_occ = 2601
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 42078
[M::mm_idx_stat::132.196*2.57] distinct minimizers: 117092740 (22.02% are singletons); average occurrences: 9.117; average spacing: 7.775; total length: 8299546082
[M::worker_pipeline::849.494*46.22] mapped 293996 sequences
[W::sam_read1] Parse error at line 665799
[main_samview] truncated file.
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
The same problem happened. 629GB of RAM is the most memory I can borrow from my lab. I watched the memory usage climb until the error was reported. My dataset has been run successfully with hifiasm and with other software that uses minimap2.
I wonder whether there are any other minimap2 or Flye parameters I could adjust. Could you give me some suggestions?
THANKS VERY MUCH! THE BEST TO YOU !!!
Hello,
I am also getting an error when running the minimap2 portion of the pipeline. Have you resolved this issue?
@zhangwenda0518 my only suggestion would be to try fewer threads for the consensus step (your log indicates you used 60 threads, I'd try 20 or 10). If that helps and the pipeline goes through, you can then stop and resume the next steps with a larger number of threads. It is likely that your genome has a lot of duplication (or heterozygosity) resolved in the assembly, which makes it challenging for minimap2 to align.
Mikhail
@fenderglass Hi, I am using Flye for assembly of a teleost genome (~4.5Gb). The initial assembly step is working fine but it is failing at the consensus step. I have posted the error from the general Flye.log and the minimap.stderr files below. Is there a way to either:
- manually run the consensus steps (e.g. I can run flye-minimap2 '/projects/scratch/coral_omics/flye_assembly/00-assembly/draft_assembly.fasta' '/projects/scratch/coral_omics/raw_reads/all.hifi_reads.fastq' -x map-pb -t 10 -a -p 0.5 -N 10 --sam-hit-only -L -K 5G -z 1000 -Q --secondary-seq -I 64G | flye-samtools view -T '/projects/scratch/coral_omics/flye_assembly/00-assembly/draft_assembly.fasta' -u - | flye-samtools sort -T '/projects/scratch/coral_omics/flye_assembly/10-consensus/sort_220825_211310' -O bam -@ 4 -l 1 -m 4G -o '/projects/scratch/coral_omics/flye_assembly/10-consensus/minimap.bam' on its own without error if I increase the -m flag to 16G instead of 4G) Unfortunately, after generating the minimap.bam there isn't a way to resume the consensus step without it trying to repeat the mapping, which consistently fails. I am not sure what the remaining code would be to generate the consensus.fasta file if I wanted to execute all the commands in the consensus step manually and then resume Flye at the repeat step. OR
- Can the value used for the -m flag be changed in the Flye command using the --extra-params flag? It is unclear from the documentation what configuration parameters are editable in this manner or what the syntax would be.
Best, Melissa
Flye general log error
[2022-08-25 21:13:10] root: INFO: >>>STAGE: consensus
[2022-08-25 21:13:10] root: INFO: Running Minimap2
[2022-08-25 22:14:32] root: ERROR: Error running minimap2, terminating. See the alignment error log for details: /projects/scratch/coral_omics/flye_assembly/10-consensus/minimap.stderr
[2022-08-25 22:14:32] root: ERROR: Command '['/bin/bash', '-c', "set -eo pipefail; flye-minimap2 '/projects/scratch/coral_omics/flye_assembly/00-assembly/draft_assembly.fasta' '/projects/scratch/coral_omics/raw_reads/all.hifi_reads.fastq' -x map-pb -t 10 -a -p 0.5 -N 10 --sam-hit-only -L -K 5G -z 1000 -Q --secondary-seq -I 64G | flye-samtools view -T '/projects/scratch/coral_omics/flye_assembly/00-assembly/draft_assembly.fasta' -u - | flye-samtools sort -T '/projects/scratch/coral_omics/flye_assembly/10-consensus/sort_220825_211310' -O bam -@ 4 -l 1 -m 4G -o '/projects/scratch/coral_omics/flye_assembly/10-consensus/minimap.bam'"]' returned non-zero exit status 1.
[2022-08-25 22:14:32] root: ERROR: Pipeline aborted
Consensus minimap2 error (from minimap.stderr)
[samfaipath] build FASTA index...
[M::mm_idx_gen::94.764*1.65] collected minimizers
[M::mm_idx_gen::101.534*2.19] sorted minimizers
[M::main::101.538*2.19] loaded/built the index for 19842 target sequence(s)
[M::mm_mapopt_update::103.923*2.16] mid_occ = 1374
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 19842
[M::mm_idx_stat::104.893*2.15] distinct minimizers: 84465886 (16.45% are singletons); average occurrences: 7.062; average spacing: 7.513; total length: 4481605804
samtools sort: couldn't allocate memory for bam_mem
@mxd1288 please submit the full flye.log file. Also, are you saying that manually running the minimap2 command line worked for you? It seemed to run out of memory originally, so maybe your machine had less memory available at the time of the Flye run. In this case, retrying from the consensus step should work.
@fenderglass Sorry, the full flye.log is now attached. I should have also clarified I am running Flye on my university HPC cluster, which uses an LSF scheduler. When I run Flye the attached log is produced with an error at the consensus step. flye.log
Manually running the minimap2 command worked in a script submitted to the LSF scheduler with the -m flag edited as such: flye-minimap2 draft_assembly.fasta all.hifi_reads.fastq -x map-pb -t 10 -a -p 0.5 -N 10 --sam-hit-only -L -K 5G -z 1000 -Q --secondary-seq -I 64G | flye-samtools view -T draft_assembly.fasta -u - | flye-samtools sort -T sort_220819_235712 -O bam -@ 4 -l 1 -m 16G -o minimap.bam
However, when I try to resume Flye from consensus it restarts from the beginning of consensus rather than starting from the manually produced minimap.bam file. The LSF scheduler also produces a log file, which has a positive delta memory, so the problem is not that I do not have enough total memory. The problem seems to be that the LSF scheduler and/or the default flye parameters for minimap2 do not allocate enough memory to the samtools sort step.
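For reference, samtools sort applies its -m limit per thread, so the sort buffers in the pipeline above can be estimated as the -@ value times the -m value. A minimal sketch of that arithmetic, assuming whole-gigabyte values; the function name sort_buffer_gb is made up for illustration and is not part of Flye or samtools:

```shell
#!/usr/bin/env bash
# Rough estimate of the in-memory buffers samtools sort may allocate.
# samtools sort -m is a per-thread limit, so the total scales with -@.
# sort_buffer_gb is a hypothetical helper, not part of Flye or samtools.
sort_buffer_gb() {
    local threads=$1        # value passed to samtools sort -@
    local per_thread_gb=$2  # value passed to samtools sort -m, in GB
    echo $(( threads * per_thread_gb ))
}

echo "Logged command (-@ 4 -m 4G):  $(sort_buffer_gb 4 4) GB"
echo "Manual run     (-@ 4 -m 16G): $(sort_buffer_gb 4 16) GB"
```

Whatever memory is requested from the scheduler has to cover these buffers on top of what minimap2 itself needs, which this sketch does not estimate.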
Sorry for my late response!
- How much RAM does the machine you are using to run minimap2 manually have? It is a bit strange that you can process it manually with an even higher buffer size for samtools, but the scheduler run fails. The memory measurements of scheduled runs may not always be accurate; sometimes they can be an underestimate.
- The consensus step consists of two parts: minimap2 alignment and consensus calling itself. Perhaps you could run the entire stage locally by adding --resume-from consensus --stop-after consensus, and then run the scheduler job with --resume-from repeat.
- Another possibility is that the LSF run is crashing because of the extra samtools threads. Maybe you can try increasing the number of requested CPUs (by 6) while keeping the same -t Flye argument.
- If you can edit the Flye code, you can change SORT_THREADS to 4 and SORT_MEM to 500M here: https://github.com/fenderglass/Flye/blob/39eb9acff398abf48c33d02ce6bfd6d6af81f8f1/flye/polishing/alignment.py#L242
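The second and third suggestions can be sketched as a small dry-run script that only prints the commands. The reads path, output directory, and thread count are placeholders, and the helper names flye_cmd and cpus_to_request are made up for illustration; only --resume-from, --stop-after, and the "plus 6 CPUs" figure come from the suggestions above.

```shell
#!/usr/bin/env bash
# Dry run: prints the commands instead of executing them.
# READS, OUT and THREADS are placeholder values; adjust them to your run
# and append any other options from your original flye command.
READS="hifi.fasta"
OUT="flye_out"
THREADS=10

# Base Flye invocation shared by both phases (hypothetical helper).
flye_cmd() {
    echo "flye --pacbio-hifi $READS -o $OUT --threads $THREADS $*"
}

# CPUs to request from the scheduler: Flye's -t value plus 6 extra for
# the samtools processes in the alignment pipeline, as suggested above.
cpus_to_request() {
    echo $(( THREADS + 6 ))
}

echo "# 1) run only the consensus stage"
flye_cmd --resume-from consensus --stop-after consensus
echo "# 2) continue from the repeat stage"
flye_cmd --resume-from repeat
echo "# CPUs to request for -t $THREADS: $(cpus_to_request)"
```

Keeping the two phases as separate commands makes it possible to run the first on one machine and submit the second as a scheduler job, provided both see the same output directory.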
Sorry for the trouble, hope this helps, and I can incorporate the changes into the new Flye versions.
@fenderglass I ran into the same problem, and I found that flye-minimap2 doesn't have -x map-hifi. Even when I use the command flye --pacbio-hifi ccs.fastq.gz -o ./ -g 7.9g -t 50 --resume-from consensus, flye-minimap2 still uses the parameter -x map-pb and consumes a large amount of memory. So I tried using minimap2 instead of flye-minimap2 to generate the minimap.bam file. The command is: minimap2 -x map-hifi -t 60 -a -p 0.5 -N 10 --sam-hit-only -L -K 5G -z 1000 -Q --secondary-seq -I 64G ./00-assembly/draft_assembly.fasta WN_YCY.ccs.fastq.gz | samtools sort -T -O bam -@ 60 -l 1 -m 4G -o ./10-consensus/minimap.bam
But I don't know whether this alternative (using minimap2 instead of flye-minimap2) is correct. And if the method is appropriate, how can I generate consensus.fa from minimap.bam using Flye?
THANKS VERY MUCH! THE BEST TO YOU !!!
Flye uses a specialized version of minimap2. The current version was forked from an earlier release of minimap2 and does not support map-hifi. We're planning to update the minimap2 base in the next release.