hifiasm needing too much memory
Hello,
I have a problem similar to issue #222, but in my case it is the opposite.
I am assembling a plant genome from two SMRT cells of HiFi data. The k-mer distribution shows a peak close to 90x, and I assume there is a heterozygous peak around 44x, but it is buried among the low-frequency k-mers (see the attached k-mer plot).
In total there are more than 38 Gb of data. With the homozygous peak at ~87x, the genome should be approximately 438 Mb, as a rough estimate.
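Just to make the back-of-the-envelope calculation explicit (genome size ≈ total bases / homozygous coverage), e.g. with bc:

echo "scale=1; 38000/87" | bc    # 38 Gb expressed as 38,000 Mb; prints ~436.7 (Mb)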
When I run hifiasm on both cells together:
~/bin/hifiasm_0.16.1-r375/hifiasm -t 200 -o Cgil_hifasm_l3 -l3 cell1.hifi_reads.fastq.gz cell2.hifi_reads.fastq.gz 2>Cgil_hifasm_l3_stdout
the job gets killed after using more than the 1 TB of memory that is available. This is the log file:
Cgil_hifasm_l3_stdout.txt
I ran one cell alone and it completed, though it used almost 800 GB of memory. With other plant genomes, even much larger ones, I never had this problem of running out of memory. The assembly looks very fragmented: total sizes of 835 Mb and 734 Mb, 12,000 and 9,000 contigs, and N50s of 82 kb and 100 kb for hap1 and hap2, respectively. I am now running the second cell by itself. Are there parameters that could be tweaked to reduce the memory needed?
To summarize: from the k-mer curve, it looks like I have >40x per allele (too much coverage?); could that be a reason for the high memory demand? But assembling only half of the data, I get a much larger assembly with very short contigs, so it could actually be too little coverage, which is odd. I am even wondering whether the large number of k-mers below 60x could be contamination of some sort. Can you help me figure out what is happening? Thanks, Dario
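PS: to check the contamination idea, I might screen the reads with a classifier such as Kraken2 (just a sketch, not something I have run yet; the database name and output files below are placeholders):

kraken2 --db k2_standard --threads 16 --gzip-compressed --report cgil_kraken.report --output cgil_kraken.out cell1.hifi_reads.fastq.gz cell2.hifi_reads.fastq.gz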
See the FAQ here: https://hifiasm.readthedocs.io/en/latest/faq.html#why-does-hifiasm-stuck-or-crash. And an example: https://github.com/chhylp123/hifiasm/issues/93#issuecomment-863916776. It is probably not enough coverage or contamination.
OK, but then how do I solve the memory issue? I don't think it is normal that an 800 Mb genome should need more than 1 TB of memory.
It looks like this is caused by data quality issues, such as not enough coverage or contamination. Even if the assembly completes, the produced assemblies will still be very fragmented. I do not have a general fix for data quality issues; probably getting more coverage, or finding a way to remove the contamination?
What do you mean by data quality issues?
This is what the HiFi data look like; the median QV is 32 (read-quality plot attached). What else could I do about it?
A good HiFi dataset should have a k-mer plot like the ones in issue #10 or issue #49. The k-mer plot of your dataset is very bad, i.e., there are large numbers of k-mers occurring only a few times.
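If you want to regenerate the plot independently, something along these lines should work (a sketch assuming Jellyfish and GenomeScope; adjust k, hash size, and thread count):

zcat cell1.hifi_reads.fastq.gz cell2.hifi_reads.fastq.gz | jellyfish count -C -m 21 -s 1G -t 16 -o reads.jf /dev/stdin
jellyfish histo -t 16 reads.jf > reads.histo    # upload reads.histo to GenomeScope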
Yep, I agree that the low-frequency k-mers are the issue. But if we add coverage and shift some of those k-mers to the right, then there will be even more data that a 1 TB machine cannot assemble. Do you see the conundrum? How do we get out of this? Thanks
For weird k-mer plots like yours, hifiasm cannot correctly determine the right threshold for error correction, which leads to the large memory requirement. With a nice k-mer plot, memory is not a problem, since hifiasm is able to identify the right threshold. But I'd recommend first checking why the k-mer plot is weird.
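As a possible workaround (a sketch only; check hifiasm --help for the options available in your version), you can pass the homozygous coverage explicitly so hifiasm does not have to infer it from the noisy spectrum, e.g. with --hom-cov set to the ~87x peak:

~/bin/hifiasm_0.16.1-r375/hifiasm -t 200 -o Cgil_hifasm_l3 -l3 --hom-cov 87 cell1.hifi_reads.fastq.gz cell2.hifi_reads.fastq.gz 2>Cgil_hifasm_l3_stdout

This only overrides the coverage estimate; if the low-frequency k-mers really are contamination, cleaning the reads first is still the better fix.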