
Memory consumption

Open diego-rt opened this issue 8 months ago • 9 comments

Hi again,

Apologies for opening a new issue, but this one is on a different topic.

We are trying to assemble a giant genome of ~30 Gb using 70x HiFi coverage, and peak memory consumption reached roughly 2 TB. These are the flags we used:

hifiasm -o Assembly.asm -t 176 -l 3 -f 40 -D 5 -k 63 -w 63 Revio.fq.gz

These are the resources used:

[M::main] Version: 0.19.6-r595

[M::main] Real time: 810896.293 sec; CPU: 67823081.546 sec; Peak RSS: 1837.470 GB
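
As a quick sanity check on the figures in the log line above, the following sketch converts them into wall-clock days, the average number of busy threads, and the RAM headroom left on the node. The variable names are illustrative only; the 1900 GB node size is the figure quoted in question 4 further down, and treating it as the same unit as the reported peak RSS is an assumption.

```python
# Values copied from the "[M::main] Real time ..." line and the command line.
real_sec = 810896.293      # wall-clock time in seconds
cpu_sec = 67823081.546     # total CPU time in seconds
peak_rss_gb = 1837.470     # peak resident set size as reported
threads = 176              # from the -t flag
node_ram_gb = 1900         # node size quoted in question 4 (assumed same unit as RSS)

wall_days = real_sec / 86400              # seconds per day
avg_busy_threads = cpu_sec / real_sec     # CPU time per unit of wall time
thread_utilisation = avg_busy_threads / threads
headroom_gb = node_ram_gb - peak_rss_gb

print(f"wall time:          {wall_days:.2f} days")
print(f"average busy cores: {avg_busy_threads:.1f} of {threads} "
      f"({thread_utilisation:.0%})")
print(f"RAM headroom:       {headroom_gb:.1f} GB")
```

On these numbers the run sits within a few percent of the node's memory limit, which is why any further increase (higher '-D', extra reads) is a concern.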

Notably, memory use in the first two error correction rounds is reasonable, but it nearly doubles in the last one. Is this expected?

[M::ha_assemble::232705.176*[email protected]] ==> corrected reads for round 1
[M::ha_assemble] # bases: 1987058683818; # corrected bases: 3830287191; # recorrected bases: 3478967
[M::ha_assemble] size of buffer: 27.072GB

[M::ha_assemble::347940.855*[email protected]] ==> corrected reads for round 2
[M::ha_assemble] # bases: 1987230338200; # corrected bases: 107238566; # recorrected bases: 460786
[M::ha_assemble] size of buffer: 24.663GB

[M::ha_assemble::655301.727*[email protected]] ==> corrected reads for round 3
[M::ha_assemble] # bases: 1987234212699; # corrected bases: 6497264; # recorrected bases: 541425
[M::ha_assemble] size of buffer: 23.630GB

[M::ha_pt_gen::685751.842*92.90] ==> indexed 45341859427 positions, counted 695474658 distinct minimizer k-mers
[M::ha_assemble::706174.286*[email protected]] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 13092338655
[M::ha_print_ovlp_stat] # strong overlaps: 4166602012
[M::ha_print_ovlp_stat] # weak overlaps: 8925736643
[M::ha_print_ovlp_stat] # exact overlaps: 12818368372
[M::ha_print_ovlp_stat] # inexact overlaps: 273970283
[M::ha_print_ovlp_stat] # overlaps without large indels: 13072073906
[M::ha_print_ovlp_stat] # reverse overlaps: 3181263971
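
The counts in the log above can be cross-checked with a few lines of Python. This is a rough sketch, not part of hifiasm: it parses a copy of the relevant lines with regular expressions, estimates coverage under the approximate ~30 Gb genome size (an assumption taken from the description above), and checks how the overlap categories add up in this particular log. It makes no claim about how hifiasm defines those categories internally.

```python
import re

# Lines copied from the log above (only those needed for the arithmetic).
log = """\
[M::ha_assemble] # bases: 1987058683818; # corrected bases: 3830287191; # recorrected bases: 3478967
[M::ha_assemble] # bases: 1987230338200; # corrected bases: 107238566; # recorrected bases: 460786
[M::ha_assemble] # bases: 1987234212699; # corrected bases: 6497264; # recorrected bases: 541425
[M::ha_print_ovlp_stat] # overlaps: 13092338655
[M::ha_print_ovlp_stat] # strong overlaps: 4166602012
[M::ha_print_ovlp_stat] # weak overlaps: 8925736643
[M::ha_print_ovlp_stat] # exact overlaps: 12818368372
[M::ha_print_ovlp_stat] # inexact overlaps: 273970283
"""

GENOME_SIZE = 30e9  # "~30 Gb" from the post; an approximate figure

# One (total bases, corrected bases) pair per correction round.
rounds = [(int(b), int(c)) for b, c in
          re.findall(r"# bases: (\d+); # corrected bases: (\d+)", log)]
for i, (bases, corrected) in enumerate(rounds, start=1):
    print(f"round {i}: ~{bases / GENOME_SIZE:.1f}x coverage, "
          f"{corrected / bases:.4%} of bases corrected")

# Overlap statistics keyed by their label in the log.
stats = {name: int(value) for name, value in
         re.findall(r"\[M::ha_print_ovlp_stat\] # ([a-z ]+): (\d+)", log)}
total = stats["overlaps"]
print("strong + weak == total:  ",
      stats["strong overlaps"] + stats["weak overlaps"] == total)
print("exact + inexact == total:",
      stats["exact overlaps"] + stats["inexact overlaps"] == total)
print(f"weak fraction: {stats['weak overlaps'] / total:.1%}")
```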

Some questions, listed as points for brevity:

  1. I understand that -k and -w raise memory consumption, but given the speed boost and the theoretical assembly benefits I would rather keep using them. Is there anything else I could tweak to reduce memory consumption without sacrificing assembly quality?
  2. Do you think my bloom filter is too large? Should I reduce it? What is the proper way to estimate the -f parameter?
  3. What does "size of buffer" mean in the log? Is this the size of the bloom filter?
  4. Memory consumption seems to double from round 2 to round 3. Is there anything that could be deallocated at that point but currently is not? Sorry for the annoying question, but our node only has 1900 GB of RAM, and the memory limit is getting in the way of experimenting with higher '-D' values or adding additional reads (i.e. duplex reads) to the mix.
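
For question 2, a back-of-the-envelope estimate of the bloom filter footprint may help frame the discussion. The sketch below assumes that -f gives the base-2 logarithm of the number of bits in the filter (so the default of 37 corresponds to 16 GiB) and that a single filter of that size is allocated. Both points are assumptions about hifiasm's behaviour rather than something confirmed in this thread, and the helper name `bloom_filter_gib` is made up for illustration.

```python
def bloom_filter_gib(f_bits: int) -> float:
    """Estimated bloom filter size in GiB for a given -f value.

    Assumes the filter holds 2**f_bits bits and that -f 0 disables it.
    """
    if f_bits == 0:
        return 0.0
    total_bits = 1 << f_bits
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB


for f in (0, 37, 38, 39, 40):
    print(f"-f {f:>2}: {bloom_filter_gib(f):6.1f} GiB")
```

Under these assumptions each increment of -f doubles the filter, so -f 40 would cost about 128 GiB, which is a modest share of the ~1.8 TB peak rather than its main driver.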

Thanks a lot once again!

diego-rt, Oct 23 '23 15:10