hifiasm
Feature request: Separate hifiasm into stages
Hi,
Is it possible to separate hifiasm into stages (e.g. separating the read-error correction step and the phased string graph generation step)?
The application that initially led us to ask for this functionality is when we want to have both the diploid assembly and the alternative contigs for some investigation.
Thank you! Steve
I am also interested in this. I looked at doing it by modifying the source code, and while I succeeded, it was quite challenging and the solution I came up with was a bit hacky.
You can easily rerun with the bin files to get a primary/alternative, dual, or trio/Hi-C assembly, as long as you use the same prefix.
Oh, that's good to know, @baozg.
Just to confirm, hifiasm will automatically "resume" the work if it detects bin files matching the provided prefix?
Yes, hifiasm will reuse all the bin files if they exist. But be careful if they were generated by a different version of hifiasm.
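For anyone who wants to check up front whether a rerun will pick up existing bin files, here is a minimal sketch. It assumes the bin files are named `<prefix>.ec.bin`, `<prefix>.ovlp.source.bin` and `<prefix>.ovlp.reverse.bin`; please confirm against the files your hifiasm version actually writes. The function name `have_bins` is made up for this example.

```shell
#!/bin/sh
# Sketch: succeed only if all expected bin files exist (and are non-empty)
# for the given output prefix.
# ASSUMPTION: the file suffixes below match your hifiasm version's output.
# have_bins is a hypothetical helper name, not part of hifiasm.
have_bins() {
    prefix=$1
    for suffix in ec.bin ovlp.source.bin ovlp.reverse.bin; do
        # -s is true only for files that exist and have a size > 0
        [ -s "${prefix}.${suffix}" ] || return 1
    done
    return 0
}

# Example use (commented out so that sourcing this file has no side effects):
#   if have_bins my_prefix; then echo "bin files found"; fi
```

This only checks that the files are present. It cannot tell whether they came from a different hifiasm version, which is the case to be careful about.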
Awesome! I'll test run with our samples and report back.
Thank you @baozg !
Hello @vellamike @SHuang-Broad @baozg, sorry for the late reply; I was too busy during the last few weeks. Actually, the '--bin-only' option might work. For example, if you would like to run hifiasm (Hi-C) in one step, the command line should be as follows:
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq
With ‘--bin-only’, the whole assembly procedure could be separated into two steps:
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq --bin-only HiFi.fq   # step 1: hifiasm only produces the bin files (error correction)
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq   # step 2: hifiasm reuses the bin files
Basically, with '--bin-only', hifiasm stops as soon as the bin files have been generated.
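The two steps can be wrapped in a small driver so that each stage can be launched separately, for example on different machines with different thread counts. This is only a sketch: `run`, `stage_bins`, `stage_assemble`, `HIFIASM` and `DRY_RUN` are names invented for this example, and the only hifiasm options used are the ones shown in this thread.

```shell
#!/bin/sh
# Sketch of a two-stage driver. All function and variable names here are
# hypothetical; only -t, --h1, --h2 and --bin-only come from the thread.
HIFIASM=${HIFIASM:-hifiasm}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stage 1: produce the bin files only.
# Arguments: threads, Hi-C read 1, Hi-C read 2, HiFi reads
stage_bins() {
    run "$HIFIASM" -t "$1" --h1 "$2" --h2 "$3" --bin-only "$4"
}

# Stage 2: the one-step command; it is expected to reuse the bin files.
# Arguments: threads, Hi-C read 1, Hi-C read 2, HiFi reads
stage_assemble() {
    run "$HIFIASM" -t "$1" --h1 "$2" --h2 "$3" "$4"
}
```

With `DRY_RUN=1`, calling `stage_bins 48 HiC_r1.fq HiC_r2.fq HiFi.fq` just prints the command it would run. Both stages need to run in the same working directory with the same output prefix, otherwise the second stage will not find the bin files.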
Thank you, @chhylp123 !
Following your suggestion, I ran a few experiments and it works as expected!
I've attached a few plots here demonstrating how CPU, memory and disk space are used throughout the process. Hopefully this is useful. For bin generation, I used 42 cores. For the actual assembly steps, I used 28 cores.
Btw, this --bin-only flag isn't documented anywhere, but I believe it should be. Here's the reason: you can see from the monitoring plots that the bin-generation stage is the main "bottleneck". It needs the most resources and lasts 16 hours. The assembly steps use just a few threads most of the time (~2 hours) and don't need as much memory either. For those of us who do computations in the cloud, we can reduce costs by using non-spot VMs for the bin-generation stage and switching over to spot VMs configured with fewer resources for the assembly steps.
Again, thank you for the suggestion!
Attachments: AltModeUsingBinFiles.HighCoverage.monitoring.log.pdf, BinGeneration.HighCoverage.monitoring.log.pdf, HapModeUsingBinFiles.HighCoverage.monitoring.log.pdf
Steve