hifiasm
Feature request: Separate hifiasm into stages
Hi,
Is it possible to separate hifiasm into stages (e.g. separating the read-error correction step and the phased string graph generation step)?
What initially led us to ask for this is a case where we want both the diploid assembly and the alternative contigs for some investigation.
Thank you! Steve
I am also interested in this. I tried doing it by modifying the source code, and while I succeeded, it was quite challenging and the solution I came up with was a bit hacky.
You can easily rerun with the bin files to get a primary/alternative, dual, or trio/Hi-C assembly, as long as you use the same prefix.
Oh, that's good to know, @baozg.
Just to confirm, hifiasm will automatically "resume" the work if it detects bin files matching the provided prefix?
Yes, hifiasm will reuse all the bin files if they exist. But be careful if they were generated by a different version of hifiasm.
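One way to guard against the version caveat is to record which hifiasm version produced the bin files and check it before rerunning. The sketch below is only an illustration: the `.hifiasm_version` side file and the function names `stamp_version` and `same_version` are made up here, and hifiasm itself does not write or read such a file. It also assumes your build prints its version with `hifiasm --version`.

```shell
# Hypothetical helpers: remember which hifiasm version produced the bin files
# for a given -o prefix. The "<prefix>.hifiasm_version" file is our own
# convention, not something hifiasm creates.

stamp_version() {      # args: prefix version_string
    printf '%s\n' "$2" > "$1.hifiasm_version"
}

same_version() {       # args: prefix version_string; succeeds only on a match
    [ -f "$1.hifiasm_version" ] && [ "$(cat "$1.hifiasm_version")" = "$2" ]
}

# Example use (assumes `hifiasm --version` is available in your build):
#   stamp_version sample.asm "$(hifiasm --version)"     # after the first run
#   same_version  sample.asm "$(hifiasm --version)" \
#       || echo "bin files came from a different hifiasm version" >&2
```

If the check fails, the safe choice is probably to delete the old bin files and regenerate them with the current version.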
Awesome! I'll test run with our samples and report back.
Thank you @baozg !
Hello @vellamike @SHuang-Broad @baozg, sorry for the late reply; I have been too busy over the last few weeks. Actually, '--bin-only' might work. For example, if you would like to run hifiasm (Hi-C) in one step, the command line should be as follows:
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq
With ‘--bin-only’, the whole assembly procedure could be separated into two steps:
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq --bin-only HiFi.fq   # hifiasm will only produce the bin files (error correction)
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq   # hifiasm will reuse the bin files
Basically, with '--bin-only', hifiasm stops as soon as the bin files have been generated.
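The two steps can be wrapped in a small script. This is a minimal sketch, assuming the second step is the same command without '--bin-only' and that both steps share the same `-o` prefix. The function names `stage1_bins` and `stage2_assemble` and the `HIFIASM` variable are made up for illustration.

```shell
# Hypothetical wrapper around the two-step run described above.
# HIFIASM can be overridden, e.g. HIFIASM=echo prints the arguments
# instead of running the assembler (a simple dry run).
HIFIASM="${HIFIASM:-hifiasm}"

# Step 1: error correction and overlap calculation only; writes the bin files.
stage1_bins() {        # args: prefix threads hic_r1 hic_r2 hifi_reads
    "$HIFIASM" -o "$1" -t"$2" --h1 "$3" --h2 "$4" --bin-only "$5"
}

# Step 2: same prefix, no --bin-only, so the existing bin files are reused.
stage2_assemble() {    # args: prefix threads hic_r1 hic_r2 hifi_reads
    "$HIFIASM" -o "$1" -t"$2" --h1 "$3" --h2 "$4" "$5"
}

# Example use:
#   stage1_bins     sample.asm 48 HiC_r1.fq HiC_r2.fq HiFi.fq
#   stage2_assemble sample.asm 48 HiC_r1.fq HiC_r2.fq HiFi.fq
```

Keeping the prefix as an explicit first argument makes it harder to accidentally change it between the two steps, which would cause the bin files to be regenerated.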
Thank you, @chhylp123 !
Following your suggestion, I ran a few experiments and it works as expected!
I've attached a few plots here showing how CPU, memory, and disk space are used throughout the process. Hopefully this is useful. For bin generation, I used 42 cores. For the actual assembly steps, I used 28 cores.
Btw, this --bin-only flag isn't documented anywhere, but I believe it should be. Here's the reason: you can see from the monitoring plots that the bin-generation stage is the main "bottleneck". It needs the most resources and lasts 16 hours. The assembly steps (~2 hours) use only a few threads most of the time and don't need as much memory either. For those of us who do computations in the cloud, we can reduce costs by using non-spot VMs for the bin-generation stage and then switching over to spot VMs configured with fewer resources.
Again, thank you for the suggestion!
Attachments:
AltModeUsingBinFiles.HighCoverage.monitoring.log.pdf
BinGeneration.HighCoverage.monitoring.log.pdf
HapModeUsingBinFiles.HighCoverage.monitoring.log.pdf
Steve
Hi!
After reading this I am still not sure how to reuse the bin files. Are all the generated bin files needed?
My -o includes the path. I tried both using the same file prefix in a different folder and rerunning in the same folder, and both times it seems the whole procedure is rerun. How should it be done?
Is there an easy way to check if the pipeline is resuming or running from the beginning?
Thank you in advance,
Stelios
Basically, just rerun hifiasm with the same option for -o, and hifiasm will reuse the bin files. The log file will tell you if the bin files have been reused. If the pipeline is resuming, hifiasm will skip the whole error correction step without printing any k-mer histogram.
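Before rerunning, it can help to confirm that bin files actually exist for the exact value passed to -o. The sketch below assumes the bin files are named `<prefix>.<something>.bin` (for example `<prefix>.ec.bin`), which is how hifiasm names them as far as I know. The helpers `list_bins` and `has_bins` are hypothetical names used only for this example.

```shell
# Hypothetical helpers: look for bin files that belong to a given -o prefix.
# Assumes files are named "<prefix>.<something>.bin".

list_bins() {          # arg: the exact value passed to -o (path included)
    for f in "$1".*.bin; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

has_bins() {           # succeeds if at least one bin file matches the prefix
    [ -n "$(list_bins "$1")" ]
}

# Example use:
#   has_bins /data/run1/sample.asm && echo "bin files found, rerun should resume"
```

If `has_bins` fails for the prefix you are about to pass, hifiasm has nothing to reuse and will start from error correction again.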
This is not happening for me. I ran hifiasm again for a different sample, and when I tried to reuse the bin files of the first sample using the original -o prefix, it reran the whole pipeline. Maybe something got slightly mixed up; I will try again when it finishes. Thanks!