hifiasm
There are large differences in my results
Hi there,
I assembled a diploid plant genome from two cells of HiFi reads using the default Hi-C mode of hifiasm (0.15.4-r343). The expected genome size is 350M and the heterozygosity is 1.3%.
However, the sizes of the results look odd:
- Cell1 (70X): hap1 = 350M, hap2 = 367M, p_ctg = 377M
- Cell2 (70X): hap1 = 399M, hap2 = 342M, p_ctg = 409M
- --h1 hic1.fq.gz --h2 hic_2.fq.gz cell1.fq cell2.fq: hap1 = 613M, hap2 = 650M, p_ctg = 691M

I have already tried the -l parameter, and the result was the same.
Could you please give some advice on how to deal with this? Thanks.
By the way, I also estimated the genome size with jellyfish. With the combined HiFi reads, the estimate is twice that of Cell1 or Cell2 alone, while Cell1 and Cell2 each give the same estimate of about 350M.
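A minimal sketch of one way a k-mer-based size estimate can double, assuming the common estimate of total k-mer occurrences divided by the depth of the homozygous peak. The histogram values below are hypothetical and chosen only for illustration; they are not measured from this dataset, and this is one possible explanation rather than a diagnosis.

```python
import math

def estimate_genome_size(hist, hom_peak_depth):
    """Haploid size estimate: total k-mer occurrences / homozygous peak depth."""
    total_kmers = sum(depth * count for depth, count in hist.items())
    return total_kmers / hom_peak_depth

# Hypothetical histogram {depth: number of distinct k-mers}; NOT real data.
# A heterozygous sample shows a het peak near half the depth of the hom peak.
toy_hist = {70: 8_000_000, 140: 346_000_000}

with_hom_peak = estimate_genome_size(toy_hist, 140)  # hom peak chosen correctly
with_het_peak = estimate_genome_size(toy_hist, 70)   # het peak mistaken for hom

print(f"hom peak: {with_hom_peak / 1e6:.0f} Mb, het peak: {with_het_peak / 1e6:.0f} Mb")
```

If the peak is picked at half the true homozygous depth, the same histogram gives exactly twice the size, so it may be worth checking which peak was used for the combined reads.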
Do you mean bp.hap1 and bp.hap2 without Hi-C are around 350Mb in size, while hic.hap1 and hic.hap2 with Hi-C are around 700Mb in size?
I mean the results of Hi-C + Cell1 or Hi-C + Cell2 are around 350Mb, but hic.hap1 and hic.hap2 are around 700Mb when running with Hi-C + Cell1 + Cell2.
For the 700Mb assemblies, I guess hifiasm misidentified the hom peak automatically (see: https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#hifiasm-log-interpretation, which only works for v0.15.5). In this case, you should set --hom-cov manually. As for the two unbalanced haplotypes, please set a smaller value for -s (default: 0.55) because of the high heterozygosity rate.
You can rerun v0.15.5 on top of v0.15.4 bin files.
Thanks. I will reinstall v0.15.5 and try it again.
Thanks for your help; hic.hap1 and hic.hap2 now seem OK. Can I perform a farther anchor of p_ctg, hap1, and hap2 with those Hi-C reads? I also got unexpected pseudo-chromosomes when using ALLHiC.
What is farther anchor? I guess you can do scaffolding with any scaffolder. You can try scaffolding each haplotype separately, or scaffolding both haplotypes jointly.
I have tried ALLHiC to scaffold our genome with the Hi-C data, and the result seems odd.
The chromosome IDs and numbers of bases for p_ctg from Cell1 + Cell2 + HiC are as follows:
g1 26258780
g2 12498225
g3 11551982
g4 11216680
g5 10688172
g6 10306402
g7 4796867
g8 3184290
g9 395536
The results for the p_ctg from the Cell1 reads are as follows:
g1 66343369
g2 43337939
g3 33907017
g4 33184760
g5 32353240
g6 27289111
g7 26985006
g8 26914006
g9 17971427
These are the commands I ran. What can I do to improve the assembly?
hifiasm -o hic -t 38 --hom-cov 150 -s 0.30 --h1 hic_1.fq.gz --h2 hic_2.fq.gz cell1_hifi.fq.gz cell2_hif.fq.gz
hifiasm -o cell1 -t 38 cell1_hifi.fq.gz
Thank you, Yilei
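To put the two ALLHiC tables above side by side, here is a small sketch that sums the anchored bases in each run and compares them with the expected genome size of 350M stated in the first post. The lengths are copied from the tables; the variable and function names are only illustrative.

```python
# Scaffold lengths (bp) copied from the two ALLHiC tables above.
combined = [26258780, 12498225, 11551982, 11216680, 10688172,
            10306402, 4796867, 3184290, 395536]
cell1 = [66343369, 43337939, 33907017, 33184760, 32353240,
         27289111, 26985006, 26914006, 17971427]

EXPECTED_GENOME_BP = 350_000_000  # expected size from the first post

def anchored_summary(lengths, expected=EXPECTED_GENOME_BP):
    """Return total anchored bases and their fraction of the expected size."""
    total = sum(lengths)
    return total, total / expected

for label, lengths in (("Cell1 + Cell2 + HiC", combined), ("Cell1", cell1)):
    total, fraction = anchored_summary(lengths)
    print(f"{label}: {total / 1e6:.1f} Mb in {len(lengths)} groups "
          f"({fraction:.0%} of the expected 350 Mb)")
```

The combined run anchors about 90.9 Mb into the nine groups, while the Cell1 run anchors about 308.3 Mb, which is much closer to the expected size.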
If ALLHiC does not work, you can probably use SALSA2 or 3D-DNA. I guess any scaffolder should work, as the contigs are already pretty good?
The SALSA2 scaffolds of p_ctg (Cell1 + Cell2 + HiC) are the same as the ALLHiC results for p_ctg from Cell1 + Cell2 + HiC. Besides, the scaffolds of the Cell1 p_ctg are the same as those from scaffolding the contigs that Canu assembled from the Cell1 + Cell2 HiFi reads.
It seems that the contigs from a single cell are better than the assembly from two cells.
Just to make sure: do you scaffold on *hic.hap1* and *hic.hap2* individually?
Hi @chhylp123,
I just want to know which data I should use for downstream analysis after getting *hic.hap1*, *hic.hap2*, and *hic.p_ctg*.
- Scaffold on them individually and use just one of them for gene annotation?
- Scaffold on *hic.hap1* and *hic.hap2* individually and annotate genes on each individually?
- Scaffold on *hic.hap1* and *hic.hap2* individually, join them together, and use the union for gene annotation? ...
Best, Kun
For scaffolding, scaffolding on *hic.hap1* and *hic.hap2* individually should be easier and is OK. I also recommend trying to scaffold both of them together for comparison. I'm not familiar with annotation, so I have no advice on that.
Hi, actually, the key problem is that I still don't know why we need a haplotype-resolved assembly, or why p_ctg is not enough. Can someone explain clearly? Best, Kun
- Sun, X., Jiao, C., Schwaninger, H. et al. Phased diploid genome assemblies and pan-genomes provide insights into the genetic history of apple domestication.
- Chen H et al., Allele-aware chromosome-level genome assembly and efficient transgene-free genome editing for the autotetraploid cultivated alfalfa.
- Garg S et al. Chromosome-scale, haplotype-resolved assembly of human genomes.
- Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li H. (2021) Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm.
It depends on your application. But usually a haplotype-resolved assembly should be better.
Just to make sure: do you scaffold on *hic.hap1* and *hic.hap2* individually?
Yes, they were scaffolded individually, and all of the jobs followed the default pipeline. The ALLHiC group mentioned that the initial assembly may have many chimeric contigs75.
Could HiFi reads from different cells produce more chimeric contigs in this situation? Or does my sample have some problem, rather than the software?
I don't think hifiasm will lead to so many chimeric contigs. And you shouldn't mix reads from two samples to do assembly, which may affect the assembly quality.
If you scaffold the haplotypes individually, any scaffolder, such as 3D-DNA or SALSA2, should work. I guess the unique feature of ALLHiC is that it can assign each contig to one of the two haplotypes. However, hifiasm has already done that. You can also run ALLHiC to scaffold both hap1 and hap2 at once for comparison.