hifiasm There are large differences in my results

Open Github-Yilei opened this issue 2 years ago • 20 comments

Hi there,

I assembled a diploid plant genome from two cells of HiFi reads using the default Hi-C mode in hifiasm (0.15.4-r343). The expected genome size is 350M and the heterozygosity is 1.3%.

However, the sizes of the results were weird:

Cell1 (70X): hap1 = 350M, hap2 = 367M, p_ctg = 377M
Cell2 (70X): hap1 = 399M, hap2 = 342M, p_ctg = 409M
--h1 hic1.fq.gz --h2 hic_2.fq.gz cell1.fq cell2.fq: hap1 = 613M, hap2 = 650M, p_ctg = 691M

I have already tried the -l parameter, and the result was the same.

Could you please give some advice on how to deal with this situation? Thanks.

Aug 05 '21 10:08 Github-Yilei

By the way, I tried to estimate the genome size with jellyfish using the combined HiFi reads, and the estimate is twice that obtained from Cell1 or Cell2 alone. Meanwhile, the estimated genome size is the same for Cell1 and Cell2: 350M.

Aug 05 '21 10:08 Github-Yilei
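
A k-mer based genome size estimate of the kind mentioned above is usually computed as the total number of k-mer occurrences divided by the depth of the homozygous peak. The sketch below is only an illustration of that arithmetic, not the poster's actual procedure: the histogram values, the `min_depth` cutoff, and the function name are all hypothetical. It shows one possible way a doubled estimate can arise, namely when a peak at half the true homozygous depth is used as the divisor.

```python
def estimate_genome_size(histogram, hom_peak, min_depth=5):
    """Estimate haploid genome size as total k-mer occurrences / homozygous peak depth.

    histogram maps k-mer depth -> number of distinct k-mers observed at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (min_depth=5 is an arbitrary assumption for this toy example).
    """
    if hom_peak <= 0:
        raise ValueError("hom_peak must be positive")
    total_occurrences = sum(
        depth * count for depth, count in histogram.items() if depth >= min_depth
    )
    return total_occurrences / hom_peak


# Toy histogram (made-up numbers): error k-mers at depth 1-2,
# a heterozygous peak at depth 35 and a homozygous peak at depth 70.
toy_histogram = {1: 1000, 2: 200, 35: 100, 70: 300}

correct = estimate_genome_size(toy_histogram, hom_peak=70)
doubled = estimate_genome_size(toy_histogram, hom_peak=35)
print(correct, doubled)  # 350.0 700.0
```

Dividing by the heterozygous peak (35) instead of the homozygous peak (70) doubles the estimate, so it is worth checking which peak the tool selected before trusting the number.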

Do you mean bp.hap1 and bp.hap2 without Hi-C are around 350Mb in size, while hic.hap1 and hic.hap2 with Hi-C are around 700Mb in size?

Aug 05 '21 13:08 chhylp123

> Do you mean bp.hap1 and bp.hap2 without Hi-C are around 350Mb in size, while hic.hap1 and hic.hap2 with Hi-C are around 700Mb in size?

I mean that the results of Hi-C + Cell1 or Hi-C + Cell2 are around 350Mb, but hic.hap1 and hic.hap2 are around 700Mb when running with Hi-C + Cell1 + Cell2.

Aug 05 '21 13:08 Github-Yilei

For the 700Mb assemblies, I guess hifiasm misidentified the hom peak when detecting it automatically (see: https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#hifiasm-log-interpretation, which only works for v0.15.5). In this case, you should set --hom-cov manually. As for the two unbalanced haplotypes, please set a smaller value for -s (default: 0.55) because of the high heterozygosity rate.

Aug 05 '21 14:08 chhylp123
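
To make the advice about setting --hom-cov manually more concrete, here is a minimal sketch of how one might pick a value from a k-mer depth histogram. It is not how hifiasm itself detects the peak; the histogram is synthetic, the function names are hypothetical, and the expected coverage of 140 is an assumption based on combining two cells of about 70X each. Real histograms are noisy and would normally need smoothing before looking for local maxima.

```python
def local_peaks(histogram, min_depth=5):
    """Return depths that are strict local maxima, skipping low-depth error k-mers."""
    peaks = []
    for depth in sorted(histogram):
        if depth < min_depth:
            continue
        count = histogram[depth]
        # Missing neighbouring depths are treated as a count of zero.
        if count > histogram.get(depth - 1, 0) and count > histogram.get(depth + 1, 0):
            peaks.append(depth)
    return peaks


def choose_hom_peak(peaks, expected_coverage):
    """Pick the peak closest to the coverage expected from the sequencing yield."""
    if not peaks:
        raise ValueError("no peaks found")
    return min(peaks, key=lambda depth: abs(depth - expected_coverage))


# Synthetic histogram: errors at depth 1-3, a het peak near 70, a hom peak near 140.
histogram = {
    1: 5000, 2: 900, 3: 200,
    68: 40, 69: 70, 70: 90, 71: 65, 72: 35,
    138: 150, 139: 260, 140: 310, 141: 250, 142: 140,
}

peaks = local_peaks(histogram)
hom_cov = choose_hom_peak(peaks, expected_coverage=140)
print(peaks, hom_cov)  # [70, 140] 140
```

The chosen depth is the kind of value one would then pass to --hom-cov. If the automatic choice reported in the hifiasm log sits near the lower peak instead, that would be consistent with the roughly doubled assembly size described in this thread.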

You can rerun v0.15.5 on top of v0.15.4 bin files.

Aug 05 '21 14:08 chhylp123

Thanks. I will reinstall v0.15.5 and try it again.

Aug 05 '21 14:08 Github-Yilei

Thanks for your help; hic.hap1 and hic.hap2 seem OK. Can I perform a farther anchor of p_ctg, hap1, and hap2 with those Hi-C reads? I also got an unexpected result for the pseudo-chromosomes when using ALLHiC.

Aug 07 '21 12:08 Github-Yilei

What is farther anchor? I guess you can do scaffolding with any scaffolder. You can try scaffolding each haplotype separately, or joint scaffolding on top of both haplotypes.

Aug 10 '21 01:08 chhylp123

I have tried ALLHiC to scaffold our genome with Hi-C data, and the result seems weird. The chromosome IDs and the number of bases of p_ctg from Cell1 + Cell2 + HiC are as follows:

g1 26258780
g2 12498225
g3 11551982
g4 11216680
g5 10688172
g6 10306402
g7 4796867
g8 3184290
g9 395536

While the results for the p_ctg from Cell1 reads are as follows:

g1 66343369
g2 43337939
g3 33907017
g4 33184760
g5 32353240
g6 27289111
g7 26985006
g8 26914006
g9 17971427

These are the commands I ran. What can I do to improve the assembly?

hifiasm -o hic -t 38 --hom-cov 150 -s 0.30 --h1 hic_1.fq.gz --h2 hic_2.fq.gz cell1_hifi.fq.gz cell2_hif.fq.gz
hifiasm -o cell1 -t 38 cell1_hifi.fq.gz

Cell1.log combined.log

Thank you, Yilei

Aug 10 '21 03:08 Github-Yilei
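
One quick way to compare the two scaffolding results above is to sum the group sizes and express them as a fraction of the expected 350M genome size stated in the first post. The sketch below does only that arithmetic; the variable and function names are hypothetical, and treating g1 to g9 as the anchored pseudo-chromosome groups is an assumption about the ALLHiC output.

```python
# Expected genome size from the first post (350M).
EXPECTED_GENOME_SIZE = 350_000_000

# Group sizes in bases, copied from the post above.
combined_groups = {
    "g1": 26258780, "g2": 12498225, "g3": 11551982,
    "g4": 11216680, "g5": 10688172, "g6": 10306402,
    "g7": 4796867, "g8": 3184290, "g9": 395536,
}
cell1_groups = {
    "g1": 66343369, "g2": 43337939, "g3": 33907017,
    "g4": 33184760, "g5": 32353240, "g6": 27289111,
    "g7": 26985006, "g8": 26914006, "g9": 17971427,
}


def anchored_fraction(groups, expected_size):
    """Fraction of the expected genome size covered by the summed group lengths."""
    return sum(groups.values()) / expected_size


for label, groups in (("Cell1 + Cell2 + HiC", combined_groups), ("Cell1", cell1_groups)):
    total = sum(groups.values())
    fraction = anchored_fraction(groups, EXPECTED_GENOME_SIZE)
    print(f"{label}: {total} bp in groups, {fraction:.1%} of expected size")
```

With these numbers the Cell1 groups add up to roughly 308 Mb, while the groups from the combined assembly add up to only about 91 Mb, which makes the gap between the two scaffolding runs easy to see.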

If ALLHiC does not work, you can probably use salsa2 or 3d-dna. I guess any scaffolder should work, as the contigs are already pretty good?

Aug 11 '21 02:08 chhylp123

The Salsa2 scaffolds of p_ctg (Cell1 + Cell2 + HiC) are the same as the results of p_ctg from Cell1 + Cell2 + HiC. Besides, the scaffolds of the Cell1 p_ctg are the same as those obtained by scaffolding the contigs that Canu assembled from the Cell1 + Cell2 HiFi reads.

It seems that the contigs from a single cell are better than those from the assembly with 2 cells.

Aug 11 '21 03:08 Github-Yilei

Just to make sure: do you scaffold on *hic.hap1* and *hic.hap2* individually?

Aug 11 '21 14:08 chhylp123

Hi @chhylp123 ,

I just want to know what data I should use for downstream analysis after getting *hic.hap1*, *hic.hap2*, and *hic.p_ctg*.

Scaffold them individually and use just one of them for gene annotation?
Scaffold *hic.hap1* and *hic.hap2* individually and annotate genes in each individually?
Scaffold *hic.hap1* and *hic.hap2* individually, join them together, and use the union for gene annotation? ...

Best, Kun

Aug 11 '21 15:08 xiekunwhy

For scaffolding, scaffolding *hic.hap1* and *hic.hap2* individually should be easier and is OK. I also recommend trying to scaffold both of them together for comparison. I'm not familiar with annotation, so I have no suggestions for that.

Aug 11 '21 15:08 chhylp123

Hi, actually, the key problem is that I still don't know why we need a haplotype-resolved assembly, or why p_ctg is not enough. Can someone explain clearly? Best, Kun

Aug 11 '21 20:08 xiekunwhy

> Hi, actually, the key problem is that I still don't know why we need a haplotype-resolved assembly, or why p_ctg is not enough. Can someone explain clearly? Best, Kun

Sun, X., Jiao, C., Schwaninger, H. et al. Phased diploid genome assemblies and pan-genomes provide insights into the genetic history of apple domestication.
Chen, H. et al. Allele-aware chromosome-level genome assembly and efficient transgene-free genome editing for the autotetraploid cultivated alfalfa.
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes.
Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li, H. (2021) Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm.

Aug 12 '21 02:08 Github-Yilei

It depends on your applications. But usually haplotype-resolved assembly should be better.

Aug 12 '21 03:08 chhylp123

> Just to make sure: do you scaffold on *hic.hap1* and *hic.hap2* individually?

Yes, they were scaffolded individually, and all of the jobs followed the default pipeline. The ALLHiC group mentioned that the initial assembly may have many chimeric contigs75.

Could HiFi reads from different cells produce more chimeric contigs in this situation? Or does my sample simply have a problem, rather than the software?

Aug 12 '21 03:08 Github-Yilei

I don't think hifiasm will lead to so many chimeric contigs. And you shouldn't mix reads from two samples to do assembly, which may affect the assembly quality.

Aug 12 '21 03:08 chhylp123

If you scaffold different haplotypes individually, any scaffolder, such as 3D-DNA or salsa2, should work. I guess the unique feature of AllHiC is that it can assign each contig to one of the two haplotypes. However, hifiasm has already done that. You can also run AllHiC to scaffold both hap1 and hap2 at once for comparison.

Aug 12 '21 03:08 chhylp123
