hifiasm Drastic differences in haplotype size with trio

I ran hifiasm (trio mode) on 3 SMRT cells of diploid CCS data for an F1 hybrid for which I have parental Illumina data. I expect two haplotypes of around 2.8 Gb each; however, the haplotype sizes are drastically different:

dip.hap1.p_ctg.fa number of contigs/scaffolds:687 assembly size:2160070491 largest contig/scaffold:90943655 N50:30115233

dip.hap2.p_ctg.fa number of contigs/scaffolds:815 assembly size:4804195238 largest contig/scaffold:116454869 N50:36604256

I have also binned the data with meryl and tried another assembler. The HiFi read binning was successful and split the data almost exactly in half based on the parental short reads. However, that assembly's contiguity is worse than hifiasm's, so I would prefer to fix this issue in hifiasm. Are there suggested parameter modifications?

Oct 08 '21 13:10 EvoMedLab

Could you please show the log file?

Oct 08 '21 17:10 chhylp123

I guess it might be caused by the incorrect homozygous coverage threshold.

Oct 08 '21 17:10 chhylp123

See attached log F1_hfsmtrio.log

Oct 08 '21 19:10 EvoMedLab

Thanks. Have you checked the hamming error rate of the two haplotypes with yak trioeval?

Oct 08 '21 19:10 chhylp123

Here are the yak trioeval results for both haplotypes yak_trioeval.log .

Oct 08 '21 19:10 EvoMedLab

The hamming error rates of hap1 and hap2 are: H 2932402 8167066 0.359052 and H 4924601 19621969 0.250974, which are too high. Are you sure you are using the right parental short reads?

Oct 08 '21 20:10 chhylp123

They should be. Could there be another cause?

Oct 08 '21 20:10 EvoMedLab

Might be. But could you please first run yak trioeval on the assemblies that were successfully generated by the other assembler? I want to figure out whether this is an issue with the parental data.

Oct 08 '21 20:10 chhylp123

Sure thing. This is for a Canu 2.2 assembly with pre-binned reads. Worse contiguity, but even haplotype lengths. yak_trioeval_2.log

Oct 08 '21 20:10 EvoMedLab

Thanks. For the Canu 2.2 assembly, the switch/hamming error rates are still around 25%. I guess something is wrong with your parental data.

Oct 08 '21 20:10 chhylp123

Let me investigate. I sincerely appreciate your time. Keep this open till I nail down the parental information.

Oct 08 '21 20:10 EvoMedLab

For example, the last two lines of the yak trioeval output are:

W 2671444 10788629 0.247617
H 3585307 10790067 0.332278

The W line gives the switch error rate (24.76%); the H line gives the hamming error rate (33.23%).
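
As a rough illustration of reading those two lines, here is a minimal Python sketch. It assumes the column layout seen in the lines quoted above (label, error count, total count, rate); the function name parse_trioeval_summary is made up for this example and is not part of yak.

```python
def parse_trioeval_summary(text):
    """Return {'W': switch_rate, 'H': hamming_rate} from yak trioeval output text.

    Assumed layout per summary line (inferred from the lines quoted in this
    thread): label, error count, total count, rate.
    """
    rates = {}
    for line in text.splitlines():
        fields = line.split()  # handles tabs or spaces
        if len(fields) >= 4 and fields[0] in ("W", "H"):
            errors, total = int(fields[1]), int(fields[2])
            # Recompute the rate from the counts instead of trusting column 4.
            rates[fields[0]] = errors / total
    return rates


# The two summary lines quoted in this thread.
example = "W 2671444 10788629 0.247617\nH 3585307 10790067 0.332278\n"
rates = parse_trioeval_summary(example)
print(f"switch error rate:  {rates['W']:.2%}")
print(f"hamming error rate: {rates['H']:.2%}")
```

Rates near 25% or above, as in this thread, point towards a problem with the parental data rather than with the assembler.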

Oct 08 '21 20:10 chhylp123

I tried the (very useful) Hi-C haplotype splitting approach, and it worked very well, so there must be an issue with the parents. Thank you for your help.

In that vein, is there a way to obtain the read information from the Hi-C data that was used to split haplotypes? I would like to use those reads directly in scaffolding without having to re-split them based on the newly created haplotypes.

Oct 11 '21 13:10 EvoMedLab

For clarity, I realize that hap1 and hap2 can each carry a mix of parental origins from chromosome to chromosome. However, due to the divergence between my parent species, the correct haplotype should be readily discernible.

If I can derive which reads were used to phase each haplotype along each contig, I can disentangle this without having to re-map to each haplotype and phase again using another algorithm, which is time-consuming (almost 1 billion Hi-C reads). If not, it would be a good feature request to bin the Hi-C reads per contig during phasing so we can immediately begin scaffolding.

Oct 11 '21 17:10 EvoMedLab

Hifiasm is not intended for your purpose. Please use bwa-mem or chromap to map Hi-C reads.
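
A minimal sketch of the bwa-mem route, written in Python so the commands can be built and inspected before running them. The file names (hap1.fa, hic_R1.fastq.gz, hic_R2.fastq.gz) and the helper names are placeholders for illustration, and the -5SP option set is a commonly used choice for Hi-C read pairs, not something prescribed in this thread.

```python
import shlex


def bwa_index_cmd(assembly_fa):
    """Command to index one haplotype assembly for bwa."""
    return ["bwa", "index", assembly_fa]


def bwa_hic_cmd(assembly_fa, reads_r1, reads_r2, threads=16):
    """Command to map paired Hi-C reads to one haplotype assembly.

    -5SP is an assumption here (a common Hi-C setting): -5 marks the
    5'-most alignment as primary, -S skips mate rescue, -P skips pairing.
    """
    return ["bwa", "mem", "-5SP", "-t", str(threads),
            assembly_fa, reads_r1, reads_r2]


# Placeholder file names; substitute your own assembly and Hi-C reads.
index_cmd = bwa_index_cmd("hap1.fa")
map_cmd = bwa_hic_cmd("hap1.fa", "hic_R1.fastq.gz", "hic_R2.fastq.gz", threads=32)

# Print the commands for inspection; run them with subprocess.run(cmd, check=True)
# once bwa is installed and the paths are correct (bwa mem writes SAM to stdout).
print(shlex.join(index_cmd))
print(shlex.join(map_cmd))
```

The same pair of commands would be repeated for the second haplotype assembly.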

Oct 11 '21 17:10 lh3
