hifiasm
Drastic differences in haplotype size with trio
I ran hifiasm (trio) on 3 SMRT cells of CCS data from a diploid F1 hybrid for which I have parental Illumina data. I expected two haplotypes of around 2.8 Gb each; however, the haplotype sizes are drastically different:
dip.hap1.p_ctg.fa number of contigs/scaffolds:687 assembly size:2160070491 largest contig/scaffold:90943655 N50:30115233
dip.hap2.p_ctg.fa number of contigs/scaffolds:815 assembly size:4804195238 largest contig/scaffold:116454869 N50:36604256
I have also binned the data with meryl and tried another assembler. The HiFi read binning was successful and split the data almost exactly in half based on the parental short reads. However, that assembly is less contiguous than the hifiasm one, so I would prefer to fix this issue in hifiasm. Are there suggested parameter modifications?
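As a quick sanity check on the numbers above, the following sketch compares each reported haplotype size with the expected size. The 2.8 Gb expectation is the poster's own estimate; the function name `size_ratios` is hypothetical and only for illustration.

```python
# Assembly sizes (bp) as reported in the stats lines above.
ASSEMBLY_SIZES = {
    "hap1": 2_160_070_491,
    "hap2": 4_804_195_238,
}

# Expected haplotype size, taken from the poster's estimate of ~2.8 Gb.
EXPECTED_HAPLOTYPE_BP = 2.8e9


def size_ratios(sizes, expected_bp):
    """Return each assembly's size as a fraction of the expected size."""
    return {name: size / expected_bp for name, size in sizes.items()}


ratios = size_ratios(ASSEMBLY_SIZES, EXPECTED_HAPLOTYPE_BP)
for name, ratio in ratios.items():
    print(f"{name}: {ASSEMBLY_SIZES[name] / 1e9:.2f} Gb, {ratio:.0%} of expected")
```

A ratio far from 100% for both haplotypes, in opposite directions, suggests that sequence was assigned to the wrong haplotype rather than lost from the assembly.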
Could you please show the log file?
I guess it might be caused by the incorrect homozygous coverage threshold.
See attached log F1_hfsmtrio.log
Thanks. Have you checked the hamming error rate of the two haplotypes with yak trioeval?
Here are the yak trioeval results for both haplotypes yak_trioeval.log .
The hamming error rates of hap1 and hap2 are H 2932402 8167066 0.359052 and H 4924601 19621969 0.250974, which are too high. Are you sure you are using the right parental short reads?
They should be. Could there be another cause?
Might be. But could you please first run yak trioeval on the assemblies that were successfully generated by the other assembler? I want to figure out whether this is an issue with the parental data.
Sure thing. This is for a canu2.2 assembly with prior-binned reads. Worse contiguity, but even haplotype lengths. yak_trioeval_2.log
Thanks. For the canu2.2 assembly, the switch/hamming error rates are still around 25%. I guess something is wrong with your parental data.
Let me investigate. I sincerely appreciate your time. Keep this open till I nail down the parental information.
For example, the last two lines of the yak trioeval output are:
W 2671444 10788629 0.247617
H 3585307 10790067 0.332278
The W line means the switch error rate is 24.76%; the H line means the hamming error rate is 33.23%.
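For anyone checking several assemblies, here is a minimal sketch for pulling these two rates out of a saved yak trioeval log. It assumes the summary lines are whitespace-separated, with the error count in the second column and the total in the third, which is consistent with the numbers shown above. The function name `parse_trioeval_summary` is hypothetical.

```python
def parse_trioeval_summary(text):
    """Collect the W (switch) and H (hamming) summary lines.

    Returns a dict mapping "W"/"H" to (errors, total, rate).
    Assumes: column 2 is the error count and column 3 is the total.
    """
    summary = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] in ("W", "H"):
            errors, total = int(fields[1]), int(fields[2])
            if total > 0:  # skip degenerate lines to avoid division by zero
                summary[fields[0]] = (errors, total, errors / total)
    return summary


# The two summary lines quoted in this thread.
example = """W 2671444 10788629 0.247617
H 3585307 10790067 0.332278"""

LABELS = {"W": "switch", "H": "hamming"}
for tag, (errors, total, rate) in parse_trioeval_summary(example).items():
    print(f"{LABELS[tag]} error rate: {rate:.2%} ({errors}/{total})")
```

To use it on a real log, read the file contents and pass them in place of `example`.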
I tried using the (very useful) Hi-C haplotype splitting approach, and it worked very well, so there must be an issue with the parents. Thank you for your help
In that vein, is there a way to obtain the read information from the Hi-C data that was used to split haplotypes? I would like to use those reads directly in scaffolding without having to re-split them based on the newly created haplotypes.
For clarity, I realize that hap1 and hap2 can potentially be mixed parentals chromosome to chromosome. However, due to the divergence between my parent species, the correct haplotype should be readily discernible.
If I can derive which reads were used to phase the haplotypes for both haps along each contig, I can disentangle this without having to re-map to each haplotype and phase again using another algorithm, which is time consuming (almost 1 billion Hi-C reads). If not, it would be a good feature request to bin the reads when doing the Hi-C per-contig so we can immediately begin scaffolding.
Hifiasm is not intended for your purpose. Please use bwa-mem or chromap to map Hi-C reads.