High BUSCO duplication rate for low coverage HiFi data

Open BitaoQiu opened this issue 1 year ago • 5 comments

Dear authors,

I am using hifiasm to assemble our HiFi read data (DNA from a single individual per species). However, we notice that the duplication rate (based on BUSCO) is negatively correlated with the estimated read coverage. A comparison with a previously published short-read assembly further confirmed that the high duplication rate in our low-coverage assemblies is not species-specific (see the table below). I think the correlation arises because low sequencing depth makes heterozygous regions difficult to assemble.

We are now using Purge_Dups to remove potential false duplications, but are there any other parameters in hifiasm that can account for sequencing depth?

| Species | Sex | Estimated coverage | N50 | # contig | Estimated genome size | Contig length | Completeness | Singleton | Duplicated | Fragment | Missing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Odontotermes | F. | 28 | 5800000 | 1291 | 1194220328 | 1449477267 | 99.4% | 97.4% | 2% | 0.4% | 0.2% |
| Trinervitermes | U. | 10 | 605000 | 6072 | 348834812 | 1726917765 | 98.1% | 83% | 15.1% | 1% | 0.9% |
| C. secundus | U | 13 | 1143825 | 3705 | 1182129932 | 1296819224 | 99.5% | 90.1% | 9.4% | 0.2% | 0.3% |
| C. secundus (Short reads, reference) | Mix | | 1184893 | 55483 | 1182129932 | 1018932804 | 98.8% | 97% | 1.8% | 0.7% | 0.5% |
| Macrotermes bellicosus PNP | M | 12 | 564726 | 5671 | 1214356371 | 1411697024 | 99.3% | 89% | 10.3% | 0.4% | 0.3% |
| Macrotermes bellicosus (HiFi, reference) | | 21 | 11 MB | 428 | 1113805679 | 1341469195 | 99.7% | 96.5% | 3.2% | 0 | 0.3% |
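
As a rough check of the relationship described above, the coverage and duplication values of the five HiFi assemblies in the table can be plugged into a Pearson correlation. This is only a sketch: the function and variable names are made up for illustration, the short-read reference row is left out because it has no coverage estimate, and five data points are too few for a robust statistic.

```python
import math

# (estimated coverage, BUSCO duplicated %) for the HiFi assemblies in the table.
# The short-read C. secundus reference is excluded: it has no coverage estimate.
hifi_assemblies = {
    "Odontotermes": (28, 2.0),
    "Trinervitermes": (10, 15.1),
    "C. secundus": (13, 9.4),
    "Macrotermes bellicosus PNP": (12, 10.3),
    "Macrotermes bellicosus (HiFi, reference)": (21, 3.2),
}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


coverage = [cov for cov, _ in hifi_assemblies.values()]
duplicated = [dup for _, dup in hifi_assemblies.values()]
r = pearson_r(coverage, duplicated)
print(f"Pearson r between coverage and BUSCO duplication: {r:.2f}")
```

A strongly negative value here is consistent with the observation in the post, but it says nothing about the cause.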

BitaoQiu, Sep 28 '22 16:09

For the coverage issue, I guess so. The built-in purge_dups step of hifiasm can be tuned as described at https://hifiasm.readthedocs.io/en/latest/faq.html#p-large, which may work better in some cases.

chhylp123, Sep 28 '22 18:09

Thank you for the reply. However, after setting --purge-max to 20 (twice the homozygous read coverage estimated from k-mers) and -s to 0.3, the BUSCO duplication rate still remains very high (15%).
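
For reference, the settings mentioned above could be assembled into a hifiasm command line roughly as follows. This is a minimal sketch, not the exact command that was run: the helper name, file names, output prefix, and thread count are placeholders, and the "twice the homozygous coverage" rule is simply the heuristic described in this comment. Only the `-o`, `-t`, `--purge-max`, and `-s` options of hifiasm are used.

```python
import shlex


def build_hifiasm_cmd(reads, prefix, hom_cov, similarity=0.3, threads=16):
    """Build a hifiasm argument list (hypothetical helper for illustration).

    --purge-max is set to twice the homozygous read coverage, following the
    heuristic used in this thread; -s is the similarity threshold for
    duplicate haplotigs.
    """
    purge_max = 2 * hom_cov
    return [
        "hifiasm",
        "-o", prefix,
        "-t", str(threads),
        "--purge-max", str(purge_max),
        "-s", str(similarity),
        reads,
    ]


# Placeholder inputs: homozygous coverage of 10, as for the ~10x sample.
cmd = build_hifiasm_cmd("reads.fq.gz", "trinervitermes.asm", hom_cov=10)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps the arguments unambiguous if the list is later passed to `subprocess.run`.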

We also tested genome assembly with HiCanu. With correctedErrorRate=0.105 (https://canu.readthedocs.io/en/latest/parameter-reference.html#correctederrorrate), we got a BUSCO duplication rate of 3.9%, which seems more reasonable to us.

BitaoQiu, Sep 30 '22 12:09

Are the assemblies of HiCanu more contiguous than those of hifiasm?

chhylp123, Sep 30 '22 18:09

It seems so, but of course this holds only for the species with very low sequencing coverage.

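| Species | Method | N50 | # contig | Estimated genome size | Contig length | Completeness | Singleton | Duplicated | Fragment | Missing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trinervitermes (10x) | Hifiasm | 605 KB | 6072 | X | 1726917765 | 98.1% | 83% | 15.1% | 1% | 0.9% |
| | Hifiasm + purge | 716 KB | 3610 | | 1457166292 | 96.9% | 94.8% | 2.1% | 1.5% | 1.6% |
| | Canu | 734 KB | 10555 | | 1593394689 | 98.6% | 94.7% | 3.9% | 1.1% | 0.3% |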

BitaoQiu, Sep 30 '22 19:09

Thanks a lot. That is reasonable, as we haven't optimized hifiasm for such low read coverage.

chhylp123, Oct 03 '22 04:10