hifiasm

Assembling a social insect species that has only one set of chromosomes and lacks paternal chromosomes (like a gamete genome)

Open yshcai opened this issue 2 years ago • 4 comments

I am assembling a haploid insect genome using hifiasm v0.16.1-r375. Note that my species has only one set of chromosomes and lacks paternal chromosomes, because it develops from a single unfertilized egg, which produces fewer than one hundred offspring. These offspring have one set of chromosomes; this biological phenomenon is parthenogenesis, which is common in social insects such as bees. I sampled offspring that developed from one unfertilized egg for sequencing in order to reduce the heterozygosity rate. Hence, I think the sample is nearly homozygous. I also performed k-mer analysis using jellyfish (k=17) and ran GenomeScope2 in haploid mode (haploidy_linear_plot) and diploid mode (diploidy_linear_plot).

The results indicate that this sample has low heterozygosity and that the estimated genome size is 121 Mb.

However, the k-mer analysis using Illumina data gave an estimated genome size of 135 Mb, with a single peak at a depth of 64. (image)

OK, I won't pay too much attention to the different results between Illumina short reads and HiFi reads. I assembled this haploid genome using hifiasm (hifiasm -o Mcin_WL302.asm --primary -t32 -l0 Mcin_WL302.ccs.fastq.gz), and the log file is hifiasm_asm.log.

I think the k-mer plot in the log file looks weird, and I have the following questions:

  1. I notice there is another, much smaller peak at 30, but I don't know whether this is the heterozygous read coverage, because hifiasm prints [M::ha_pt_gen] peak_hom: 121; peak_het: -1;
  2. Which homozygous read coverage should I select? The log file reports a new homozygous read coverage after each round of read correction: [M::ha_ft_gen] peak_hom: 117; peak_het: -1, [M::ha_pt_gen] peak_hom: 114; peak_het: 30, [M::ha_pt_gen] peak_hom: 115; peak_het: -1, [M::ha_pt_gen] peak_hom: 121; peak_het: -1, [M::ha_pt_gen] peak_hom: 121; peak_het: -1. I tried both --hom-cov 117 and --hom-cov 121, and in both cases the primary assembly size is ~150 Mb. The BUSCO evaluation gave the same result for both: C:99.5%[S:97.0%,D:2.5%],F:0.1%,M:0.4%,n:1367. So how should I tune options such as -D and -l?
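To compare the peaks reported across rounds, the values can be pulled out of the log programmatically. The following is a minimal sketch, assuming the log lines have exactly the format quoted above; `extract_peaks` is a hypothetical helper name, not part of hifiasm.

```python
import re

# Pattern modeled on the log lines quoted in this issue, e.g.
# "[M::ha_pt_gen] peak_hom: 121; peak_het: -1".
# Assumption: other hifiasm versions may format these lines differently.
PEAK_RE = re.compile(r"\[M::(\w+)\] peak_hom: (-?\d+); peak_het: (-?\d+)")


def extract_peaks(log_text):
    """Return (function, peak_hom, peak_het) tuples in the order they appear."""
    return [(fn, int(hom), int(het)) for fn, hom, het in PEAK_RE.findall(log_text)]


# The five lines quoted from hifiasm_asm.log in this issue.
SAMPLE_LOG = """\
[M::ha_ft_gen] peak_hom: 117; peak_het: -1
[M::ha_pt_gen] peak_hom: 114; peak_het: 30
[M::ha_pt_gen] peak_hom: 115; peak_het: -1
[M::ha_pt_gen] peak_hom: 121; peak_het: -1
[M::ha_pt_gen] peak_hom: 121; peak_het: -1
"""

if __name__ == "__main__":
    for fn, hom, het in extract_peaks(SAMPLE_LOG):
        # peak_het == -1 means no heterozygous peak was reported on that line.
        het_label = "none" if het < 0 else str(het)
        print(f"{fn}: peak_hom={hom}, peak_het={het_label}")
```

For a real run, the same function could be applied to the contents of hifiasm_asm.log read from disk.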

I feel very confused. Please give me some advice; I would really appreciate it.

yshcai avatar Jun 14 '22 14:06 yshcai

From my point of view, the assembly looks fine. Could you please let me know which metrics you think are not good? The peak_hom selected by hifiasm itself should be correct, so you don't need to change it. As for the assembly size, the size estimated from reads often tends to be smaller. Based on the BUSCO scores, I guess the 150 Mb might be correct?
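For readers who want to check the BUSCO numbers themselves, here is a minimal sketch that parses the one-line summary quoted in this issue. `parse_busco_summary` is a hypothetical helper, and it assumes the summary string has the same layout as the one above. A common rule of thumb, not something stated in this thread, is that a high duplicated fraction (D) would hint at redundant sequence inflating the assembly; here D is 2.5%.

```python
import re


def parse_busco_summary(summary):
    """Parse a BUSCO one-line summary into a dict of percentages plus n.

    Assumes the layout 'C:..%[S:..%,D:..%],F:..%,M:..%,n:...' as quoted
    in this issue.
    """
    result = {key: float(value)
              for key, value in re.findall(r"([CSDFM]):([\d.]+)%", summary)}
    n_match = re.search(r"n:(\d+)", summary)
    if n_match:
        result["n"] = int(n_match.group(1))
    return result


if __name__ == "__main__":
    scores = parse_busco_summary("C:99.5%[S:97.0%,D:2.5%],F:0.1%,M:0.4%,n:1367")
    print(scores)
    # Complete BUSCOs are the sum of single-copy and duplicated ones.
    print("S + D =", scores["S"] + scores["D"])
```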

chhylp123 avatar Jun 14 '22 19:06 chhylp123

Thank you for your reply! The assembly size is 152 Mb and the contig N50 is 8.8 Mb with option -l0. The result looks good. However, what confuses me is that the k-mer plot in the log file has a small peak at 30. I don't know whether this is a real heterozygous read coverage peak, because hifiasm does not actually identify it as a heterozygous peak (peak_het: -1). If this small peak is a heterozygous peak, I think I shouldn't use the option -l0, which is suited to homozygous samples like CHM13. For homozygous samples, there should be a single peak around the read coverage. I just want to know what causes this small peak. Maybe it's an error.
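As a rough sanity check on the small peak, one can compare its depth with the homozygous peak. The sketch below assumes the usual diploid expectation that heterozygous k-mers sit near half the homozygous depth. That expectation may not hold for this sample, and the function names and the 0.1 tolerance are illustrative choices, not hifiasm behavior.

```python
def het_to_hom_ratio(peak_hom, peak_het):
    """Return peak_het / peak_hom, or None if no het peak was reported (-1)."""
    if peak_het < 0 or peak_hom <= 0:
        return None
    return peak_het / peak_hom


def near_half(ratio, tolerance=0.1):
    """True if the ratio lies within `tolerance` of 0.5 (arbitrary threshold)."""
    return ratio is not None and abs(ratio - 0.5) <= tolerance


if __name__ == "__main__":
    # Values from the one log line in this issue that reports a het peak:
    # "[M::ha_pt_gen] peak_hom: 114; peak_het: 30"
    ratio = het_to_hom_ratio(114, 30)
    print(f"ratio = {ratio:.2f}, near half: {near_half(ratio)}")
```

With the values from the log, the ratio is about 0.26 rather than 0.5, so the peak at 30 does not sit where a typical diploid heterozygous peak would be expected under this assumption.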

yshcai avatar Jun 15 '22 08:06 yshcai

It might be somatic mutations or remaining heterozygous regions. In most cases, genomes are not fully homozygous.

chhylp123 avatar Jun 15 '22 14:06 chhylp123

I see. Thanks a lot!

yshcai avatar Jun 16 '22 02:06 yshcai