hifiasm icon indicating copy to clipboard operation
hifiasm copied to clipboard

Weird k-mers histogram : How to filter out /exclude k-mers under a depth threshold with hifiasm

Open bbalog87 opened this issue 2 years ago • 8 comments

Dear developers of this great assembler,

I have a question concerning a weird k-mers histogram with 3 peaks for a definitely diploid fish genome. Please find the attached k-mers histogram as output by hifiasm log: hifiasm_kmers_histo.txt.

As you can see, it has 3 peaks, which shouldn't normally be for the diploid genome. The heterozygosity rate is estimated to be 2.2% as computed by KAT, and the heterozygous peak is at 63, while the homozygous peak is at 131. I have set the --hom-cov accordingly, given the high heterozygosity of this genome. The k-mers with a depth < 25 are most likely from contamination and/or erroneous k-mers. Is it somehow possible to exclude these k-mers through parameters tuning in hifiasm prior to assembly? Hifiasm is predicting the heterozygous k-mer peak at 17, which is wrong since the heterozygous peak is 63 according to jellyfish k-mer histogram: 27mers-histogram_clean_HiFi.pdf

The assembly command :

hifiasm -o AB.asm \
          -t 91 \
	  --hg-size 1200m \
	  --hom-cov 131 -l3  -s 0.35 \
	  --n-perturb 75000 \
	  --f-perturb 0.15 \
	  --n-weight 5 \
	  --h1 $HICreads1 \
	  --h2 $HICreads2 \
	  $AB

Best, Julien

bbalog87 avatar Jan 19 '22 15:01 bbalog87

Did hifiasm successfully produce some contigs?

chhylp123 avatar Jan 19 '22 19:01 chhylp123

Yes @chhylp123 , hifiasm produced contigs but with a few issues making the assembly suboptimal:

  • The assembly size is much larger (~200 Mb) than the estimated genome size (1200 Mb)
  • The two haplotypes are extremely unbalanced with more than a 25% length difference
  • High level (>25%) of duplication rate in Hap1 and Primary assemblies

Primary assembly statistics:


     # of contigs:    2176
        Total length (nt):    1367163925
        Longest sequence (nt):    52058640
        Shortest sequence (nt):    10324
        Mean sequence length (nt):    628292
        Median sequence length (nt):    96502
        N50 sequence length (nt):    18173130
        L50 sequence count:    24
  Scores in BUSCO format:    C:97.5%[S:71.3%,D:26.2%],F:0.5%,M:2.0%,n:3640

Hap1 assembly statistics:

  # of contigs:    2742
        Total length (nt):    1376558787
        Longest sequence (nt):    49496490
        Shortest sequence (nt):    10324
        Mean sequence length (nt):    502027
        Median sequence length (nt):    92239
        N50 sequence length (nt):    9898874
        L50 sequence count:   39
     Scores in BUSCO format:    C:97.2%[S:70.1%,D:27.1%],F:0.5%,M:2.3%,n:3640

Hap2 assembly statistics:

        # of contigs:    1773
        Total length (nt):    1015158721
        Longest sequence (nt):    28026528
        Shortest sequence (nt):    10324
        Mean sequence length (nt):    572566
        Median sequence length (nt):    102852
        N50 sequence length (nt):    5249658
        L50 sequence count:    48
     Scores in BUSCO format:    C:91.0%[S:82.1%,D:8.9%],F:0.7%,M:8.3%,n:3640

bbalog87 avatar Jan 19 '22 20:01 bbalog87

I think excluding low coverage k-mers (depth <26) from the assembly process could fix the issue of imbalanced haplotypes. However, I'm not figuring out which default hifiasm parameter/option I should set to that end.

bbalog87 avatar Jan 19 '22 20:01 bbalog87

The peak at depth 16 looks strange. It could be contamination. I would extract contigs with ~16X coverage and study what is happening. Alternatively, you can downsample input reads to ~40X coverage and reassemble.

Also note that k-mer based methods often underestimate the genome size. That is not ground truth.

lh3 avatar Jan 19 '22 20:01 lh3

Is the coverage information of contigs recorded in one of the hifiasm output files? Otherwise, I'll need to map HiFi-reads to the assembly to infer the contigs coverage.

bbalog87 avatar Jan 21 '22 09:01 bbalog87

Yes, the rd:i in gfa is the coverage information (see https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#output-file-formats).

chhylp123 avatar Jan 21 '22 15:01 chhylp123

I have a confusion. The coverage information represents the kmer coverage information. The kmer coverage does not equal the reads coverage information according to the genomescope paper. Could you extract the kmer coverage information directly depending on the reads coverage information?

Jinke233 avatar Jun 23 '22 06:06 Jinke233

Sorry for the late reply. By the coverage information, do you mean rd:i? It is the local information of each unitig/contig, while the k-mer information is the global information at the whole genome-scale. I personally think the reads coverage information should be more accurate.

chhylp123 avatar Jun 30 '22 05:06 chhylp123