
Understanding how kmers are counted

Open priyanka-surana opened this issue 2 years ago • 0 comments

I want to understand how kmers are counted in FastK and how that affects the totals in MerquryFK calculations.

Why do the Total values in the completeness.stats and qv files differ so much? What do they represent, and how do they relate to each other? I ran MerquryFK on a single genome assembled from PacBio HiFi and Hi-C data, against an Illumina kmer dataset.

# mMelMel1_T1.qv 
Assembly	No Support	Total	Error %	QV
GCA_922984935.2.subset	6005	7999890	0.0024	46.2

# mMelMel1_T1.completeness.stats 
Assembly	Region	Found	Total	% Covered
GCA_922984935.2.subset	all	2268391877	2268397787	100.00
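
As a sanity check on how the two Total columns are used, the derived columns can be recomputed from the counts above. This is only a sketch: it assumes MerquryFK uses the same QV formula as Merqury (error rate = 1 - (shared / total)^(1/k)) and that k = 31, neither of which is stated in this issue.

```python
import math

K = 31  # assumed k-mer size; not stated in the issue

# Counts copied from the .qv and .completeness.stats output above
no_support, qv_total = 6005, 7_999_890
found, comp_total = 2_268_391_877, 2_268_397_787

# Merqury-style QV (assumed to apply to MerquryFK as well):
# fraction of assembly k-mers supported by reads -> per-base error rate
shared_fraction = 1 - no_support / qv_total
error_rate = 1 - shared_fraction ** (1 / K)
qv = -10 * math.log10(error_rate)

# Completeness: share of the read-derived k-mer set found in the assembly
covered_pct = 100 * found / comp_total

print(f"Error % = {100 * error_rate:.4f}, QV = {qv:.1f}")
print(f"% Covered = {covered_pct:.2f}")
```

Under those assumptions the script prints Error % = 0.0024, QV = 46.2 and % Covered = 100.00, matching the reported columns, so each Total acts as the denominator of its own table only.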

From Merqury, https://github.com/marbl/merqury/issues/84

The Total in QV is the number of kmers 'present' in the assembly. So if one specific kmer is found 3 times in the assembly, but never in the reads, it is counted as 3 error kmers (no support). The 3 error kmers are part of the Total.

The Total in completeness are distinct solid kmers in the reads. In other words, a kmer that is present over a certain frequency in the reads is counted as one kmer. I forgot how exactly the Total is computed in MerquryFK completeness. It's likely that it is only filtering out kmers with frequency of 1, which is the default in FastK? Might be a good question for Gene.
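
To make the two definitions quoted above concrete, here is a toy sketch. It is not how FastK or MerquryFK are implemented: the sequences, `k = 3`, and the `MIN_COUNT` threshold are made up for illustration, and canonical (reverse-complement) kmers are ignored for simplicity.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

K = 3          # toy k-mer size (hypothetical)
MIN_COUNT = 2  # hypothetical "solid" threshold: drop k-mers seen once in reads

assembly = "ACGTACGTTTT"
reads = ["ACGTACG", "ACGTACG", "CGTACGT", "GGGA"]

# Read k-mer table: k-mer -> number of occurrences across all reads
read_counts = Counter(km for r in reads for km in kmers(r, K))

# QV-style totals: every k-mer occurrence in the assembly counts,
# so a repeated unsupported k-mer is counted once per occurrence.
asm_kmers = kmers(assembly, K)
qv_total = len(asm_kmers)
no_support = sum(1 for km in asm_kmers if km not in read_counts)

# Completeness-style totals: distinct solid read k-mers, each counted once.
solid = {km for km, c in read_counts.items() if c >= MIN_COUNT}
completeness_total = len(solid)
found = len(solid & set(asm_kmers))

print("QV:", no_support, "of", qv_total, "assembly k-mers lack read support")
print("Completeness:", found, "of", completeness_total, "solid read k-mers found")
```

Here `TTT` occurs twice in the assembly and never in the reads, so it adds 2 to both No Support and the QV Total, while each solid read kmer adds exactly 1 to the completeness Total regardless of how often it occurs.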

We expected the opposite relationship: here the Total for QV is ~8M, whereas the Total for completeness is ~2.2B.

priyanka-surana, Oct 26 '22 13:10