FASTK
Understanding how kmers are counted
I want to understand how kmers are counted in FastK and how that affects the totals in MerquryFK calculations.
Why do the Total values in the completeness.stats and qv files differ so much? What do they represent, and how do they relate to each other? I ran MerquryFK on a single genome assembled from PacBio HiFi and Hi-C data, against an Illumina kmer dataset.
# mMelMel1_T1.qv
Assembly                No Support  Total    Error %  QV
GCA_922984935.2.subset  6005        7999890  0.0024   46.2
# mMelMel1_T1.completeness.stats
Assembly                Region  Found       Total       % Covered
GCA_922984935.2.subset  all     2268391877  2268397787  100.00
From Merqury,
The Total in QV counts the kmers that are 'present' in the assembly, with multiplicity. So if one specific kmer is found 3 times in the assembly but never in the reads, it is counted as 3 error kmers (No Support). Those 3 error kmers are part of the Total.
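To make the counting rule above concrete, here is a minimal toy sketch in Python. It is not MerquryFK's code: the helper names (`kmers`, `qv_style_totals`) are made up for illustration, and it ignores canonical (reverse-complement) kmers for simplicity. It only shows that every occurrence in the assembly contributes to the Total, and every unsupported occurrence contributes to No Support.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer in seq, one per position (duplicates included)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def qv_style_totals(assembly, read_kmers, k):
    """Return (no_support, total) counted per occurrence in the assembly.

    read_kmers is a set of k-mers considered present in the reads
    (hypothetical input; a real run would take this from a k-mer table).
    """
    asm_counts = Counter(kmers(assembly, k))
    # Total: every k-mer position in the assembly counts once.
    total = sum(asm_counts.values())
    # No Support: occurrences of k-mers that never appear in the reads.
    no_support = sum(c for km, c in asm_counts.items() if km not in read_kmers)
    return no_support, total

# "ACGT" occurs 3 times in this toy assembly and is absent from the reads.
no_support, total = qv_style_totals("ACGTACGTACGT", {"CGTA", "GTAC", "TACG"}, k=4)
print(no_support, total)
```

In this toy case the single missing kmer accounts for 3 of the 9 assembly kmers, mirroring the "found 3 times" example.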
The Total in completeness counts distinct solid kmers in the reads. In other words, a kmer that occurs above a certain frequency in the reads is counted as one kmer, however many times it occurs. I forgot exactly how the Total is computed in MerquryFK completeness. It's likely that it only filters out kmers with a frequency of 1, which is the default in FastK? Might be a good question for Gene.
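By contrast, the completeness-style count described above can be sketched as follows. Again this is only an illustration under stated assumptions: the function name `completeness_style_totals` is hypothetical, canonical kmers are ignored, and the `min_count=2` threshold simply encodes the guess that kmers seen once in the reads are dropped, which the thread does not confirm.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer in seq, one per position (duplicates included)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def completeness_style_totals(reads, assembly, k, min_count=2):
    """Return (found, total) over distinct solid read k-mers.

    min_count is an assumed threshold: k-mers seen fewer times than this
    in the reads are treated as non-solid and dropped.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    # Total: each solid k-mer counts once, whatever its frequency in the reads.
    solid = {km for km, c in read_counts.items() if c >= min_count}
    # Found: solid read k-mers that also occur somewhere in the assembly.
    asm_kmers = set(kmers(assembly, k))
    return len(solid & asm_kmers), len(solid)

reads = ["ACGTA", "ACGTA", "CGTAG"]  # ACGT x2, CGTA x3, GTAG x1
found, total = completeness_style_totals(reads, "ACGTT", k=4)
print(found, total)
```

Here `GTAG` is dropped as a singleton, so the Total is 2 distinct solid kmers, of which only `ACGT` is found in the toy assembly.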
We expected the opposite, because the Total for QV is ~8M whereas the Total for completeness is ~2.2B.
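As a sanity check on how the QV Total feeds into the reported numbers, the sketch below applies the formula published for Merqury, QV = -10 log10(1 - (1 - No Support / Total)^(1/k)), to the row in the .qv table. Two things are assumptions here, not statements from this thread: that MerquryFK uses the same formula, and that k = 31. With those assumptions the Error % and QV columns are reproduced.

```python
import math

def merqury_qv(no_support, total, k):
    """Return (per-base error rate, QV) from k-mer counts.

    Follows the formula published for Merqury; whether MerquryFK computes
    it identically is an assumption.
    """
    # Fraction of assembly k-mer occurrences that are supported by the reads.
    p_kmer_ok = 1 - no_support / total
    # A k-mer is correct only if all k of its bases are correct.
    error_rate = 1 - p_kmer_ok ** (1 / k)
    qv = -10 * math.log10(error_rate)
    return error_rate, qv

# Values from the mMelMel1_T1.qv row; k=31 is assumed, not given in the thread.
error_rate, qv = merqury_qv(6005, 7999890, k=31)
print(f"Error % = {100 * error_rate:.4f}, QV = {qv:.1f}")
```

Note that the completeness Total plays no part in this calculation: the two totals are counted over different kmer sets (assembly occurrences versus distinct read kmers).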