With bigger kmer found more similar sequences
Hi developers,
I am recently use ModDotPlot to visualize CHM13 chr1, and I found a curious thing.
moddotplot static -f CHM13#chr1.fa -o CHM13#chr1 -k 50 -id 90
moddotplot static -f CHM13#chr1.fa -o CHM13#chr1_21 -k 21 -id 90
With bigger kmer but found more similar sequences, this looks doesn't make sense, right?
Best, Shuo
@unavailable-2374 Thanks for bringing this up, and sorry for the late reply! The answer for why you're seeing more similar sequence with a higher k has to do with how ModDotPlot estimates sequence identity.
Equation 5 from our paper:
We estimate ANI based on a binomial distribution of k-mer survival. If we do some numbers to reach a 90% identity threshold: The containment index c(a,b) needs to be ~10% for k=21, but only ~0.4% for k = 51! For a window size of 16,000bp, this means only 64 k-mers need to be contained, so an occasional TE like Alu will trigger this threshold.
Essentially what this means is that, when using a high window size and a high kmer value, the error bounds are high. If we do the same experiment using a smaller window size (500 in this case), we see what you expect of being more selective:
k = 21:
k = 51:
Mash Screen (which uses the same ANI estimation) outputs error bounds to calculate optimal values of k and s (sketch size) for this situation. Perhaps I should also include that :)