With bigger kmer found more similar sequences

Open unavailable-2374 opened this issue 7 months ago • 1 comments

Hi developers,

I am recently use ModDotPlot to visualize CHM13 chr1, and I found a curious thing.

moddotplot static -f CHM13#chr1.fa -o CHM13#chr1 -k 50 -id 90

moddotplot static -f CHM13#chr1.fa -o CHM13#chr1_21 -k 21 -id 90

With bigger kmer but found more similar sequences, this looks doesn't make sense, right?

Best, Shuo

May 05 '25 20:05 unavailable-2374

@unavailable-2374 Thanks for bringing this up, and sorry for the late reply! The answer for why you're seeing more similar sequence with a higher k has to do with how ModDotPlot estimates sequence identity.

Equation 5 from our paper:

We estimate ANI based on a binomial distribution of k-mer survival. If we do some numbers to reach a 90% identity threshold: The containment index c(a,b) needs to be ~10% for k=21, but only ~0.4% for k = 51! For a window size of 16,000bp, this means only 64 k-mers need to be contained, so an occasional TE like Alu will trigger this threshold.

Essentially what this means is that, when using a high window size and a high kmer value, the error bounds are high. If we do the same experiment using a smaller window size (500 in this case), we see what you expect of being more selective:

k = 21:

k = 51:

Mash Screen (which uses the same ANI estimation) outputs error bounds to calculate optimal values of k and s (sketch size) for this situation. Perhaps I should also include that :)

Aug 20 '25 17:08 alexsweeten