krakenuniq

Question on threshold of unique-kmers/reads to assess false positives

Open josruirod opened this issue 2 years ago • 2 comments

Hi, I wanted to ask if anyone has a rule of thumb for assessing false positives in the final krakenuniq report. For example, I'm seeing what I believe to be a false positive in the report, with ~250k reads classified but only ~700 k-mers and a duplication of ~80k. I think there may be others I could filter out with these measures. From the paper:

False-positive identifications have few unique k-mers

the k-mer threshold should always be several times higher than the read count threshold

So for starters, could I discard results where unique k-mers < reads, or < reads/2, or something similar? Thanks for any comments.

josruirod avatar Aug 29 '22 08:08 josruirod

Yes, that's a heuristic that my lab uses regularly: the number of unique k-mers per read should be something like (read length - k-mer length), or at least close to that, if the reads are random samples of their genome. So if the number of k-mers is low, then I assume this is a false positive due to low complexity sequence or else to contamination of the genome; i.e., all the reads appear to be hitting the same place. For example, if I have 100 reads and only 100 (or fewer) k-mers from a genome, I don't believe it. It's not a hard and fast rule, but I usually want k-mer counts to be at least 10 times as high as read counts, and ideally I want much more than that. Note that this is for DNA. If you are sequencing RNA, then highly-expressed genes will distort the counts.

salzberg avatar Aug 29 '22 11:08 salzberg

Thanks for the useful comments! Indeed this is RNA-seq I'm dealing with... but your comments are helpful. I'll come up with some rule too and filter out the things that cannot be right, even if highly-expressed genes are there.

Thanks!

josruirod avatar Aug 29 '22 12:08 josruirod
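
For anyone landing here later, the k-mers-per-read heuristic discussed above could be sketched roughly as follows. This is only a minimal illustration, not part of krakenuniq: the function name `split_by_kmer_ratio` and the toy rows are made up, and it assumes a tab-separated report whose header row contains columns named `reads`, `kmers` and `taxName`, with `#` comment lines before it. Check your own report header before using it. The default cutoff of 10 k-mers per read is the soft rule of thumb from the reply above, and for RNA-seq the ratio will be distorted by highly expressed genes, so treat the output as a list of candidates to inspect rather than a verdict.

```python
import csv


def split_by_kmer_ratio(report_lines, min_kmers_per_read=10.0):
    """Split report rows into (passed, suspect) by unique k-mers per read.

    Assumes a tab-separated report with a header row naming the columns
    'reads', 'kmers' and 'taxName' (an assumption; verify against your file).
    """
    # Drop blank lines and '#' comment lines so the header row comes first.
    rows = [ln for ln in report_lines if ln.strip() and not ln.startswith("#")]
    passed, suspect = [], []
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        reads = int(row["reads"])
        kmers = int(row["kmers"])
        if reads == 0:
            continue  # nothing to judge for taxa without reads
        ratio = kmers / reads
        entry = (row["taxName"].strip(), reads, kmers, ratio)
        # Few unique k-mers relative to reads suggests all reads hit one spot.
        (passed if ratio >= min_kmers_per_read else suspect).append(entry)
    return passed, suspect


# Toy rows, invented for illustration. taxon_A mirrors the numbers in the
# question (~250k reads, ~700 k-mers); taxon_B is the "100 reads, 100 k-mers"
# case from the reply; taxon_C is a made-up well-supported hit.
TOY_REPORT = """\
# hypothetical report, not real data
%\treads\ttaxReads\tkmers\tdup\tcov\ttaxID\trank\ttaxName
71.4\t250000\t250000\t700\t80000\tNA\t1001\tspecies\t  taxon_A
0.03\t100\t100\t100\t1.0\tNA\t1002\tspecies\t  taxon_B
0.29\t1000\t1000\t85000\t1.1\tNA\t1003\tspecies\t  taxon_C
""".splitlines()

passed, suspect = split_by_kmer_ratio(TOY_REPORT)
for name, reads, kmers, ratio in suspect:
    print(f"suspect: {name} reads={reads} kmers={kmers} kmers/read={ratio:.4f}")
```

With the toy rows, taxon_A (700 / 250000 = 0.0028 k-mers per read) and taxon_B (1 k-mer per read) fall below the cutoff, while taxon_C (85 k-mers per read) passes. Lowering `min_kmers_per_read` to 1 or 0.5 would correspond to the looser "k-mers < reads" or "k-mers < reads/2" filters proposed in the question.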