krakenuniq
krakenuniq copied to clipboard
Does krakenuniq ever suppress (countable) kmers from reads?
Dear pioneers for much improved pathogen detection,
we have an example from nanopore (resulting in long reads > 1500 bps) where the kraken1-part of krakenuniq counts over 12000 kmers (k = 31) for a species (taxid 47466). Our own analysis revealed that these reads probably belong to just 26 unique kmers. Thus, the counted kmers are most likely "trash" in a sense that they don't indicate the presence of a corresponding pathogen. BUT: "47466" does not even occur in the corresponding krakenuniq report!
We assume there is a kind of unbuilt filter in krakenuniq to suppress such findings - or is it maybe a bug? Could you briefly explain under which conditions you keep or conversely filter out counted kmers (if so) in krakenuniq reports? (We could potentially hand over the corresponding fastq file if you suspected a bug.)
Thanks and best regards, Daniel
hi @pfeiferd, that simply should not happen: kraken1 and krakenuniq should have identical k-mer counts, as long as you used the same database. I looked it up and I see that taxid is Borrelia miyamotoi. There is no hidden or undocumented filter in krakenuniq. Those are probably low-complexity k-mers, but even so the krakenuniq output should include them. Did you use the identical database file each time you ran the 2 programs? If you built one database using 'dust' and the other without it, that would explain it. Dust masks out low-complexity sequences and kraken-build uses it by default.
Thank you so much! I was away, but I will look into it this week and tell what's going on...
Dear Steven,
I can reproduce the beviour as described above (which is supposedly a bug according to your answer). We / I can provide you with the necessary files via a personal FTP server (the fastq file is about 121MB). We used a standard krakenuniq database ("kuniq_standard_plus_eupath_minus_kdb").
You could get the following files:
- The fastq File (with nanopore reads > 1500 bps),
- the krakenuniq report (with tax id 47466 missing),
- the corresponding kraken1 output (with 1000s of hits regarding 47466),
- our own counting for 47466 according to the kraken1 output and for some other taxids...
Let me know what you think. The FTP-Server credentials would have to be transferred via email / Skype or so.
Daniel
I'm afraid I don't have time to debug this myself. I will ask a person in my lab about it, but it might take some time. It is possible that your reads are long enough that they got truncated by one program - although that isn't supposed to happen anyway, there's no read length limit. So I can't really guess, but we've never seen this behavior. If you send me email (rather than github) at [email protected], I will see if someone can take a look. All we really need, probably, is the fastq file. If it's not too big you can just email that to me. Presumably you just have to include some of the reads that match 47466.
Just sent a respective email... Thanks so much for looking into this.