Struo2
GTDB 207 Kraken db vs maxikraken2_1903_140GB db classification rate
Hi,
I am having trouble understanding the differences in classification rates between your GTDB 207 release Kraken database and the widely used maxikraken db from 2019, which is roughly half the size. I am classifying ~150 human stool sample metagenomes with kraken2 (2.1.2), using a confidence score of 0.75 and default parameters otherwise, and I am consistently getting an unclassified rate roughly 10 percentage points higher with the GTDB database. This seems to stem from a higher classification rate for bacteria with the maxikraken db. On the other hand, I do get substantially higher sensitivity for Archaea with the GTDB one.
Example (only highest levels):
GTDB 207:
37.77 12279770 12279770 U 0 unclassified
62.23 20227993 68658 R 1 root
62.01 20159210 3273025 D 609216830 Bacteria
0.00 125 0 D 2587168575 Archaea
maxikraken:
26.57 8637310 8637310 U 0 unclassified
73.43 23870453 5007 R 1 root
73.19 23793211 1103748 D 2 Bacteria
0.00 68 0 D 2157 Archaea
0.00 68 0 D 2759 Eukaryota
0.00 257 0 D 10239 Viruses
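For reference, the gap between the two reports can be computed directly from the top-level lines. This is only a minimal sketch, assuming the standard six-column Kraken 2 report layout (percentage, clade reads, direct reads, rank code, taxid, name). The helper name `top_level` is hypothetical, and the values are copied from the excerpts above:

```python
def top_level(report_text):
    """Map taxon name -> (percent, clade_reads) for each report line."""
    out = {}
    for line in report_text.strip().splitlines():
        # Real reports are tab-separated; splitting on any whitespace also
        # handles the space-separated excerpts pasted in this post.
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        out[name.strip()] = (float(pct), int(clade))
    return out


gtdb = top_level("""
37.77 12279770 12279770 U 0 unclassified
62.01 20159210 3273025 D 609216830 Bacteria
""")
maxi = top_level("""
26.57 8637310 8637310 U 0 unclassified
73.19 23793211 1103748 D 2 Bacteria
""")

for name in ("unclassified", "Bacteria"):
    diff_pct = gtdb[name][0] - maxi[name][0]
    diff_reads = gtdb[name][1] - maxi[name][1]
    print(f"{name}: {diff_pct:+.2f} percentage points, "
          f"{diff_reads:+d} reads (GTDB minus maxikraken)")
```

The extra unclassified reads with GTDB and the missing Bacteria reads are almost the same size, which is what makes me think the difference is essentially all bacterial.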
I am confused as to why that is. I could understand that, given the much higher information content in the GTDB db, some classifications would be 'pushed' higher in the taxonomic hierarchy at the confidence threshold used, since with more data some k-mers are no longer specific to a taxon at that rank. But since in my case the reads aren't even pushed up to the root node but end up unclassified, it seems to me that quite a few k-mers are entirely missing from the GTDB db but present in the maxikraken one. Is this expected?
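To make that reasoning concrete, here is a minimal sketch of confidence scoring as I understand it from the Kraken 2 manual: a label's score is the fraction of a read's k-mers that hit taxa inside that label's clade, and the label moves up to its parent until the score reaches the threshold, or the read becomes unclassified if even the root fails. This is not Kraken 2's actual code. The function name `classify`, the toy taxonomy, and the k-mer counts are all made up for illustration:

```python
def classify(kmer_hits, parent, initial_call, threshold):
    """Return the final taxid for one read, or 0 if it ends up unclassified.

    kmer_hits: taxid -> number of the read's k-mers assigned to that taxid
               (taxid 0 means the k-mer is not in the database at all).
    parent:    taxid -> parent taxid, with the root mapping to None.
    """
    total = sum(kmer_hits.values())
    if total == 0:
        return 0

    def in_clade(taxid, node):
        # Walk up from taxid; it is in node's clade if we pass through node.
        while taxid is not None:
            if taxid == node:
                return True
            taxid = parent.get(taxid)
        return False

    node = initial_call
    while node is not None:
        clade_hits = sum(n for t, n in kmer_hits.items()
                         if t != 0 and in_clade(t, node))
        if clade_hits / total >= threshold:
            return node
        node = parent.get(node)  # score too low: try the parent label
    return 0  # even the root did not reach the threshold


# Toy taxonomy (hypothetical ids): 1 = root, 2 = Bacteria, 10 = a genus.
parent = {1: None, 2: 1, 10: 2}

# 10% of k-mers missing from the db: the read is pushed up to Bacteria.
pushed_up = classify({10: 50, 2: 30, 1: 10, 0: 10}, parent, 10, 0.75)

# 35% of k-mers missing from the db: not even the root reaches 0.75.
unclassified = classify({10: 50, 2: 10, 1: 5, 0: 35}, parent, 10, 0.75)

print(pushed_up, unclassified)
```

If this picture is right, a read can only end up unclassified at 0.75 when more than 25% of its k-mers have no hit at all in the database, which is why I suspect missing k-mers rather than lost specificity.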
Best,
Oskar