Missing species in inspect file after database build
Hi Kraken2 developers/community,
I recently built a large Kraken2 database with genomes from the NCBI RefSeq database. I added genomes regardless of assembly level and limited the database to 1 assembly per species. After some testing, I discovered that some species seem to be missing from the database. Can someone tell me why this is?
I wanted to make sure everything was added correctly after the build, so I ran the inspect command and compared the taxids in the seqid2taxid.map file to the ones in the inspect file. 897 taxids in the seqid2taxid.map file were not present in the inspect file. I did some digging, and it seems those species were not added because they didn't have any unique minimizers. Can anyone confirm this?
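For anyone who wants to reproduce the comparison, here is a minimal sketch in Python. It assumes that seqid2taxid.map is tab-separated with the taxid in the second column, and that the inspect output is in the default report-style layout with the taxid in the fifth tab-separated column and optional header lines starting with `#`. The function names are my own and are not part of Kraken2.

```python
def taxids_from_seqid_map(lines):
    """Collect taxids from seqid2taxid.map lines (assumed layout: seqid<TAB>taxid)."""
    taxids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            taxids.add(fields[1].strip())
    return taxids


def taxids_from_inspect(lines):
    """Collect taxids from inspect output (assumed: taxid is the 5th tab-separated column)."""
    taxids = set()
    for line in lines:
        # Skip blank lines and header lines such as "# Database options: ..."
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            taxids.add(fields[4].strip())
    return taxids


def missing_taxids(map_lines, inspect_lines):
    """Return taxids present in the map but absent from the inspect output."""
    missing = taxids_from_seqid_map(map_lines) - taxids_from_inspect(inspect_lines)
    return sorted(missing, key=int)  # assumes taxids are numeric


# Example usage (file names are placeholders):
# with open("seqid2taxid.map") as m, open("inspect.txt") as i:
#     print(len(missing_taxids(m, i)))
```

Both readers take any iterable of lines, so open file handles can be passed directly without loading the whole file into memory.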
This is the number of species missing per NCBI division:
- Bacteria: 598
- Invertebrates: 37
- Phages: 16
- Plants and Fungi: 206
- Rodents: 1
- Vertebrates: 36
- Viruses: 3
Notes from further investigating particular missing species
- Invertebrates - Acropora genus
  - 6 missing species in the genus (all mitos)
  - 16 species in the database (2 wg, 14 mito)
- Rodents - Mus musculus domesticus
  - added to seqid2taxid.map properly
  - missing 1 mito (domesticus)
  - the mito in the inspect file with the fewest minimizers has 19 minimizers
  - In DB: Mus musculus has a wg and 3 mitos from subspecies other than domesticus
- Vertebrates:
  - Kali and Betta genera
    - kept 1 mitogenome from 1 species in the genus (Betta: also 1 wg)
  - Some hybrids removed
  - Vipera berus - mito removed - no other genomes in the genus, but other genomes in the family
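To illustrate the hypothesis above, here is a toy model in Python of how a taxon could end up with zero minimizers if every minimizer is stored under the lowest common ancestor (LCA) of all taxa that contain it. This is a simplified sketch, not Kraken2's actual build code. The taxonomy, the taxon names (`species`, `subsp_A`, `subsp_B`), and the minimizer labels are made up for illustration.

```python
from collections import Counter


def lca(parent, a, b):
    """Lowest common ancestor of taxa a and b in a child -> parent mapping."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b


def direct_minimizer_counts(parent, genomes):
    """Assign each minimizer to the LCA of every taxon containing it,
    then count how many minimizers each taxon holds directly."""
    table = {}
    for taxon, minimizers in genomes.items():
        for m in minimizers:
            table[m] = taxon if m not in table else lca(parent, table[m], taxon)
    return Counter(table.values())


# Hypothetical taxonomy: two subspecies under one species.
parent = {"root": None, "species": "root",
          "subsp_A": "species", "subsp_B": "species"}

# Hypothetical minimizer sets: subsp_B has nothing that is unique to it.
genomes = {
    "species": {"m1", "m2", "m3"},
    "subsp_A": {"m1", "m2", "m5"},
    "subsp_B": {"m1", "m2"},
}

counts = direct_minimizer_counts(parent, genomes)
for taxon in ("species", "subsp_A", "subsp_B"):
    print(taxon, counts.get(taxon, 0))
```

In this toy case `subsp_B` is a leaf with no minimizers of its own, so every one of its minimizers is stored under `species`. If the inspect output only lists taxa with a non-zero minimizer count, a taxon like this would not show up, which would be consistent with the missing domesticus mito.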