Struo2
More detail on classification % and taxonomy issue for Kraken2
Hi again! My supervisor and I have been troubleshooting our efforts to update the pre-made GTDB_release207 custom database with a selection of genomes from JGI GOLD (see previous issue #29). Following Nick’s suggestion, I’m planning to use fastANI to estimate the average nucleotide identity (ANI) between the genomes in our JGI GOLD TSV and the genomes in GTDB_release207, but we have a couple of reasons to think we have a separate problem.
We actually expect relatively high ANI, because we’re focusing on expanding subspecies-level diversity in the database rather than adding new lineages. However, of the ~4,000 new genomes we’ve added, none were identified in the read data we’re using. After adding our new genomes, the only change we see is in which existing database genomes the reads are classified to; no reads are classified to our newly added genomes.
We believe the pipeline and our input TSV for the update are working properly, since we tested them by using the same JGI input to update the GTDBr95_n10 toy database and saw the classification percentage of our read data increase (from small fractions of 1% to between 5% and 10%, depending on the read data we were classifying). In that test we do see reads classified to the added JGI taxa, which is encouraging, but we’re still seeing some things we can’t make heads or tails of.
For instance, when we look at the Kraken2 and Bracken output and report files from runs against either the test (toy + JGI) or the experimental (GTDB + JGI) database, most of the rows in those files refer to tax IDs that don’t exist in the database. In some cases, I think this is because the tax IDs are hashed as Struo2 adds the new genomes to the database, but I’m not sure why this would happen for some tax IDs and not others.
The main things we’d like advice on, then, are how to link these unidentified tax IDs back to their source taxonomies, and how to check that the pre-made GTDB_release207 taxonomy is merging properly with the taxonomies in our input TSV. I can send over the input TSV we’ve been using (as well as any other relevant files) as needed. Thanks!
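In case it helps to make the tax ID problem concrete, here is a minimal sketch of the kind of check we have in mind: list the tax IDs in a Kraken2-style report that have no scientific name in the database's `names.dmp`. It assumes the standard six-column tab-separated report layout (percentage, clade reads, direct reads, rank code, tax ID, name) and the NCBI-style `names.dmp` layout with `|` separators. The function names are our own and are not part of Struo2 or Kraken2.

```python
def parse_names_dmp(lines):
    """Map tax ID -> scientific name from names.dmp-style lines."""
    names = {}
    for line in lines:
        # names.dmp fields: tax ID | name | unique name | name class |
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names


def unknown_report_taxids(report_lines, names):
    """Return (tax_id, label, clade_reads) for report rows absent from names.

    Assumes the standard 6-column Kraken2 report layout; reports written
    with extra minimizer columns would need different column indices.
    """
    missing = []
    for line in report_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6 or not cols[1].strip().isdigit():
            continue  # skip malformed or header-like rows
        tax_id, label = cols[4].strip(), cols[5].strip()
        if tax_id == "0":
            continue  # tax ID 0 is the "unclassified" row
        if tax_id not in names:
            missing.append((tax_id, label, int(cols[1])))
    return missing
```

Both functions take any iterable of lines, so open file handles for the report and for whichever `names.dmp` the database was built from can be passed in directly. The labels returned alongside each unmatched tax ID may be enough to trace it back to a row in the input TSV.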
Also, as a side note, we’re wondering if you have a copy of the sample TSV that was used to generate the pre-made GTDB_release207 database. There seem to be more total genomes in the ar53_metadata_r207.tsv and bac120_metadata_r207.tsv files than in the pre-made database itself, so we’re guessing those files were combined and filtered to generate the original TSV.
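For reference, this is roughly how we would compare the two metadata files against a sample TSV once we have one. It is only a sketch under assumptions: the GTDB metadata files are read via their `accession` column, the `RS_`/`GB_` prefixes are stripped so the accessions can be matched against plain assembly accessions, and the sample TSV column name is a placeholder we made up. The optional `keep` filter is there because we don't know which filter (if any) was actually applied.

```python
import csv


def normalize(accession):
    """Strip the GTDB 'RS_'/'GB_' prefix so accessions compare directly."""
    accession = accession.strip()
    for prefix in ("RS_", "GB_"):
        if accession.startswith(prefix):
            return accession[len(prefix):]
    return accession


def read_accessions(lines, column="accession", keep=None):
    """Collect normalized accessions from TSV lines, optionally filtered."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    accessions = set()
    for row in reader:
        if keep is not None and not keep(row):
            continue
        accessions.add(normalize(row[column]))
    return accessions


def compare_accessions(metadata_sources, sample_lines, sample_column):
    """Compare the union of metadata accessions with a sample TSV."""
    metadata = set()
    for lines in metadata_sources:
        metadata |= read_accessions(lines)
    samples = read_accessions(sample_lines, column=sample_column)
    return {
        "shared": metadata & samples,
        "only_in_metadata": metadata - samples,
        "only_in_samples": samples - metadata,
    }
```

If the pre-made database was restricted to representative genomes (a guess on our part), passing something like `keep=lambda row: row["gtdb_representative"] == "t"` to `read_accessions` would be one way to test that, assuming that column is present in the metadata files.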