ampliseq icon indicating copy to clipboard operation
ampliseq copied to clipboard

Accessing 16S gene identifiers

Open annotatebio opened this issue 10 months ago • 13 comments

Description of feature

As one of the input database files (silva_species_assignment_v138.1.fa.gz) stores an information about the 16S gene primary accession in GenBank database, I'm looking for the best way to track back those accessions next to the taxonomy assigned to an ASV.

Is there already a solution for tracking that, or could it be implemented?

Thanks in advance!

annotatebio avatar Apr 12 '24 14:04 annotatebio

Hi there,

I actually never looked into that. Therefore I am not aware of a solution or whether it could be implemented. Maybe someone else could chime in.

For more context, could you describe why that would help you/what you would gain? E.g. whats your usecase? Might that be of more general interest as well?

d4straub avatar Apr 15 '24 09:04 d4straub

Thanks for the swift reply! The reason for the request is that while comparing 16S results from databases like SILVA, with whole genome-based methods (with NCBI Taxonomy used), one stumbles upon a problem of outdated taxonomic labels of SILVA and discrepansies between OTUs and genomic taxonomy.

Since SILVA stores information on 16S gene primary accession in GenBank (and from what I see some of the ampliseq database files do that too), it is possible to use it for finding what's the NCBI taxonomy assigned to the gene - which is likely more up to date and in line with whole genomes' taxonomy.

annotatebio avatar Apr 15 '24 09:04 annotatebio

I see. The outdated taxonomies could be probably improved that way. Regarding comparing it with whole genome-based methods, one could either use GTDB via --dada_ref_taxonomy gtdb (the corresponding shotgun metagenomics assembly classifier can be used in e.g. nf-core/mag) or Kraken2 with the standard database using --kraken2_ref_taxonomy standard (which seems to work just fine in preliminary benchmarks).

d4straub avatar Apr 15 '24 10:04 d4straub

Great recommendations!

I've launched GTDB-based classification with DADA2, however I see some crucial taxa are not detected further than phylum/class/family level, even though genus/species level gets assigned to the same ASV with SILVA. It seems to me that the GTDB database have a different content on the sequence level than the default SILVA, and hence the classification results differ?

Would it be the same case for Kraken2 database?

annotatebio avatar Apr 18 '24 09:04 annotatebio

Databases are non-trivial to compare, so if you do not find a "crucial" taxa, turn to another one. Another reason could be that classification with DADA2 is not always same, some taxa close to cutoffs can fall below or raise above said cutoff because of tiny number alterations (that are outside of my control, fixed seed doesnt help). So running the same taxonomic classification with DADA2 multiple times can lead to missing or added taxonomic levels when around the confidence threshold.

Would it be the same case for Kraken2 database?

I dont know, you would need to test.

d4straub avatar Apr 18 '24 09:04 d4straub

IMO, "sbdi-gtdb" is better than "gtdb" as we know there are rRNA-sequences in the GTDB collection that are assigned to the wrong species. "sbdi-gtdb" is phylogenetically vetted to remove these.

erikrikarddaniel avatar Apr 18 '24 13:04 erikrikarddaniel

Thanks for all the suggestions - indeed I gave --kraken2_ref_taxonomy standard and --dada_ref_taxonomy sbdi-gtdb a go and while the latter definitely brought some of the results closer on the genus level, phylum level taxonomy still seems to be a jungle in comparison to whole-genome GTDB: sometimes I see Bacteroidota, but sometimes I see Firmicutes. Proteobacteria should be Pseudomonadota so this label is probably also not up to date.

I didn't expect it to be such a challenge to benchmark the technologies, looks like it requires a lot of manual research to map the taxonomic labels correspondence, otherwise while plotted one next to another the data looks like the results were completely different.

annotatebio avatar Apr 19 '24 09:04 annotatebio

I am benchmarking currently, hope that will shed some light on it.

d4straub avatar Apr 19 '24 09:04 d4straub

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

annotatebio avatar Apr 19 '24 09:04 annotatebio

Thank you so much Daniel!

Now that I think about it, it may be a matter of different GTDB versions? For full-genome methods, we use the latest 214 release. As far as I can see, SBDI is tied to 207 release, which means I see Bacillota_A in shotgun, but Firmicutes_A in ampliseq results - the major phyla names change is probably not accounted for in v 207?

That's it. I'm working on SBDI-GTDB 08RS214, and soon release 09 (when that's released, likely in late April).

erikrikarddaniel avatar Apr 19 '24 09:04 erikrikarddaniel

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

annotatebio avatar Apr 19 '24 09:04 annotatebio

That's precious. If I set up the repository to track the releases, will it be enough to be notified when it becomes available?

New releases of databases are included in new releases of the pipeline itself, so yes. Hopefully, I'm done with the next release in time for Ampliseq 2.10.

erikrikarddaniel avatar Apr 19 '24 10:04 erikrikarddaniel

That would be fantastic - thanks for all the work, I'm hitting the Watch button then :) Good luck!

annotatebio avatar Apr 19 '24 11:04 annotatebio