anvio tRNA taxonomy v226

New version of the tRNA taxonomy using GTDB v226. I ran a quick test and it worked fine. I can also see that the database version's check are working. I also noticed that in the original commit for the first and only version of the tRNA taxonomy database in anvi'o that we would make a README file showing how we generated the data. But, alas, no such README was ever committed.

I will write a short one explaining how I got the data and how I transformed it.

Nov 10 '25 13:11 FlorianTrigodet

WELL. I just pushed the README.txt to the master branch 🤦

Nov 10 '25 13:11 FlorianTrigodet

Content of the README:

We used the genome collection of GlobDB v226 (https://globdb.org), which itself contains all genome representatives of GTDB v226. GlobDB already provides contigs-databases with annotated tRNA, through the command anvi-scan-trnas.

We made an external-genomes-txt file with all the GTDB genomes present in GlobDB and used anvi-get-sequences-for-hmm-hits to extract the sequences for all tRNAs:

anvi-get-sequences-for-hmm-hits -e external-genomes.txt --hmm-sources Transfer_RNAs -o all_tRNAs.fa

The headers of that fasta file contains the information about the anticodon and the genome's id:

>Ile_GAT___Transfer_RNAs___4068027d3b59b533bad5e550e79d768a08c97ce5fbceb337f7a6171e bin_id:GCA_000008085|source:Transfer_RNAs|e_value:92.8|contig:hash112713c5_GCA_000008085_000000000001|gene_callers_id:hash112713c5_612|start:7000|stop:7074|length:74

Using that information, we can create as many fasta file as anticodon and only keep the genome's name as the defline:

mkdir -p ANTICODON_SEARCH_DATABASES

awk -v outdir="ANTICODON_SEARCH_DATABASES" '
    /^>/ {
        # parse anticodon and genome id from header
        match($0, />[^_]+_([A-Z]{3})___.*bin_id:([^|]+)/, m)
        anticodon = m[1]
        genome = m[2]
        # print sequence under genome ID
        if (anticodon != "" && genome != "") {
            fname = outdir "/" anticodon
            print ">" genome >> fname
        }
        next
    }
    {
        print >> fname
    }
' all_tRNAs.fa

# gzip all outputs
gzip ANTICODON_SEARCH_DATABASES/*

Finally, we can use the taxonomy file provided in GlobDB to construct the table mapping the genome's name/accession to a taxonomy:

grep "^>" all_tRNAs.fa | sed -E 's/.*bin_id:([^|]+).*/\1/' | sort -u > genome_ids.txt
grep -Ff genome_ids.txt globdb_r226_taxonomy.tsv > ACCESSION_TO_TAXONOMY.txt
gzip ACCESSION_TO_TAXONOMY.txt && rm genome_ids.txt

Nov 10 '25 13:11 FlorianTrigodet

WELL. I just pushed the README.txt to the master branch 🤦

LOL 😂

Nov 10 '25 13:11 ivagljiva

Thank you very much for this, @FlorianTrigodet!

I think it would be great if we could translate your README.txt into something that could go into the help docs for anvi-estimate-trna-taxonomy under a section, say, "Notes for Advanced Users and Programmers" where we describe how the underlying data for this can be updated.

Nov 10 '25 15:11 meren

Plus, what do you mean by that you tested and it worked? Does it mean you run the old and new database on a few contigs-db files do you get reasonably comparable results? :)

Nov 10 '25 15:11 meren

I ran the old an new version on a metagenomic assembly (human gut) and from a quick glance at a few dozen annotations (out of >1200 annotation for AUG) the results are relatively comparable. I cannot do a simple diff because GTDB changed quite a few phylum name since v89. Here is the file that compare the previous and new annotation for each tRNA CAT (start codon AUG): tRNA_CAT_v89_vs_v226.txt

As for moving the readme to the documentation, should we move it to anvi-setup-trna-taxonomy or anvi-run-trna-taxonomy?

Nov 11 '25 08:11 FlorianTrigodet

I had suggested anvi-estimate-trna-taxonomy, but I think anvi-setup-trna-taxonomy would be much better to keep that information I guess :)

Nov 11 '25 09:11 meren