tRNA taxonomy v226
New version of the tRNA taxonomy using GTDB v226. I ran a quick test and it worked fine. I can also see that the database version's check are working. I also noticed that in the original commit for the first and only version of the tRNA taxonomy database in anvi'o that we would make a README file showing how we generated the data. But, alas, no such README was ever committed.
I will write a short one explaining how I got the data and how I transformed it.
WELL. I just pushed the README.txt to the master branch 🤦
Content of the README:
We used the genome collection of GlobDB v226 (https://globdb.org), which itself contains all genome representatives of GTDB v226. GlobDB already provides contigs-databases with annotated tRNA, through the command anvi-scan-trnas.
We made an external-genomes-txt file with all the GTDB genomes present in GlobDB and used anvi-get-sequences-for-hmm-hits to extract the sequences for all tRNAs:
anvi-get-sequences-for-hmm-hits -e external-genomes.txt --hmm-sources Transfer_RNAs -o all_tRNAs.fa
The headers of that fasta file contains the information about the anticodon and the genome's id:
>Ile_GAT___Transfer_RNAs___4068027d3b59b533bad5e550e79d768a08c97ce5fbceb337f7a6171e bin_id:GCA_000008085|source:Transfer_RNAs|e_value:92.8|contig:hash112713c5_GCA_000008085_000000000001|gene_callers_id:hash112713c5_612|start:7000|stop:7074|length:74
Using that information, we can create as many fasta file as anticodon and only keep the genome's name as the defline:
mkdir -p ANTICODON_SEARCH_DATABASES
awk -v outdir="ANTICODON_SEARCH_DATABASES" '
/^>/ {
# parse anticodon and genome id from header
match($0, />[^_]+_([A-Z]{3})___.*bin_id:([^|]+)/, m)
anticodon = m[1]
genome = m[2]
# print sequence under genome ID
if (anticodon != "" && genome != "") {
fname = outdir "/" anticodon
print ">" genome >> fname
}
next
}
{
print >> fname
}
' all_tRNAs.fa
# gzip all outputs
gzip ANTICODON_SEARCH_DATABASES/*
Finally, we can use the taxonomy file provided in GlobDB to construct the table mapping the genome's name/accession to a taxonomy:
grep "^>" all_tRNAs.fa | sed -E 's/.*bin_id:([^|]+).*/\1/' | sort -u > genome_ids.txt
grep -Ff genome_ids.txt globdb_r226_taxonomy.tsv > ACCESSION_TO_TAXONOMY.txt
gzip ACCESSION_TO_TAXONOMY.txt && rm genome_ids.txt
WELL. I just pushed the README.txt to the master branch 🤦
LOL 😂
Thank you very much for this, @FlorianTrigodet!
I think it would be great if we could translate your README.txt into something that could go into the help docs for anvi-estimate-trna-taxonomy under a section, say, "Notes for Advanced Users and Programmers" where we describe how the underlying data for this can be updated.
Plus, what do you mean by that you tested and it worked? Does it mean you run the old and new database on a few contigs-db files do you get reasonably comparable results? :)
I ran the old an new version on a metagenomic assembly (human gut) and from a quick glance at a few dozen annotations (out of >1200 annotation for AUG) the results are relatively comparable. I cannot do a simple diff because GTDB changed quite a few phylum name since v89. Here is the file that compare the previous and new annotation for each tRNA CAT (start codon AUG): tRNA_CAT_v89_vs_v226.txt
As for moving the readme to the documentation, should we move it to anvi-setup-trna-taxonomy or anvi-run-trna-taxonomy?
I had suggested anvi-estimate-trna-taxonomy, but I think anvi-setup-trna-taxonomy would be much better to keep that information I guess :)