`TranslationError` with user-provided regions (`--regions`) for selenocysteine proteins (TGA codon)
Bug Description
I am annotating a Pseudomonas aeruginosa strain (BWH056) that translates the TGA codon to Selenocysteine.
When I provide a pre-annotated GenBank file (using --regions) that contains a CDS with an in-frame TGA codon, Bakta fails during the validation step.
It seems the validation logic for user-provided CDS features does not apply the same selenocysteine detection logic that the main Bakta prediction pipeline uses (which was discussed in Issue #93). Instead, it appears to strictly apply translation table 11, finds the TGA, and incorrectly flags it as a premature stop codon.
Expected Behavior
Bakta should successfully validate the user-provided CDS, recognizing that the in-frame TGA codon is intended to be translated as Selenocysteine, especially when the necessary SECIS elements are present in the genome.
Actual Behavior
Bakta exits with a Bio.Data.CodonTable.TranslationError.
Traceback
17:01:47.927 - ERROR - CDS - user-provided CDS: regions/features file GenBank format not valid!
Traceback (most recent call last):
File "/home/refack/miniforge3/envs/bakta/lib/python3.11/site-packages/bakta/features/cds.py", line 322, in import_user_cdss
aa = str(Seq(nt).translate(table=cfg.translation_table, cds=True))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/refack/miniforge3/envs/bakta/lib/python3.11/site-packages/Bio/Seq.py", line 1620, in translate
_translate_str(str(self), table, stop_symbol, to_stop, cds, gap=gap)
File "/home/refack/miniforge3/envs/bakta/lib/python3.11/site-packages/Bio/Seq.py", line 2898, in _translate_str
raise CodonTable.TranslationError(
Bio.Data.CodonTable.TranslationError: Extra in frame stop codon 'TGA' found.
Steps to Reproduce
Download the RefSeq genome (FASTA) and corresponding RefSeq annotation (GenBank or GFF format) for Pseudomonas aeruginosa BWH056, assembly GCF_000629465.1.
The annotation file (e.g., GCF_000629465.1_cds_from_genomic.fna or the full GenBank file) will contain CDS features with in-frame TGA codons.
Run Bakta using the --regions flag to provide this annotation file.
# (Example command, filenames may vary)
bakta --db /path/to/db \
--regions GCF_000629465.1_genomic.gbff \
--genome GCF_000629465.1_genomic.fna
For posterity: Gemini (2.5 pro) and I had the same idea for a workaround; just delete this gene and CDS from a local version of GCF_000629465.1_genomic.gbff
Hi @refack thanks for reaching out and this detailed description. Indeed, this should be accepted by Bakta. I'll add it to my ToDos and try to come up with a fix for this in the upcoming v1.12.0.
OK, I had a deeper look at/thought on it. The key issues here is, that at the point of importing the user-provided CDS regions, Bakta has not yet any information about SECIS features. Thus, it cannot be sure if this is indeed a selenocysteine-using CDS or just a falsely provided CDS coordinate.
We could simply define that in cases of user-provided in-frame TGAs, Bakta expects that this is provided on purpose and handles those as non-stop but selenocysteine codons. Otherwise, the internal stop codon would cause issues downstream in the alignment/functional annotation processes.
However, this might also create some false-positive SEL proteins...
Is there a reason why you provide those regions? Are those selenocysteine-proteins not properly detected & annotated by Bakta without the reference annotation?
The RefSeq of GCF_000629465.1 has this annotation In the gbff:
/transl_except=(pos:complement(746613..746615),aa:Sec)
in the gff:
NZ_KK213203.1 Protein Homology CDS 744123 747203 . - 0 ID=cds-WP_010895684.1;Parent=gene-V557_RS21465;Dbxref=GenBank:WP_010895684.1;Name=WP_010895684.1;Ontology_term=GO:0015942,GO:0008863,GO:0009055,GO:0043546,GO:0047111,GO:0009326;gbkey=CDS;gene=fdnG;go_component=formate dehydrogenase complex|0009326||IEA;go_function=formate dehydrogenase (NAD+) activity|0008863||IEA,electron transfer activity|0009055||IEA,molybdopterin cofactor binding|0043546||IEA,formate dehydrogenase (cytochrome-c-553) activity|0047111||IEA;go_process=formate metabolic process|0015942||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_003933614.1;locus_tag=V557_RS21465;product=formate dehydrogenase-N subunit alpha;protein_id=WP_010895684.1;transl_except=(pos:complement(746613..746615)%2Caa:Sec);transl_table=11
So the data is there 🤷
Also the /translation=" tag has a U in that position.
BTW: have you considered a GFF3 parser (scikit.bio is OKish), or parsing .gbff files with biopython?