bakta icon indicating copy to clipboard operation
bakta copied to clipboard

`TranslationError` with user-provided regions (`--regions`) for selenocysteine proteins (TGA codon)

Open refack opened this issue 2 months ago • 5 comments

Bug Description

I am annotating a Pseudomonas aeruginosa strain (BWH056) that translates the TGA codon to Selenocysteine.

When I provide a pre-annotated GenBank file (using --regions) that contains a CDS with an in-frame TGA codon, Bakta fails during the validation step.

It seems the validation logic for user-provided CDS features does not apply the same selenocysteine detection logic that the main Bakta prediction pipeline uses (which was discussed in Issue #93). Instead, it appears to strictly apply translation table 11, finds the TGA, and incorrectly flags it as a premature stop codon.

Expected Behavior

Bakta should successfully validate the user-provided CDS, recognizing that the in-frame TGA codon is intended to be translated as Selenocysteine, especially when the necessary SECIS elements are present in the genome.

Actual Behavior

Bakta exits with a Bio.Data.CodonTable.TranslationError.

Traceback

17:01:47.927 - ERROR - CDS - user-provided CDS: regions/features file GenBank format not valid!
Traceback (most recent call last):
  File "/home/refack/miniforge3/envs/bakta/lib/python3.11/site-packages/bakta/features/cds.py", line 322, in import_user_cdss
    aa = str(Seq(nt).translate(table=cfg.translation_table, cds=True))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/refack/miniforge3/envs/bakta/lib/python3.11/site-packages/Bio/Seq.py", line 1620, in translate
    _translate_str(str(self), table, stop_symbol, to_stop, cds, gap=gap)
  File "/home/refack/miniforge3/envs/bakta/lib/python3.11/site-packages/Bio/Seq.py", line 2898, in _translate_str
    raise CodonTable.TranslationError(
Bio.Data.CodonTable.TranslationError: Extra in frame stop codon 'TGA' found.

Steps to Reproduce

Download the RefSeq genome (FASTA) and corresponding RefSeq annotation (GenBank or GFF format) for Pseudomonas aeruginosa BWH056, assembly GCF_000629465.1.

The annotation file (e.g., GCF_000629465.1_cds_from_genomic.fna or the full GenBank file) will contain CDS features with in-frame TGA codons.

Run Bakta using the --regions flag to provide this annotation file.

# (Example command, filenames may vary)
bakta --db /path/to/db \
      --regions GCF_000629465.1_genomic.gbff \
      --genome GCF_000629465.1_genomic.fna

refack avatar Nov 14 '25 22:11 refack

For posterity: Gemini (2.5 pro) and I had the same idea for a workaround; just delete this gene and CDS from a local version of GCF_000629465.1_genomic.gbff

refack avatar Nov 14 '25 22:11 refack

Hi @refack thanks for reaching out and this detailed description. Indeed, this should be accepted by Bakta. I'll add it to my ToDos and try to come up with a fix for this in the upcoming v1.12.0.

oschwengers avatar Nov 26 '25 14:11 oschwengers

OK, I had a deeper look at/thought on it. The key issues here is, that at the point of importing the user-provided CDS regions, Bakta has not yet any information about SECIS features. Thus, it cannot be sure if this is indeed a selenocysteine-using CDS or just a falsely provided CDS coordinate.

We could simply define that in cases of user-provided in-frame TGAs, Bakta expects that this is provided on purpose and handles those as non-stop but selenocysteine codons. Otherwise, the internal stop codon would cause issues downstream in the alignment/functional annotation processes.

However, this might also create some false-positive SEL proteins...

Is there a reason why you provide those regions? Are those selenocysteine-proteins not properly detected & annotated by Bakta without the reference annotation?

oschwengers avatar Nov 27 '25 10:11 oschwengers

The RefSeq of GCF_000629465.1 has this annotation In the gbff:

                     /transl_except=(pos:complement(746613..746615),aa:Sec)

in the gff:

NZ_KK213203.1	Protein Homology	CDS	744123	747203	.	-	0	ID=cds-WP_010895684.1;Parent=gene-V557_RS21465;Dbxref=GenBank:WP_010895684.1;Name=WP_010895684.1;Ontology_term=GO:0015942,GO:0008863,GO:0009055,GO:0043546,GO:0047111,GO:0009326;gbkey=CDS;gene=fdnG;go_component=formate dehydrogenase complex|0009326||IEA;go_function=formate dehydrogenase (NAD+) activity|0008863||IEA,electron transfer activity|0009055||IEA,molybdopterin cofactor binding|0043546||IEA,formate dehydrogenase (cytochrome-c-553) activity|0047111||IEA;go_process=formate metabolic process|0015942||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_003933614.1;locus_tag=V557_RS21465;product=formate dehydrogenase-N subunit alpha;protein_id=WP_010895684.1;transl_except=(pos:complement(746613..746615)%2Caa:Sec);transl_table=11

So the data is there 🤷

Also the /translation=" tag has a U in that position.

refack avatar Nov 27 '25 21:11 refack

BTW: have you considered a GFF3 parser (scikit.bio is OKish), or parsing .gbff files with biopython?

refack avatar Nov 27 '25 21:11 refack