
mRNA created for misc_RNA or tRNA or RNA and reported as duplicated by the ENA validator

Open Jeepee8820 opened this issue 4 years ago • 3 comments

Hi tseemann,

Thanks for the great tool. I am having trouble validating files for submission to ENA after converting them with EMBLmyGFF3, and the problem seems to come from the output generated by Prokka. The misc_RNA, tRNA and other RNA records (perhaps other feature types too) are also assigned an mRNA at exactly the same location. This produces duplicated features, which the ENA validation tool complains about. Please see the last posts in this thread for more details: https://github.com/NBISweden/EMBLmyGFF3/issues/33

Do you see any possibility of fixing this issue? Thanks in advance.

Jeepee8820 • Jun 25 '20 10:06

Bacterial genomes should not have mRNA features anyway; they only use gene and CDS. Don't use the --addmrna switch.

The chance of a Prokka-annotated genome being accepted by ENA is close to 0%. They became very strict a few years back. The preferred process is to annotate with PGAP and submit to NCBI, or just submit the contigs to NCBI and tick the box to let them annotate with PGAP.

But you are right that those RNA features should not get an mRNA when --addmrna is used. This is a bug.
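Until this is fixed, the redundant records can be filtered out of the GFF3 before conversion. The following is only a minimal sketch, not part of Prokka or EMBLmyGFF3: the function name `drop_redundant_mrna` and the `NONCODING_TYPES` set are hypothetical, and it assumes that a redundant mRNA has exactly the same sequence ID, start, end and strand as the non-coding RNA feature it duplicates.

```python
from typing import Iterable, List

# Feature types treated as non-coding RNAs. This set is an assumption;
# adjust it to the feature types present in your own GFF3.
NONCODING_TYPES = {"tRNA", "rRNA", "tmRNA", "misc_RNA", "ncRNA"}


def drop_redundant_mrna(gff_lines: Iterable[str]) -> List[str]:
    """Return the GFF3 lines without mRNA features that share a location
    (seqid, start, end, strand) with a non-coding RNA feature."""
    lines = [line.rstrip("\n") for line in gff_lines]

    # First pass: collect the locations of all non-coding RNA features.
    noncoding_locations = set()
    for line in lines:
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # directive, comment, FASTA or malformed line
        if cols[2] in NONCODING_TYPES:
            noncoding_locations.add((cols[0], cols[3], cols[4], cols[6]))

    # Second pass: keep everything except mRNAs at those locations.
    kept = []
    for line in lines:
        cols = line.split("\t")
        is_feature = not line.startswith("#") and len(cols) == 9
        if is_feature and cols[2] == "mRNA":
            if (cols[0], cols[3], cols[4], cols[6]) in noncoding_locations:
                continue
        kept.append(line)
    return kept
```

A typical use would be to read the Prokka `.gff` file, pass its lines to `drop_redundant_mrna`, and write the result to a new file before running the converter. The sketch does not rewrite `Parent=` attributes, so if any feature refers to a removed mRNA as its parent, that link has to be repaired separately.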

tseemann • Jun 25 '20 22:06

Hello,

First, I do not agree: prokaryotes do have mRNA. If the mRNA features are not present, processing the data with agat_sp_fix_features_locations_duplicated.pl to remove the duplicated locations will add them. Second, EMBLmyGFF3 is perfectly suitable for converting a Prokka annotation into an EMBL file, which is the file format submitted to ENA.

Juke34 • Sep 26 '20 08:09

I second Juke34 here for two reasons:

  1. Removing mRNAs from prokaryote annotations breaks interoperability with eukaryote GFFs in downstream software; it is poor file format standardization. GFF is a file format and should be consistent regardless of the kingdom.
  2. From a biological standpoint, these genes are transcribed and translated, and I think mRNA is a suitable feature type for that since it already exists. Yes, most of the time it is redundant with the gene feature, but it is necessary: having no RNA does not reflect the biology. In eukaryote GFFs, not having an mRNA means the gene isn't transcribed.

TL;DR: not including mRNAs is bad for file format standardization across kingdoms and for representing the biology.

xonq • Sep 22 '22 16:09