gffread icon indicating copy to clipboard operation
gffread copied to clipboard

CDS gets filtered out and its attributes overwrite attributes for another CDS in the same gene

Open hermidalc opened this issue 2 years ago • 2 comments

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/545/GCA_000845545.1_ViralProj14434/GCA_000845545.1_ViralProj14434_genomic.gff.gz

When I run with -F --keep-exon-attrs to show what happens, these lines:

AF234533.1	Genbank	gene	1405	2262	.	+	.	ID=gene-P;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding
AF234533.1	Genbank	mRNA	1405	2262	.	+	.	ID=rna-P;Parent=gene-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1	Genbank	exon	1405	2262	.	+	.	ID=exon-P-1;Parent=rna-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1	Genbank	CDS	1415	2251	.	+	0	ID=cds-AAG10410.1;Parent=rna-P;Dbxref=NCBI_GP:AAG10410.1;Name=AAG10410.1;gbkey=CDS;gene=P;product=polymerase-associated protein P;protein_id=AAG10410.1
AF234533.1	Genbank	CDS	1692	1838	.	+	0	ID=cds-AAG10411.1;Parent=rna-P;Dbxref=NCBI_GP:AAG10411.1;Name=AAG10411.1;gbkey=CDS;gene=P;product=protein P';protein_id=AAG10411.1

get converted into this below, where CDS 1692..1838 get filtered out for some reason, and its attributes overwrite CDS 1405..2262 which is for a different protein (same gene):

AF234533.1	Genbank	mRNA	1405	2262	.	+	.	ID=rna-P;geneID=gene-P;gene_name=P;gbkey=mRNA;gene=P;product=polymerase-associated protein P;Name=P;gene_biotype=protein_coding
AF234533.1	Genbank	exon	1405	2262	.	+	.	Parent=rna-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1	Genbank	CDS	1415	2251	.	+	0	Parent=rna-P;Dbxref=NCBI_GP:AAG10411.1;Name=AAG10411.1;gbkey=CDS;gene=P;product=protein P';protein_id=AAG10411.1

hermidalc avatar Nov 16 '22 18:11 hermidalc

And in the same file again here:

AF234533.1	Genbank	gene	6816	7453	.	+	.	ID=gene-alpha;Name=alpha;gbkey=Gene;gene=alpha;gene_biotype=protein_coding
AF234533.1	Genbank	mRNA	6816	7453	.	+	.	ID=rna-alpha;Parent=gene-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1	Genbank	exon	6816	7453	.	+	.	ID=exon-alpha-1;Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1	Genbank	CDS	6824	7090	.	+	0	ID=cds-AAG10415.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1	Genbank	CDS	7092	7442	.	+	0	ID=cds-AAG10416.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10416.1;Name=AAG10416.1;gbkey=CDS;gene=alpha;product=alpha 2 protein;protein_id=AAG10416.1
AF234533.1	Genbank	CDS	7166	7321	.	+	0	ID=cds-AAG10417.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1

turns into this where again a CDS disappears and its attributes overwrite another CDS for a different protein (same gene):

AF234533.1	Genbank	mRNA	6816	7453	.	+	.	ID=rna-alpha;geneID=gene-alpha;gene_name=alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein;Name=alpha;gene_biotype=protein_coding
AF234533.1	Genbank	exon	6816	7453	.	+	.	Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1	Genbank	CDS	6824	7090	.	+	0	Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1	Genbank	CDS	7092	7442	.	+	0	Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1

hermidalc avatar Nov 16 '22 18:11 hermidalc

interesting - thank you for these reports related to parsing failures of this viral annotation.

In this particular case I can see that some CDSs are dropped when they are found to be "contained" in another CDS or having large overlaps with another CDS from the same transcript (so it's not a programmed ribosomal shift, which would be part of the same CDS).

It seems the error stems from the assumption that there can be at most one CDS (one chain of CDS segments) per transcript ID (i.e. one protein per transcript), which is clearly not the case in this annotation -- each CDS segment here, even though parented by the same transcript ID, seems to be a distinct coding sequence and thus leading to different protein products (!).

Currently the transcript data structure I am using only keeps track of one CDS segment chain per transcript, and changing that will be quite impactful in other downstream code I am using. Perhaps the easiest/least impactful workaround would be for the parser to emit two transcripts in such cases (and thus treat these distinct CDSs as distinct transcript "isoforms" from the same gene..). This workaround might actually be useful with other bioinformatics software that may have also assumed "one transcript => one protein".

gpertea avatar Nov 16 '22 19:11 gpertea