gffread icon indicating copy to clipboard operation
gffread copied to clipboard

How is mixed-strand trans-splicing handled?

Open fruce-ki opened this issue 6 years ago • 4 comments

I've been using an old version of gffread bundled with cufflinks 2.2.1 to create the transcript sequences given a GTF and genome FASTA. Recently I stumbled on trans-splicing from transcripts originating in opposite strands and realised that gffread was creating two separate sequences out of such annotations, causing problems downstream in my analyses. I was wondering if this is something that has been fixed since that old version.

Here are a couple of such annotation examples in GTF:

3R	protein_coding	exon	17202324	17202463	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "1"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R	protein_coding	exon	17177331	17177608	.	+	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R	protein_coding	exon	17203010	17203121	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "3"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R	protein_coding	exon	17202541	17202798	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R	protein_coding	start_codon	17202752	17202754	.	-	0	 gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R	protein_coding	exon	17200782	17201634	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "5"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";


3R	protein_coding	exon	17203010	17203121	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "1"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R	protein_coding	exon	17202541	17202798	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R	protein_coding	CDS	17202541	17202754	.	-	0	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
3R	protein_coding	start_codon	17202752	17202754	.	-	0	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R	protein_coding	exon	17202324	17202463	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "3"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R	protein_coding	CDS	17202324	17202463	.	-	2	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "3"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
3R	protein_coding	exon	17200782	17201634	.	-	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R	protein_coding	CDS	17200782	17201634	.	-	0	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
3R	protein_coding	exon	17177761	17178358	.	+	.	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "5"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R	protein_coding	CDS	17177761	17178011	.	+	0	 gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "5"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";

fruce-ki avatar May 07 '18 10:05 fruce-ki

Indeed this hasn't been fixed even in the current version, but it definitely should be addressed. I'll take a look at this soon.

gpertea avatar May 07 '18 14:05 gpertea

Extending this all the way to distant and inter-chromosomal trans-splicing would break the locality assumptions (the "locus" concept). Clustering transcripts into loci would become messy without splitting these distantly trans-spliced transcripts, like I've been doing.. I guess the way to make gffread work by default is to preserve trans-splicing for simple operations like conversions, filters and transcript sequence extraction, but still fall back to the old transcript splitting for any clustering and locus-based operations.. (this is more of a commentary/note to self)

gpertea avatar May 11 '18 19:05 gpertea

The example here was from trans-splicing between sense and antisense transcripts at the same locus. For the record, however, it originates in an old version of the annotation and is not present in newer versions of it, so it was likely an error. This could be more of a theoretical bug than a practical one, but I do not know anything about trans-splicing, I just stumbled on it as a bug in my workflow.

fruce-ki avatar May 11 '18 19:05 fruce-ki

Has this issue been addressed? I am working with lncRNA that trans-splice, sense and anti-sense, from the same mitochondrial gene.

jpetittowpi avatar Oct 08 '20 16:10 jpetittowpi