gffcompare icon indicating copy to clipboard operation
gffcompare copied to clipboard

Error running gffcompare

Open lakhujanivijay opened this issue 7 years ago • 3 comments

I am trying to run the following command:

gffcompare -r ref.gff -G -o merged stringtie-merged.gtf

ref.gff - downloaded from NCBI

string-merged.gtf - obtained from stringtie --merge command

Error encountered : GFF Error: overlapping duplicate transcript feature (ID=gene29892)

When I grep "gene29892" from both the ref.gff and merged stringtie-merged.gtf

From ref.gff

NC_007957.1	RefSeq	gene	74631	74744	.	+	.	ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2
NC_007957.1	RefSeq	gene	146276	147073	.	+	.	ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2
NC_007957.1	RefSeq	CDS	74631	74744	.	+	0	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	CDS	146276	146507	.	+	0	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	CDS	147048	147073	.	+	2	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	exon	146276	146507	.	+	.	ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12
NC_007957.1	RefSeq	exon	147048	147073	.	+	.	ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12

From stringtie-merged.gtf

NC_007957.1	StringTie	transcript	74631	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	74631	74744	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	146276	146507	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	147048	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	transcript	146276	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	146276	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892"; 

What could be possibly wrong as I cannot see any duplicate values!

lakhujanivijay avatar Apr 18 '17 11:04 lakhujanivijay

Really? Are you really going to argue that you just don't see a duplicate ID gene29892 in those first two lines of your ref.gff quote you're showing here ? :)

While the error message might not be very accurate (there is no actual overlap there, for those two duplicated gene IDs), the problem still stands: the ID gene29892 has obvious transcript properties (parenting CDS and exon features) while it is being declared twice in the file (1st at 74631-74744, 2nd time at 146276-147073). This is not a valid GFF3 format, as my understanding is that the feature IDs should be unique.. (though indeed my GFF parser is a bit more lenient about this, accepting non-unique IDs if the features are on separate reference sequences.. which is not the case here).

Then things went downhill ("downstream") from there.. according to the merciless GIGO principle.. I hope you understand, it's hard to somehow automagically fix invalid input reference annotation data, or guess what the authors of those data really meant there -- e.g. that it was somehow a 2-part gene there (?!), so it's OK if they broke the ID uniqueness rule on a whim..

gpertea avatar Apr 18 '17 16:04 gpertea

Leaving aside the cheeky fun I had with my reply above, it turns out that in fact I was in the wrong there -- not about the duplicate IDs, but about the validity of that annotation, as this is a special case of trans-splicing where actually the current GFF3 specification does allow for the same ID in the case of discontinuous features like trans-splicing and fusions. So please accept my belated apologies for the incorrect/incomplete answer -- and thanks for this trans-splicing example! I ran into this old closed issue while looking for trans-splicing examples so I can add trans-splicing support to my GFF parser (and thus to gffread, gffcompare etc.).

gpertea avatar May 22 '18 18:05 gpertea

Hello Geo!

Glad to hear that. It's good that we are coming across such scenarios and I am happy that the project is on constant development. Taking it on a positive note.

Your hard work is much appreciated!

Regards Vijay

lakhujanivijay avatar May 23 '18 04:05 lakhujanivijay