gffcompare
gffcompare copied to clipboard
Error running gffcompare
I am trying to run the following command:
gffcompare -r ref.gff -G -o merged stringtie-merged.gtf
ref.gff - downloaded from NCBI
string-merged.gtf - obtained from stringtie --merge command
Error encountered : GFF Error: overlapping duplicate transcript feature (ID=gene29892)
When I grep "gene29892" from both the ref.gff and merged stringtie-merged.gtf
From ref.gff
NC_007957.1 RefSeq gene 74631 74744 . + . ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2
NC_007957.1 RefSeq gene 146276 147073 . + . ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2
NC_007957.1 RefSeq CDS 74631 74744 . + 0 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq CDS 146276 146507 . + 0 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq CDS 147048 147073 . + 2 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq exon 146276 146507 . + . ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12
NC_007957.1 RefSeq exon 147048 147073 . + . ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12
From stringtie-merged.gtf
NC_007957.1 StringTie transcript 74631 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 74631 74744 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 146276 146507 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 147048 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie transcript 146276 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 146276 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";
What could be possibly wrong as I cannot see any duplicate values!
Really? Are you really going to argue that you just don't see a duplicate ID gene29892 in those first two lines of your ref.gff quote you're showing here ? :)
While the error message might not be very accurate (there is no actual overlap there, for those two duplicated gene IDs), the problem still stands: the ID gene29892 has obvious transcript properties (parenting CDS and exon features) while it is being declared twice in the file (1st at 74631-74744, 2nd time at 146276-147073). This is not a valid GFF3 format, as my understanding is that the feature IDs should be unique.. (though indeed my GFF parser is a bit more lenient about this, accepting non-unique IDs if the features are on separate reference sequences.. which is not the case here).
Then things went downhill ("downstream") from there.. according to the merciless GIGO principle.. I hope you understand, it's hard to somehow automagically fix invalid input reference annotation data, or guess what the authors of those data really meant there -- e.g. that it was somehow a 2-part gene there (?!), so it's OK if they broke the ID uniqueness rule on a whim..
Leaving aside the cheeky fun I had with my reply above, it turns out that in fact I was in the wrong there -- not about the duplicate IDs, but about the validity of that annotation, as this is a special case of trans-splicing where actually the current GFF3 specification does allow for the same ID in the case of discontinuous features like trans-splicing and fusions. So please accept my belated apologies for the incorrect/incomplete answer -- and thanks for this trans-splicing example! I ran into this old closed issue while looking for trans-splicing examples so I can add trans-splicing support to my GFF parser (and thus to gffread, gffcompare etc.).
Hello Geo!
Glad to hear that. It's good that we are coming across such scenarios and I am happy that the project is on constant development. Taking it on a positive note.
Your hard work is much appreciated!
Regards Vijay