Overlapping genes in gtf

Open HeleenDeWeerd opened this issue 2 years ago • 1 comments

Hi,

While looking through my BRAKER output I noticed something strange: some genes in the genome annotation overlap. Here is an example:

ptg000001l	GUSHR	gene	4658185	4661128	.	-	.	gene_id "g40370";
ptg000001l	AUGUSTUS	CDS	4658374	4658736	0.83	-	0	transcript_id "g40370.t1"; gene_id "g40370";
ptg000001l	GUSHR	gene	4658185	4661128	.	+	.	gene_id "g40371";
ptg000001l	AUGUSTUS	CDS	4658808	4658895	0.82	+	0	transcript_id "g40371.t1"; gene_id "g40371";
ptg000001l	AUGUSTUS	CDS	4659004	4659245	0.82	+	2	transcript_id "g40371.t1"; gene_id "g40371";
ptg000001l	GUSHR	gene	4658185	4661128	.	-	.	gene_id "g40372";
ptg000001l	AUGUSTUS	CDS	4659267	4659576	0.68	-	1	transcript_id "g40372.t1"; gene_id "g40372";
ptg000001l	AUGUSTUS	CDS	4659784	4659887	0.29	-	0	transcript_id "g40372.t1"; gene_id "g40372";
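
To get a list of all such cases rather than spotting them by eye, something like the following could be used. It is only a minimal sketch, not part of BRAKER or GUSHR: the function names `parse_gene_records` and `overlapping_gene_pairs` are made up for illustration, and it assumes the file has nine whitespace-separated columns with the gene identifier in the last one, as in the lines above.

```python
import re
from collections import defaultdict


def parse_gene_records(gtf_text):
    """Collect 'gene' rows as {seqid: [(start, end, strand, gene_id), ...]}."""
    genes = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace, keeping the attribute column in one piece.
        cols = line.split(None, 8)
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = re.search(r'gene_id "([^"]+)"', cols[8])
        # Assumption: if there is no gene_id "..." pair, the column holds a bare ID.
        gene_id = match.group(1) if match else cols[8].strip()
        genes[cols[0]].append((int(cols[3]), int(cols[4]), cols[6], gene_id))
    return genes


def overlapping_gene_pairs(genes):
    """Return (seqid, gene_a, gene_b) for every pair of overlapping gene spans."""
    pairs = []
    for seqid, records in genes.items():
        records = sorted(records)  # sorted by start coordinate first
        for i, (_, end_a, _, gene_a) in enumerate(records):
            for start_b, _, _, gene_b in records[i + 1:]:
                if start_b > end_a:
                    break  # later records start even further right
                pairs.append((seqid, gene_a, gene_b))
    return pairs
```

Reading the GTF into a string and calling `overlapping_gene_pairs(parse_gene_records(text))` should report the three genes above as three overlapping pairs. Strand is ignored on purpose, so genes on opposite strands are reported too.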

In most cases the CDS features are in different locations and only the gene coordinates overlap. In some cases, however, the CDS features overlap as well:

ptg000005l	GUSHR	gene	4281555	4285704	.	-	.	gene_id "g53981";
ptg000005l	AUGUSTUS	CDS	4281555	4281748	0.97	-	2	transcript_id "g53981.t1"; gene_id "g53981";
ptg000005l	AUGUSTUS	CDS	4284732	4284936	0.62	-	0	transcript_id "g53981.t1"; gene_id "g53981";
ptg000005l	AUGUSTUS	CDS	4285639	4285704	0.65	-	0	transcript_id "g53981.t1"; gene_id "g53981";
ptg000005l	GUSHR	gene	4281555	4285704	.	-	.	gene_id "g54006";
ptg000005l	AUGUSTUS	CDS	4281555	4281748	0.96	-	2	transcript_id "g54006.t1"; gene_id "g54006";
ptg000005l	AUGUSTUS	CDS	4284732	4284936	0.6	-	0	transcript_id "g54006.t1"; gene_id "g54006";
ptg000005l	AUGUSTUS	CDS	4285639	4285704	0.72	-	0	transcript_id "g54006.t1"; gene_id "g54006";

This is exactly the same gene model annotated under two different gene IDs.
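
Exact duplicates like this one can be found by grouping transcripts on their CDS coordinates. The snippet below is a rough sketch under the same assumptions as the GTF lines shown here (nine whitespace-separated columns, `transcript_id "..."` in the attribute column); `duplicate_cds_models` is a hypothetical helper name, not a BRAKER script.

```python
import re
from collections import defaultdict


def duplicate_cds_models(gtf_text):
    """Return lists of transcript IDs that share identical CDS coordinates."""
    cds_by_transcript = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace, keeping the attribute column in one piece.
        cols = line.split(None, 8)
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', cols[8])
        if match is None:
            continue
        key = (cols[0], cols[6], match.group(1))  # seqid, strand, transcript
        cds_by_transcript[key].append((int(cols[3]), int(cols[4])))

    # Transcripts with the same contig, strand and CDS intervals are duplicates.
    by_signature = defaultdict(list)
    for (seqid, strand, transcript), intervals in cds_by_transcript.items():
        signature = (seqid, strand, tuple(sorted(intervals)))
        by_signature[signature].append(transcript)
    return [sorted(ids) for ids in by_signature.values() if len(ids) > 1]
```

For the records above this should group `g53981.t1` and `g54006.t1` together. Scores and frames are deliberately left out of the comparison, because the two copies differ slightly in their score column.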

This happens throughout the annotation and makes visualization and analysis of the genome more complicated. Is there a specific reason for annotating genes this way? Is there a way to make sure each gene is only mapped to its own transcript?

Kind regards, Heleen

HeleenDeWeerd, Oct 11 '22 15:10

Hello, @HeleenDeWeerd

Did you solve this problem?

yuzhenpeng, Oct 18 '22 03:10

Hello @yuzhenpeng

Sadly no. I have seen that one of the GTF files created before the GUSHR step doesn't seem to have this issue, so it might be related to GUSHR. There are still some duplications in that GTF, but the problem is much less pronounced.

Regards, Heleen

HeleenDeWeerd, Oct 19 '22 10:10

We have the same issue -- lots of overlapping genes.

Huiting120, Feb 02 '23 18:02

We have no time to debug this at the moment. Please do not use UTR features of BRAKER for now.

KatharinaHoff, Feb 02 '23 18:02

I am closing this issue because we won't debug it. We now have a new script to decorate CDS-only transcripts with UTRs.

KatharinaHoff, Dec 22 '23 13:12