BRAKER icon indicating copy to clipboard operation
BRAKER copied to clipboard

Combined GeneMark and Augustus gtf conversion issues.

Open Rob-murphys opened this issue 3 years ago • 12 comments

Whilst converting the myNew.gtf to EMBL i had some issues so posted on Biostars and someone from the Swedish Bioinformatics Institute commented the following:

Full post here

BTW something is wierd in your GTF file:

ANC-D_8901  AUGUSTUS    gene    1   783 0.7 +   .   g4417
ANC-D_8901  AUGUSTUS    transcript  1   783 0.7 +   .   g4417.t1
ANC-D_8901  AUGUSTUS    exon    670 783 .   +   .   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
ANC-D_8901  AUGUSTUS    stop_codon  781 783 .   +   0   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";

transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417"; must be transcript_id "g4417.t1"; gene_id "g4417"; AGAT will not group the features together because they do not have proper relationships. In top of that I don't get where is the CDS, as you have a stop_codon feature defined.

If you didn't modify the file, and this is the original BRAKER output you should get in touch with them to show that. There are issues

I thought it might be worth brining it to your guys attention.

Rob-murphys avatar Apr 19 '21 06:04 Rob-murphys

Can you do me a favor and send me an e-mail to @.*** with a contact from the Swedish Bioinformatics Institute?

I tend to disagree on both points content-wise (the gene id is a valid string, the CDS can be computed from the stop codon, however it is an incomplete gene), but we are of course interested in making BRAKER output compatible. That might be easier if I get in touch with someone there, directly.

That said, it might take me weeks to get to it. Totally swamped with work on short on daycare for the kid.

On Mon, Apr 19, 2021 at 8:52 AM Lamm-a @.***> wrote:

Whilst converting the myNew.gtf to EMBL i had some issues so posted on Biostars and someone from the Swedish Bioinformatics Institute commented the following:

Full post here https://www.biostars.org/p/9464353/#9465155

BTW something is wierd in your GTF file:

ANC-D_8901  AUGUSTUS    gene    1   783 0.7 +   .   g4417
ANC-D_8901  AUGUSTUS    transcript  1   783 0.7 +   .   g4417.t1
ANC-D_8901  AUGUSTUS    exon    670 783 .   +   .   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
ANC-D_8901  AUGUSTUS    stop_codon  781 783 .   +   0   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";

transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417"; must be transcript_id "g4417.t1"; gene_id "g4417"; AGAT will not group the features together because they do not have proper relationships. In top of that I don't get where is the CDS, as you have a stop_codon feature defined.

If you didn't modify the file, and this is the original BRAKER output you should get in touch with them to show that. There are issues

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/354, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JDQCZDXKS6XNQZNUN3TJPHJ3ANCNFSM43FF45XQ .

KatharinaHoff avatar Apr 19 '21 07:04 KatharinaHoff

Hello ^^, I'm the person that made those comments. I'm expert in genome annotation (I have worked 7 years exclusively in this field). I made an exhaustive review of th GFF/GTF format here https://agat.readthedocs.io/en/latest/gxf.html

The dogma in those formats is to have relationships between features to create a "record" (exon is part-of a mRNA which is part-of a gene, etc. All together are a "record"). The GFF use ID/Parent relationships to link features together. The GTF is easier and use gene_id, transcipt_id and share it with all features of a record like in this example:

ANC-D_8901  AUGUSTUS    gene    1   783 0.7 +   .   g4417
ANC-D_8901  AUGUSTUS    transcript  1   783 0.7 +   .   g4417.t1
ANC-D_8901  AUGUSTUS    exon    670 783 .   +   .   transcript_id "g4417.t1"; gene_id "g4417";
ANC-D_8901  AUGUSTUS    stop_codon  781 783 .   +   0   transcript_id "g4417.t1"; gene_id "g4417";

I don't know all the details about BRAKER but i was saying to @Lamm-a several things (assuming the example we get is raw output from BRAKER, properly extracted and not modified by the user) :

  • attribute g4417 for gene feature and g4417.t1 for transcript feature is wrong because attribute must be key-value pair. I know you are aware of it (the issue has been raised several times). This way of doing was only in the old fashioned GFF version 1.

  • In the following example it sounds BRAKER says that the features together represent a record, but if it is the case this is wrong because if we consider that g4417.t1 is the transcript feature identifier, the transcript_id attribute for exon and stop_codon features shoud be also g4417.t1 and not file_1_file_1_g4417.t.

ANC-D_8901  AUGUSTUS    gene    1   783 0.7 +   .   g4417
ANC-D_8901  AUGUSTUS    transcript  1   783 0.7 +   .   g4417.t1
ANC-D_8901  AUGUSTUS    exon    670 783 .   +   .   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
ANC-D_8901  AUGUSTUS    stop_codon  781 783 .   +   0   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";

The string itself is correct in the sense that the transcript_id and gene_id can contain any string, but the "syntax" is wrong because it does not hold the proper information.

  • Indeed the CDS is not mandatory to be present in the record, but it is awkward to have a stop codon which means that a CDS exists, and no CDS feature is present. I never seen that from Augustus output until now. If the user can deduce the CDS from it, it should be aware about the start_codon too. I'm wondering if the rest of the features e.g. start_codon and CDS are not present elsewhere in the file. Indeed, if I remember well it is not mandatory that related features should be consecutive in the file but it shoul be strongly recommanded. In this case the BRAKER output can be improved. Once again I never seen that until now in augustus output.

Best,

Jacques

Juke34 avatar Apr 19 '21 08:04 Juke34

Thanks a lot, @Juke34! The issue is here is not to rename the feature IDs to g4417, but to make them consistent across all lines. I didn't even see the inconsistency on my phone display this morning. That was indeed a serious bug. It should be fixed with commit https://github.com/Gaius-Augustus/BRAKER/commit/f38630c1cad3e11b525f84d517c7949cb4c2d7eb . @Lamm-a please make git pull and test whether it solves that problem for you.

As to the CDS: this particular example is an incomplete gene & transcript. It does not have a start codon. The CDS begins at 670 and ends at 780 (stop codon excluded) or 783 (stop codon included).

From the top of my head I am not quite sure where we drop the CDS features in BRAKER (AUGUSTUS does not drop them). But we only drop them because they are implicit from start/stop/exon features.

It is possible to suppress the prediction of incomplete genes with --augustus_args="--genemodel=complete" when running BRAKER. You'll then always have a start and a stop to infer the CDS.

(We will likely not fix the CDS feature problem because it will automatically resolve within the next couple of months because ETP mode will be completely changed/removed/replaced.)

KatharinaHoff avatar Apr 19 '21 15:04 KatharinaHoff

@KatharinaHoff no worries

To continue about the CDS (in case it is really dropped by BRAKER because I find that very strange. Did you check the @Lamm-a file?) The problem to not have the CDS feature for incomplete gene models is that you don't know where to start the translation if you want to get the partial protein. The CDS feature has information the phase column allowing to know in which phase is the first nucleotide. You may guess it by computing it from the stop codon but most of the tools wont (And Augustus do not always predict stop codon neither. Some gene models do not have either start and stop. So wihtout CDS feature and phase information it becomes really complicated to get the partial protein). Anyway most of the tools will ask for the CDS to perform any translation. Translation are important even for partial protein to perform functional annotation. You often find conserved domains in partial genes. Public archives (ENA, NCBI) accept partial CDS, but it has to be defined in the GFF before conversion. So those partial CDS are lost between annotation and submission if they are not included since the beginning in the GFF.

Interoperability and standardisation is a real problem in genome annotations. As BRAKER is widely used, it is good to show of good standards ^^. BRAKER improves over time, thank you for the efforts you put into this.

Looking forward for this ETP mode.

Juke34 avatar Apr 19 '21 15:04 Juke34

I am 100% sure that AUGUSTUS does not drop the CDS features (as AUGUSTUS the C++ based tool, AUGUSTUS scripts for postprocessing are a different story). AUGUSTUS can predict noncoding genes, but then there's no stop_codon and no start_codon. Also, BRAKER does not by default predict the noncoding genes.

@Lamm-a how did you create that output file? Are you using the most recent AUGUSTUS and BRAKER code from github?

KatharinaHoff avatar Apr 19 '21 16:04 KatharinaHoff

@KatharinaHoff This is just the standard braker.gtf output. I will check the exact version I am using.

The exact command is: braker.pl --cores=8 --fungus --genome=$input --prot_seq=$protDB --softmasking --workingdir=$outdir --species=$species --useexisting --skip_fixing_broken_genes --AUGUSTUS_CONFIG_PATH=/home/people/robmur/robmur/programmes/braker/config --AUGUSTUS_BIN_PATH=/services/tools/augustus/3.4.0/bin --AUGUSTUS_SCRIPTS_PATH=/home/people/robmur/robmur/programmes/braker/augustus_3.4.0/scripts --PROTHINT_PATH=/services/tools/prothint/2.6.0/bin/prothint.py

Note the --skip_fixing_broken_genes This was a result of the issue mentioned in #331 to which no fix worked.

Rob-murphys avatar Apr 19 '21 17:04 Rob-murphys

@Lamm-a could you copy pas a chunk of the output file, e.g. 50 lines?

Juke34 avatar Apr 19 '21 18:04 Juke34

Here is the whole output for braker.gtf PseudaA_sorted_masked.zip

Here is a combined out for : cat GeneMark-EP/genemark.gtf augustus.hints.gtf > myNew.gtf which Tomas suggested would be better for fungus PseudaA_sorted_masked_combined.gtf.zip

Rob-murphys avatar Apr 19 '21 19:04 Rob-murphys

Ok here is the full record of the first example... there is no CDS problem ^^. This is good to know.

PseudaA_8021	AUGUSTUS	CDS	2	31	0.82	+	0	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	2	31	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	32	84	1	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	85	239	1	+	0	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	85	239	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	240	288	1	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	289	400	1	+	1	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	289	400	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	401	452	0.98	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	453	556	0.98	+	0	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	453	556	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	557	606	1	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	607	686	1	+	1	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	607	686	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	687	743	1	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	744	754	1	+	2	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	744	754	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	755	800	1	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	801	916	1	+	0	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	exon	801	916	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	intron	917	964	1	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	CDS	965	1315	1	+	1	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
PseudaA_8021	AUGUSTUS	gene	2	1315	1.62	+	.	g4417
PseudaA_8021	AUGUSTUS	transcript	2	1315	0.81	+	.	g4417.t1
PseudaA_8021	AUGUSTUS	transcript	2	1315	0.81	+	.	g4417.t2
PseudaA_8021	AUGUSTUS	exon	965	1315	.	+	.	transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";

Juke34 avatar Apr 19 '21 19:04 Juke34

@Juke34 Excellent :) So the conversions you have suggested of removing longest isoforms and gtf > gff3 then gff3 > embl are all good to keep doing?

Rob-murphys avatar Apr 20 '21 05:04 Rob-murphys

Yes for what you want to achieve it sounds the best solution

Juke34 avatar Apr 20 '21 07:04 Juke34

Thanks a lot, @Juke34! The issue is here is not to rename the feature IDs to g4417, but to make them consistent across all lines. I didn't even see the inconsistency on my phone display this morning. That was indeed a serious bug. It should be fixed with commit f38630c . @Lamm-a please make git pull and test whether it solves that problem for you.

As to the CDS: this particular example is an incomplete gene & transcript. It does not have a start codon. The CDS begins at 670 and ends at 780 (stop codon excluded) or 783 (stop codon included).

From the top of my head I am not quite sure where we drop the CDS features in BRAKER (AUGUSTUS does not drop them). But we only drop them because they are implicit from start/stop/exon features.

It is possible to suppress the prediction of incomplete genes with --augustus_args="--genemodel=complete" when running BRAKER. You'll then always have a start and a stop to infer the CDS.

(We will likely not fix the CDS feature problem because it will automatically resolve within the next couple of months because ETP mode will be completely changed/removed/replaced.)

Dear Khatarina,

This error in the transcript names still remain after the fix.

miRNA67 avatar May 24 '21 20:05 miRNA67