BRAKER icon indicating copy to clipboard operation
BRAKER copied to clipboard

Warning: invalid GTF record, transcript_id not found (when gffread is used with braker.gtf)

Open cfarkas opened this issue 2 years ago • 5 comments

Hi there,

I aim to annotate with BRAKER the latest Gallus gallus genome assembly (Mar. 2018 (GRCg6a/galGal6, https://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/), by using the fasta genome sequence and a single bam file as inputs. The run finished successfully, without errors. I tried to extract the assembled transcripts contained in the braker.gtf output with gffread (v0.12.7) as follows:

gffread -w assembled_transcripts.fa -g galGal6.fa braker.gtf

And the following warning raised :

Warning: invalid GTF record, transcript_id not found:
chr12   AUGUSTUS        transcript      8180450 8182241 0.1     -       .       g10294.t1
Warning: invalid GTF record, transcript_id not found:
chr21   AUGUSTUS        transcript      4778026 4778367 0.66    +       .       g17324.t1
Warning: invalid GTF record, transcript_id not found:
chr13   AUGUSTUS        transcript      17234987        17236202        0.26    +       .       g23947.t1
Warning: invalid GTF record, transcript_id not found:
chr10   AUGUSTUS        transcript      11109366        11117540        0.1     +       .       g27562.t1
Warning: invalid GTF record, transcript_id not found:
chr17   AUGUSTUS        transcript      8862995 8866122 0.26    +       .       g9800.t1
Warning: invalid GTF record, transcript_id not found:
chr25   AUGUSTUS        transcript      87722   126141  0.14    +       .       g28243.t1

It seems like many transcript were missing or gffread was unable to parsed those from the GTF file.

Also, I noticed that a higher number of assembled transcripts was extracted from gffread when comparing with the cds sequences contained in the augustus.hints.aa output:

grep ">" assembled_transcripts.fa -c
34247
grep ">" augustus.hints.aa -c
33603

This means that the assembled transcripts obtained from gffread are a mix of coding and non-coding sequences?

The BRAKER run was the following:

# BRAKER CALL: /home/wslab/braker2/bin/braker.pl --genome=../galGal6.fa --softmasking --cores 30 --bam=Illumina_merged.sorted.bam --workingdir=./braker1/

Best Regards,

Carlos Farkas

cfarkas avatar Feb 02 '22 15:02 cfarkas

Hi,

the official output of a BRAKER run is the augustus.hints.gtf gene set. The braker.gtf gene set is the result of merging augustus.hints.gtf with 'high-confidents' genes from the GeneMark prediction. However, the accuracy of the braker.gtf has not been sufficiently tested. Also, there is a known formatting issue with the braker.gtf file, which is probably the reason for your problems with gffread. Please try it again with augustus.hints.gtf. If you want to use the braker.gtf gene set you could use the script fix_gtf_ids.py from TSEBRA.

Best, Lars

LarsGab avatar Feb 03 '22 08:02 LarsGab

Hi @LarsGab and @KatharinaHoff

I am experiencing a similar error when using the augustus.hints.gtf or the tsebra.gtf when combining protein and RNAseq augustus.hints.gtf files. It seems that these files do not have the gene_id and transcript_id tags in column 9 for gene and transcript features, respectively. They only have the IDs, not the tags. All the other features do have the appropriate tags in column 9.

You might ask why I am trying to use gffread when Augustus provides Augustus/scripts/gffGetmRNA.pl and Augustus/scripts/gff2aa.pl for this. I am noticing that these latter scripts have problems when an annotation is partial due to running out of scaffold, either at the beginning or the end of the scaffold. The gff2aa.pl script doesn't appear to take into account the frame information in column 8. It assumes a 0 frame. The gffGetmRNA.pl complains when the transcript feature ends at the end of the scaffold, e.g. "Coordinate of g2622.t1 out of sequence range. 6356 >= 6356"

Cheers,

Chris

ckeeling avatar Feb 22 '22 03:02 ckeeling

Hi @cfarkas @LarsGab and @KatharinaHoff

If you add the gene_id and transcript_id tags as indicated in my comment above, gffread should not give you errors.

I wrote the following little script to do this:

#!/usr/bin/perl -w
#
#
# usage: cat INFILE.gtf ./Fix_Augustus_gtf.pl > OUTFILE.gtf
#
#
#
# what it does:
# Fixes lack of "gene_id" and "transcript_id" tags in column 9 of gtf file from BRAKER/Augustus
#
# CIKeeling 2022-02-24
#
#
use strict;

my @cols;
my $new_nine;
while (<>) {
	chomp;
	@cols = split /\t/, $_;
	$new_nine=$cols[8];
	if ($cols[2]eq"gene") {
		$new_nine="gene_id \"$cols[8]\"";
	} elsif ($cols[2]eq"transcript") {
		$new_nine="transcript_id \"$cols[8]\"";
	}
	print join "\t", $cols[0],$cols[1],$cols[2],$cols[3],$cols[4],$cols[5],$cols[6],$cols[7],$new_nine;
	print "\n";
}

This worked for me with gffread. However, to get the coding nucleotide sequence, and the corresponding amino acid sequence, I found the following to work, using, in part, the getAnnoFasta.pl script in the Augustus package:

# Fix lack of gene_id and transcript_id tags in gtf file column 9
cat tsebra.gtf | Fix_Augustus_gtf.pl > tsebra_fixed.gtf 

# convert gtf to coding and mRNA sequences
# note that the chop_cds option is important to get the proper frame for the translation
Augustus/scripts/getAnnoFasta.pl tsebra_fixed.gtf --seqfile=genome.fa --chop_cds

# translate coding sequence to proteins using transeq from emboss package v6.6.0
transeq -sequence tsebra_fixed.codingseq -frame 1 -outseq tsebra_fixed.aa

# remove the trailing "_1" from the seqid indicating frame 1 that was added by transeq
sed -i 's/_1$//g' tsebra_fixed.aa

The resulting protein sequences have the correct frame for those partial transcripts that start at an end of the scaffold. Hope this helps you, Chris

ckeeling avatar Feb 25 '22 03:02 ckeeling

Hi,

the official output of a BRAKER run is the augustus.hints.gtf gene set. The braker.gtf gene set is the result of merging augustus.hints.gtf with 'high-confidents' genes from the GeneMark prediction. However, the accuracy of the braker.gtf has not been sufficiently tested. Also, there is a known formatting issue with the braker.gtf file, which is probably the reason for your problems with gffread. Please try it again with augustus.hints.gtf. If you want to use the braker.gtf gene set you could use the script fix_gtf_ids.py from TSEBRA.

Best, Lars

Hi Lars,

Sorry for the delayed response, I was out of town. Thank you for the complete response, I appreciate. As suggested, I ran gffread using augustus.hints.gtf instead of braker.gtf, and the number of assembled transcripts was indeed the same number of deduced proteins in augustus.hints.aa:

gffread -w assembled_transcripts-AugustusHints.fa -g galGal6.fa augustus.hints.gtf

grep ">" assembled_transcripts-AugustusHints.fa -c
33603
grep ">" augustus.hints.aa -c
33603

But again, the same warning raised:

Warning: invalid GTF record, transcript_id not found:
chr3    AUGUSTUS        transcript      110672832       110700787       0.2     -       .       g30886.t1
Warning: invalid GTF record, transcript_id not found:
chr3    AUGUSTUS        transcript      110717733       110717975       0.86    +       .       g30887.t1
Warning: invalid GTF record, transcript_id not found:
chr3    AUGUSTUS        transcript      110761664       110781013       0.18    +       .       g30888.t1
Warning: invalid GTF record, transcript_id not found:
chr3    AUGUSTUS        transcript      110783933       110799038       0.42    -       .       g30889.t1
Warning: invalid GTF record, transcript_id not found:
chr3    AUGUSTUS        transcript      110799184       110821777       0.01    -       .       g30890.t1
Warning: invalid GTF record, transcript_id not found:
chr3    AUGUSTUS        transcript      110818905       110819316       0.43    -       .       g30891.t1

On the other hand, I also ran the fix_gtf_ids.py script on braker.gtf prior to gffread, as follows:

# execute fix_gtf_ids.py
wget https://raw.githubusercontent.com/Gaius-Augustus/TSEBRA/main/bin/fix_gtf_ids.py
chmod 755 fix_gtf_ids.py
python fix_gtf_ids.py --gtf braker.gtf --out braker.fix.gtf

# gffread
gffread -w assembled_transcripts.fix.fa -g galGal6.fa braker.fix.gtf

grep ">" assembled_transcripts.fix.fa -c
34247

And the number of assembled transcripts was also the same for braker.gtf (34247), regardless the use of fix_gtf_ids.py.

In summary:

  1. The number of assembled transcripts from augustus.hints.gtf are consistent with the number of proteins contained in augustus.hints.aa.
  2. Regardless the format issues of braker.gtf, more number of transcripts are resolved (and potentially more proteins). Do you think it is beneficial to run AUGUSTUS or TransDecoder using braker transcripts as input in order to discover more proteins?

Best Regards,

Carlos

cfarkas avatar Feb 25 '22 13:02 cfarkas

Hi,

regarding gffread: @ckeeling Thanks for your fix! I tested gffread with a small example and it looks like the 'gene' and 'transcript' feature lines are optional for the input format of gffread. So, you could remove these lines from your file, which would get rid of the warnings, or it might even be safe to just ignore the warnings (I haven't tested it enough to be sure). Or just use the fix from Chris.

@Carlos I'm not sure which of the two ways would produce the better result. It is possible that AUGUSTUS doesn't add a lot of new proteins to the gene set, since BRAKER already ran it for each locus. You could also try to run BRAKER2 (BRAKER in ep mode) with only a protein database as extrinsic evidence and combine the results with TSEBRA.

Also, consider running BRAKER with the species parameters of AUGUSTUS' chickens, these parameters should be better than the ones BRAKER trains itself, e.g. with the additional options --species chicken --useexisting .

Best, Lars

LarsGab avatar Feb 28 '22 11:02 LarsGab