BRAKER
BRAKER copied to clipboard
Warning: invalid GTF record, transcript_id not found (when gffread is used with braker.gtf)
Hi there,
I aim to annotate with BRAKER the latest Gallus gallus genome assembly (Mar. 2018 (GRCg6a/galGal6, https://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/), by using the fasta genome sequence and a single bam file as inputs. The run finished successfully, without errors. I tried to extract the assembled transcripts contained in the braker.gtf
output with gffread (v0.12.7) as follows:
gffread -w assembled_transcripts.fa -g galGal6.fa braker.gtf
And the following warning raised :
Warning: invalid GTF record, transcript_id not found:
chr12 AUGUSTUS transcript 8180450 8182241 0.1 - . g10294.t1
Warning: invalid GTF record, transcript_id not found:
chr21 AUGUSTUS transcript 4778026 4778367 0.66 + . g17324.t1
Warning: invalid GTF record, transcript_id not found:
chr13 AUGUSTUS transcript 17234987 17236202 0.26 + . g23947.t1
Warning: invalid GTF record, transcript_id not found:
chr10 AUGUSTUS transcript 11109366 11117540 0.1 + . g27562.t1
Warning: invalid GTF record, transcript_id not found:
chr17 AUGUSTUS transcript 8862995 8866122 0.26 + . g9800.t1
Warning: invalid GTF record, transcript_id not found:
chr25 AUGUSTUS transcript 87722 126141 0.14 + . g28243.t1
It seems like many transcript were missing or gffread was unable to parsed those from the GTF file.
Also, I noticed that a higher number of assembled transcripts was extracted from gffread
when comparing with the cds sequences contained in the augustus.hints.aa
output:
grep ">" assembled_transcripts.fa -c
34247
grep ">" augustus.hints.aa -c
33603
This means that the assembled transcripts obtained from gffread are a mix of coding and non-coding sequences?
The BRAKER run was the following:
# BRAKER CALL: /home/wslab/braker2/bin/braker.pl --genome=../galGal6.fa --softmasking --cores 30 --bam=Illumina_merged.sorted.bam --workingdir=./braker1/
Best Regards,
Carlos Farkas
Hi,
the official output of a BRAKER run is the augustus.hints.gtf
gene set. The braker.gtf
gene set is the result of merging augustus.hints.gtf
with 'high-confidents' genes from the GeneMark prediction. However, the accuracy of the braker.gtf
has not been sufficiently tested. Also, there is a known formatting issue with the braker.gtf
file, which is probably the reason for your problems with gffread
.
Please try it again with augustus.hints.gtf
. If you want to use the braker.gtf
gene set you could use the script fix_gtf_ids.py
from TSEBRA.
Best, Lars
Hi @LarsGab and @KatharinaHoff
I am experiencing a similar error when using the augustus.hints.gtf or the tsebra.gtf when combining protein and RNAseq augustus.hints.gtf files. It seems that these files do not have the gene_id and transcript_id tags in column 9 for gene and transcript features, respectively. They only have the IDs, not the tags. All the other features do have the appropriate tags in column 9.
You might ask why I am trying to use gffread when Augustus provides Augustus/scripts/gffGetmRNA.pl and Augustus/scripts/gff2aa.pl for this. I am noticing that these latter scripts have problems when an annotation is partial due to running out of scaffold, either at the beginning or the end of the scaffold. The gff2aa.pl script doesn't appear to take into account the frame information in column 8. It assumes a 0 frame. The gffGetmRNA.pl complains when the transcript feature ends at the end of the scaffold, e.g. "Coordinate of g2622.t1 out of sequence range. 6356 >= 6356"
Cheers,
Chris
Hi @cfarkas @LarsGab and @KatharinaHoff
If you add the gene_id and transcript_id tags as indicated in my comment above, gffread should not give you errors.
I wrote the following little script to do this:
#!/usr/bin/perl -w
#
#
# usage: cat INFILE.gtf ./Fix_Augustus_gtf.pl > OUTFILE.gtf
#
#
#
# what it does:
# Fixes lack of "gene_id" and "transcript_id" tags in column 9 of gtf file from BRAKER/Augustus
#
# CIKeeling 2022-02-24
#
#
use strict;
my @cols;
my $new_nine;
while (<>) {
chomp;
@cols = split /\t/, $_;
$new_nine=$cols[8];
if ($cols[2]eq"gene") {
$new_nine="gene_id \"$cols[8]\"";
} elsif ($cols[2]eq"transcript") {
$new_nine="transcript_id \"$cols[8]\"";
}
print join "\t", $cols[0],$cols[1],$cols[2],$cols[3],$cols[4],$cols[5],$cols[6],$cols[7],$new_nine;
print "\n";
}
This worked for me with gffread. However, to get the coding nucleotide sequence, and the corresponding amino acid sequence, I found the following to work, using, in part, the getAnnoFasta.pl script in the Augustus package:
# Fix lack of gene_id and transcript_id tags in gtf file column 9
cat tsebra.gtf | Fix_Augustus_gtf.pl > tsebra_fixed.gtf
# convert gtf to coding and mRNA sequences
# note that the chop_cds option is important to get the proper frame for the translation
Augustus/scripts/getAnnoFasta.pl tsebra_fixed.gtf --seqfile=genome.fa --chop_cds
# translate coding sequence to proteins using transeq from emboss package v6.6.0
transeq -sequence tsebra_fixed.codingseq -frame 1 -outseq tsebra_fixed.aa
# remove the trailing "_1" from the seqid indicating frame 1 that was added by transeq
sed -i 's/_1$//g' tsebra_fixed.aa
The resulting protein sequences have the correct frame for those partial transcripts that start at an end of the scaffold. Hope this helps you, Chris
Hi,
the official output of a BRAKER run is the
augustus.hints.gtf
gene set. Thebraker.gtf
gene set is the result of mergingaugustus.hints.gtf
with 'high-confidents' genes from the GeneMark prediction. However, the accuracy of thebraker.gtf
has not been sufficiently tested. Also, there is a known formatting issue with thebraker.gtf
file, which is probably the reason for your problems withgffread
. Please try it again withaugustus.hints.gtf
. If you want to use thebraker.gtf
gene set you could use the scriptfix_gtf_ids.py
from TSEBRA.Best, Lars
Hi Lars,
Sorry for the delayed response, I was out of town. Thank you for the complete response, I appreciate. As suggested, I ran gffread using augustus.hints.gtf
instead of braker.gtf
, and the number of assembled transcripts was indeed the same number of deduced proteins in augustus.hints.aa
:
gffread -w assembled_transcripts-AugustusHints.fa -g galGal6.fa augustus.hints.gtf
grep ">" assembled_transcripts-AugustusHints.fa -c
33603
grep ">" augustus.hints.aa -c
33603
But again, the same warning raised:
Warning: invalid GTF record, transcript_id not found:
chr3 AUGUSTUS transcript 110672832 110700787 0.2 - . g30886.t1
Warning: invalid GTF record, transcript_id not found:
chr3 AUGUSTUS transcript 110717733 110717975 0.86 + . g30887.t1
Warning: invalid GTF record, transcript_id not found:
chr3 AUGUSTUS transcript 110761664 110781013 0.18 + . g30888.t1
Warning: invalid GTF record, transcript_id not found:
chr3 AUGUSTUS transcript 110783933 110799038 0.42 - . g30889.t1
Warning: invalid GTF record, transcript_id not found:
chr3 AUGUSTUS transcript 110799184 110821777 0.01 - . g30890.t1
Warning: invalid GTF record, transcript_id not found:
chr3 AUGUSTUS transcript 110818905 110819316 0.43 - . g30891.t1
On the other hand, I also ran the fix_gtf_ids.py
script on braker.gtf
prior to gffread, as follows:
# execute fix_gtf_ids.py
wget https://raw.githubusercontent.com/Gaius-Augustus/TSEBRA/main/bin/fix_gtf_ids.py
chmod 755 fix_gtf_ids.py
python fix_gtf_ids.py --gtf braker.gtf --out braker.fix.gtf
# gffread
gffread -w assembled_transcripts.fix.fa -g galGal6.fa braker.fix.gtf
grep ">" assembled_transcripts.fix.fa -c
34247
And the number of assembled transcripts was also the same for braker.gtf (34247), regardless the use of fix_gtf_ids.py
.
In summary:
- The number of assembled transcripts from
augustus.hints.gtf
are consistent with the number of proteins contained inaugustus.hints.aa
. - Regardless the format issues of
braker.gtf
, more number of transcripts are resolved (and potentially more proteins). Do you think it is beneficial to run AUGUSTUS or TransDecoder using braker transcripts as input in order to discover more proteins?
Best Regards,
Carlos
Hi,
regarding gffread: @ckeeling Thanks for your fix! I tested gffread with a small example and it looks like the 'gene' and 'transcript' feature lines are optional for the input format of gffread. So, you could remove these lines from your file, which would get rid of the warnings, or it might even be safe to just ignore the warnings (I haven't tested it enough to be sure). Or just use the fix from Chris.
@Carlos I'm not sure which of the two ways would produce the better result. It is possible that AUGUSTUS doesn't add a lot of new proteins to the gene set, since BRAKER already ran it for each locus. You could also try to run BRAKER2 (BRAKER in ep mode) with only a protein database as extrinsic evidence and combine the results with TSEBRA.
Also, consider running BRAKER with the species parameters of AUGUSTUS' chickens, these parameters should be better than the ones BRAKER trains itself, e.g. with the additional options --species chicken --useexisting
.
Best, Lars