Augustus
Genbank input file appears to have fewer records than expected.
Dear Augustus team, I wanted to retrain an Augustus model with the following command using Augustus 3.3.3 (not the latest version, but as far as I can see this does not influence the issue). [I am shortening the paths for better readability.]
autoAug.pl \
--species=splicata \
--genome=Splicata_final_assembly_smasked.fasta \
--cdna=RNA_styela100.fasta \
-v -v -v \
--trainingset=splicata_round_1_2genome1_100.all.gff \
--useexisting \
--noninteractive
The error I receive is from optimize_augustus.pl.
This is the last entry of STDOUT:
1 Running "perl optimize_augustus.pl --rounds=1 --species=splicata --trainOnlyUtr=1 --onlytrain=onlytrain.gb --metapars=splicata_metapars.utr.cfg train.gb --UTR=on > optimize.utr.out"...
And this is the complete output of STDERR:
Genbank input file appears to have fewer records than expected.
This could be a consequence of using DOS (Windows) carriage return symbols at line breaks. at /cluster/software/augustus/augustus-3.3.3/scripts/optimize_augustus.pl line 462, <TRAINGB> chunk 33444.
failed to execute: perl optimize_augustus.pl --rounds=1 --species=splicata --trainOnlyUtr=1 --onlytrain=onlytrain.gb --metapars=splicata_metapars.utr.cfg train.gb --UTR=on > optimize.utr.out!
failed to execute: perl autoAugTrain.pl -g=genome_clean.fa -s=splicata --utr -e=cdna.f.psl --aug=augustus.gff -w=autoAug -v -v -v --opt=1 --useexisting
I tried to track down the error myself but I ended up with more questions than answers.
If I understand correctly, this error occurs when $nloci < @namelines (line 461).
@namelines contains the lines from train.gb matching the pattern "LOCUS", as defined in line 408:
408 my @namelines = grep /^LOCUS +/, @seqlist;
In lines 446-455 the file train.gb is processed and for every line $nloci is increased by one:
446 while (<TRAINGB>) {
447 my $gendaten = $_;
448 m/^LOCUS +(\S+) .*/;
449 my $genname = $1;
450
451 $bucket = $bucketmap{$genname};
452 my $handle = $fh[$bucket];
453 print $handle $gendaten;
454 $nloci++;
455 }
In short: $nloci reflects the number of lines of train.gb and @namelines reflects the number of lines in train.gb that match "LOCUS".
Now I am wondering whether the if statement in line 461 can ever become true, since @namelines is a subset of all lines of train.gb and thus can never be larger than the total number of lines of train.gb ($nloci).
Furthermore, train.gb was created while running autoAug.pl on a Linux machine and was never in contact with Windows.
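Both points can be checked directly on the file. The sketch below is in shell and rests on an assumption that is not confirmed in this thread: that the loop in lines 446-455 reads whole GenBank records ending in a // line rather than single lines. The word "chunk" in the Perl error is what Perl prints in place of "line" when the input record separator $/ is not a newline, which would fit that reading. count_gb_records is a hypothetical helper name, not part of Augustus.

```shell
# Hypothetical helper (not part of Augustus): compare the number of LOCUS
# header lines with the number of "//" record terminators in a GenBank file,
# and count terminators that carry a trailing carriage return.
count_gb_records() {
    gb=$1
    cr=$(printf '\r')   # a literal carriage return character

    # Header lines, matched in the same spirit as the pattern in line 408
    loci=$(grep -c '^LOCUS ' "$gb")
    # Terminator lines consisting of exactly "//"
    terminators=$(grep -c '^//$' "$gb")
    # Terminator lines of the form "//" followed by a carriage return
    cr_terminators=$(grep -c "^//${cr}\$" "$gb")

    printf 'loci=%s terminators=%s cr_terminators=%s\n' \
        "$loci" "$terminators" "$cr_terminators"
}
```

Running count_gb_records train.gb prints the three counts on one line. If loci is larger than terminators, or cr_terminators is non-zero, the record boundaries in the file are worth a closer look.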
Let me know, if you need any further information. Thank you very much in advance. Best, Tilman
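Please have a look at the input files to the last command, train.gb and onlytrain.gb, manually. For example, they could be empty or have too few genes (LOCUS) for a successful training.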
If they contain no genes at all, a possible cause could be a mismatch between the gff and fasta file, e.g. different sequence names or the coordinates in the gff are from a different assembly version.
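The sequence-name part of such a mismatch can be checked with a few standard tools. This is a minimal sketch, assuming a tab-separated GFF with the sequence name in column 1 and FASTA headers whose name ends at the first whitespace. missing_seqnames is a hypothetical helper, not part of Augustus, and it does not check coordinates.

```shell
# Hypothetical helper (not part of Augustus): print sequence names that occur
# in column 1 of a GFF file but have no matching header in a FASTA file.
missing_seqnames() {
    gff=$1
    fasta=$2
    gff_names=$(mktemp)
    fasta_names=$(mktemp)

    # Column 1 of every non-comment, tab-separated GFF line
    grep -v '^#' "$gff" | awk -F '\t' 'NF > 1 { print $1 }' | sort -u > "$gff_names"
    # FASTA header without the leading ">" and cut at the first whitespace
    grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$fasta_names"

    # Lines unique to the first file: names in the GFF that the FASTA lacks
    comm -23 "$gff_names" "$fasta_names"

    rm -f "$gff_names" "$fasta_names"
}
```

With the files from the original command this would be called as missing_seqnames splicata_round_1_2genome1_100.all.gff Splicata_final_assembly_smasked.fasta. Empty output means every sequence name in the GFF also appears in the FASTA.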
Hi @MarioStanke ,
thanks for the reply.
I don't think that there are too few genes, since in that case I would expect to see the error message from line 413: Number of training sequences is too small.
$ grep -c "LOCUS" train.gb onlytrain.gb
train.gb:151
onlytrain.gb:3431
A mismatch between gff and fasta would be impossible, since autoAug.pl was executed.
Do you have any other idea? Do you have any comment on my thoughts on the code above?
Another idea may be that your input training gene set contains too few UTRs. You could try to diagnose this (or eliminate the possibility) with --noutr.