
Genbank input file appears to have fewer records than expected.

Open schellt opened this issue 1 year ago • 3 comments

Dear Augustus team, I wanted to retrain an Augustus model with the following command using Augustus 3.3.3 (not the latest version, but as far as I can see this does not affect the issue). I am shortening the paths for better readability:

autoAug.pl \
        --species=splicata \
        --genome=Splicata_final_assembly_smasked.fasta \
        --cdna=RNA_styela100.fasta \
        -v -v -v \
        --trainingset=splicata_round_1_2genome1_100.all.gff \
        --useexisting \
        --noninteractive

The error I receive is from optimize_augustus.pl. This is the last entry of STDOUT:

1 Running "perl optimize_augustus.pl --rounds=1 --species=splicata --trainOnlyUtr=1 --onlytrain=onlytrain.gb --metapars=splicata_metapars.utr.cfg train.gb --UTR=on > optimize.utr.out"...

And this is the complete output of STDERR:

Genbank input file appears to have fewer records than expected.
This could be a consequence of using DOS (Windows) carriage return symbols at line breaks. at /cluster/software/augustus/augustus-3.3.3/scripts/optimize_augustus.pl line 462, <TRAINGB> chunk 33444.
failed to execute: perl optimize_augustus.pl --rounds=1 --species=splicata --trainOnlyUtr=1 --onlytrain=onlytrain.gb  --metapars=splicata_metapars.utr.cfg train.gb --UTR=on > optimize.utr.out!
failed to execute: perl autoAugTrain.pl -g=genome_clean.fa -s=splicata --utr -e=cdna.f.psl --aug=augustus.gff -w=autoAug -v -v -v --opt=1 --useexisting

I tried to track down the error myself but I ended up with more questions than answers.

If I understand correctly, this error occurs when $nloci < @namelines (line 461).

@namelines contains the lines from train.gb matching the pattern "LOCUS", as defined in line 408:

408     my @namelines = grep /^LOCUS   +/, @seqlist;

In lines 446-455 the file train.gb is processed and for every line $nloci is increased by one:

 446     while (<TRAINGB>) {
 447         my $gendaten = $_;
 448         m/^LOCUS +(\S+) .*/;
 449         my $genname = $1;
 450 
 451         $bucket = $bucketmap{$genname};
 452         my $handle = $fh[$bucket];
 453         print $handle $gendaten;
 454         $nloci++;
 455     }

In short: $nloci reflects the number of lines of train.gb, and @namelines reflects the number of lines in train.gb that match "LOCUS". Now I am wondering whether the if statement in line 461 can ever become true, since @namelines is a subset of all lines of train.gb and thus can never be larger than the total number of lines of train.gb ($nloci).
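
One detail that may bear on this reasoning: Perl writes "chunk" instead of "line" in a die message when the input record separator $/ is not a newline, so the "<TRAINGB> chunk 33444" in STDERR suggests that the loop reads whole records rather than single lines. The three counts in question can be compared with a rough shell sketch like the one below. The function names are hypothetical, and it assumes the usual GenBank layout in which every record starts with a LOCUS line and ends with a line containing only //.

```shell
# Hypothetical helpers (not part of Augustus) for inspecting a GenBank file.
# Assumes each record starts with "LOCUS" and ends with a line holding only "//".

# Lines that start a record.
count_locus_lines() {
    awk '/^LOCUS/ { n++ } END { print n + 0 }' "$1"
}

# Record terminators, i.e. the number of complete records.
count_record_terminators() {
    awk '/^\/\/$/ { n++ } END { print n + 0 }' "$1"
}

# All lines, for comparison with the two counts above.
count_all_lines() {
    awk 'END { print NR }' "$1"
}
```

For a well-formed file, count_locus_lines train.gb and count_record_terminators train.gb should print the same number; a difference would point to records that a record-wise reader cannot split as expected.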

Furthermore, train.gb was created while running autoAug.pl on a Linux machine and was never in contact with Windows.
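
The absence of carriage returns can also be checked directly instead of inferred from the file's history. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: count lines that contain a carriage return (\r),
# the character left behind by DOS/Windows line endings.
count_cr_lines() {
    awk '/\r/ { n++ } END { print n + 0 }' "$1"
}
```

Running count_cr_lines train.gb prints 0 when the file contains no carriage returns.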

Let me know if you need any further information. Thank you very much in advance. Best, Tilman

schellt, Jul 12 '22 16:07

Please inspect the input files of the last command, train.gb and onlytrain.gb, manually. For example, they could be empty or contain too few genes (LOCUS) for a successful training. If they contain no genes at all, a possible cause is a mismatch between the gff and the fasta file, e.g. different sequence names, or coordinates in the gff that come from a different assembly version.
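
The sequence-name part of such a mismatch can be checked mechanically. The sketch below uses hypothetical function names and assumes a tab-separated GFF with the sequence name in column 1, and FASTA headers whose name is the text between > and the first whitespace. It does not detect coordinates taken from a different assembly version.

```shell
# Hypothetical helpers (not part of Augustus).

# Sequence names used in column 1 of a GFF file (comment lines skipped).
gff_seqnames() {
    awk -F '\t' '!/^#/ && NF > 1 { print $1 }' "$1" | sort -u
}

# Sequence names in a FASTA file: header text up to the first whitespace.
fasta_seqnames() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Names that occur in the GFF ($1) but not in the FASTA ($2).
# No output means every GFF sequence name exists in the FASTA.
gff_names_missing_from_fasta() {
    gff_tmp=$(mktemp) && fa_tmp=$(mktemp) || return 1
    gff_seqnames "$1" > "$gff_tmp"
    fasta_seqnames "$2" > "$fa_tmp"
    comm -23 "$gff_tmp" "$fa_tmp"
    rm -f "$gff_tmp" "$fa_tmp"
}
```

With the files from the original command, the call would be gff_names_missing_from_fasta splicata_round_1_2genome1_100.all.gff Splicata_final_assembly_smasked.fasta.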

MarioStanke, Jul 19 '22 14:07

Hi @MarioStanke, thanks for the reply. I don't think that there are too few genes, since in that case I would expect to see the error message from line 413: Number of training sequences is too small.

$ grep -c "LOCUS" train.gb onlytrain.gb 
train.gb:151
onlytrain.gb:3431

A mismatch between gff and fasta would be impossible, since autoAug.pl was executed. Do you have any other idea? Do you have any comment on my thoughts on the code above?

schellt, Jul 25 '22 12:07

Another possibility is that your input training gene set contains too few UTRs. You could test for this, and rule it out, by running with --noutr.
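
Before rerunning, a quick count of UTR features can give a first impression. This is only a sketch under assumptions: the function name is hypothetical, the annotations are taken to be in a tab-separated GFF file such as the one passed via --trainingset, and UTR features are assumed to carry "UTR" (in any letter case) in the feature type in column 3.

```shell
# Hypothetical helper: count GFF features whose type (column 3) mentions UTR.
# Assumption: UTR features are labelled with "UTR" somewhere in the type name,
# in any letter case; adjust the pattern if the file uses other names.
count_utr_features() {
    awk -F '\t' '!/^#/ && tolower($3) ~ /utr/ { n++ } END { print n + 0 }' "$1"
}
```

A result of 0, or a number far below the gene count, would be consistent with the explanation above.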

MarioStanke, Jul 25 '22 17:07