GBProcessor::getGeneList(): Could not read the following line in Genbank file.

Open YuntaoTan opened this issue 4 years ago • 6 comments

@KatharinaHoff, Hi, I am using augustus etraining to train on a very large genome, approximately 10 Gb. My training set is selected from TransDecoder results, and my gene models are very long, longer than 1 Mbp. I got the following error:

GBProcessor::getGeneList(): Could not read the following line in Genbank file.
gt ccacctataa taatcatatc ttatttaaaa atcatatgtt
Maximum line length is 
10000.

Encountered error after reading 2455 annotations.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Is there any limitation in etraining when I use it on big genomes or long gene models?

YuntaoTan avatar Jan 06 '20 13:01 YuntaoTan

I have the same problem.

phil622 avatar Feb 19 '22 11:02 phil622

In the current code, the default in genbank.hh is: #define GBMAXLINELEN 40000. If a single line in a Genbank file has more characters than that, you must use line breaks. Genbank format is human readable and therefore lines are broken, usually after about 100 characters. Alternatively, you can increase the 40000 in your code and recompile.

MarioStanke avatar Feb 21 '22 06:02 MarioStanke

@MarioStanke

Hi, I got the same error when I used etraining. So I changed 40000 to 500000 in genbank.hh (#define GBMAXLINELEN 40000) and recompiled, but the error arose again:

GBProcessor::getGeneList(): Could not read the following line in Genbank file.
tgct ccagtttcag acaaaccata
Maximum line length is 499998.

I checked the genbank file. Each line is 60 bp long, and none of the sequences is longer than 500000. So this approach doesn't work. Do you have any other suggestions? Looking forward to your reply.

aic123 avatar May 20 '22 09:05 aic123

Please double check whether any of the lines is longer than the maximum. If so, increase the maximum or simply introduce line breaks in the file. Usually GenBank files have limited line lengths.

MarioStanke avatar May 20 '22 18:05 MarioStanke

The lines are broken and the maximum line length in the genbank file is 81.
The sequences in my data are very long, so I increased the maximum line limit to 1000000. The error is the same, so I am not sure what causes it.

aic123 avatar May 23 '22 01:05 aic123

I ran into the same issue with augustus (version v3.4.0). I raised the maximum line length to 400000, but that still did not fix my problem. I wonder if there is a problem in the script itself...

Neo-xbx-00 avatar May 23 '22 12:05 Neo-xbx-00

I am also encountering the same error. The genbank file in which the error occurs is an intermediate file created by BRAKER. It has regular line breaks every 60 bp. If I use head to cut the file off at the line that triggers the error, the result is exactly 2 GB. Our genbank file is 4 GB. Thank you.

GBProcessor::getGeneList(): Could not read the following line in Genbank file.
tcaaaatttt tacacaaata caaaaaagct aggttaaagc aacaaggata tattaacact
Maximum line length is 
39998.
grep -n 'tcaaaatttt tacacaaata caaaaaagct aggttaaagc aacaaggata tattaacact' tmp_opt_Sp_1/curtrain-6
28289942:     1081 tcaaaatttt tacacaaata caaaaaagct aggttaaagc aacaaggata tattaacact
head -n 28289942 tmp_opt_Sp_1/curtrain-6 > tmpgb
ls -lh tmpgb
rw-r--r-- 1 xxxxxx xxxxxx 2.0G  8月 15 18:20 tmpgb

piroyon avatar Aug 15 '22 09:08 piroyon

Finally, we found a solution.

diff genbank.cc genbank.cc.org 
677c677
<     long fposb, fpose;
---
>     int fposb, fpose;

On most platforms an int in C++ is 32 bits wide, so its maximum value is 2,147,483,647 and it cannot hold a file position beyond that (about 2 GB).
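The limit can be illustrated with a small sketch. The helper offsetFits is hypothetical and not taken from genbank.cc; it only checks whether a byte offset is representable in a given signed integer type.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not Augustus code: true if a byte offset can be
// stored in the signed integer type T without loss.
template <typename T>
constexpr bool offsetFits(std::int64_t offset) {
    return offset >= static_cast<std::int64_t>(std::numeric_limits<T>::min())
        && offset <= static_cast<std::int64_t>(std::numeric_limits<T>::max());
}

// A position just past 2^31 - 1 does not fit in 32 bits,
// while a 4 GB offset fits easily in 64 bits.
static_assert(offsetFits<std::int32_t>(2147483647LL), "last valid 32-bit position");
static_assert(!offsetFits<std::int32_t>(2147483648LL), "first position that overflows");
static_assert(offsetFits<std::int64_t>(4LL * 1024 * 1024 * 1024), "4 GB offset");
```

Note that long is 64 bits on typical 64-bit Linux and macOS builds but only 32 bits on 64-bit Windows, so a type such as std::streamoff or long long may be the more portable choice for file positions.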

Sorry, I just noticed a hint here. https://github.com/Gaius-Augustus/Augustus/issues/353

piroyon avatar Aug 17 '22 08:08 piroyon

Good job. I'll try your method. Thanks.

aic123 avatar Aug 18 '22 03:08 aic123

The error message that included the maximal line length was also shown when the (implicit) maximal input file size was exceeded. This was about 2.1 GB on many machines. I have used Hiroyo's solution after checking that it works on files with more than 2^31 bytes. Please let me know if you still see this error after checking out the new version (master branch).

MarioStanke avatar Aug 19 '22 07:08 MarioStanke