
Large genome - potential overflow error

Open tbrown91 opened this issue 2 years ago • 7 comments

Hi,

I am working on a very large genome (~20Gb) and am trying to train an augustus model using transcript data. I wonder if there is an issue in the GeneMark-ET step as it gives a negative genome size in the file GeneMark-ET/info/data.general:

LABEL data/dna.fna
SEQUENCE_SIZE -1174366646
SEQUENCE_ACGT 1973249640
NT_A 1814604624
NT_C 1320099751
NT_G 1321270732
NT_T 1812241829
NT_N 0
NT_X 1147351010
SEQUENCE_atcg 0
NT_a 0
NT_c 0
NT_g 0
NT_t 0
NT_n 0
NT_x 0
SEQUENCE_other 1147351010
NT_X 1147351010
GC -83.8
RECORDS 1069

I appreciate this may not be an inherent braker issue, but would appreciate any guidance on how to get around this. The input sequence is made up of lower- and upper-case nucleotides. I'm not sure where the "other" records are coming from.

Many thanks,

Tom

tbrown91, Sep 27 '21 13:09
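The negative SEQUENCE_SIZE above is what a signed 32-bit counter would show after wrapping around. The sketch below is only an illustration of that arithmetic: it assumes the counters were 32-bit signed integers (the thread does not confirm how GeneMark-ET stored them), and `wrap_int32` is a hypothetical helper, not part of GeneMark or BRAKER. The genome size used is the one reported later in this thread, after the fix.

```python
INT32_BITS = 32  # assumption: the original counters were signed 32-bit integers


def wrap_int32(value: int) -> int:
    """Return `value` as it would read back from a signed 32-bit integer."""
    value &= (1 << INT32_BITS) - 1        # keep only the low 32 bits
    if value >= 1 << (INT32_BITS - 1):    # top bit set: negative in two's complement
        value -= 1 << INT32_BITS
    return value


# Genome size reported after the fix, later in this thread.
true_size = 20_300_469_834
print(wrap_int32(true_size))  # -1174366646, the SEQUENCE_SIZE in the first report

# Sum of NT_A + NT_C + NT_G + NT_T from the first report.
acgt_sum = 1_814_604_624 + 1_320_099_751 + 1_321_270_732 + 1_812_241_829
print(acgt_sum, wrap_int32(acgt_sum))  # 6268216936 1973249640 (the reported SEQUENCE_ACGT)
```

Both wrapped values match the first report exactly, which is consistent with the totals overflowing while the individual per-base counts (each below 2^31) stayed intact.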

Hi Tom,

I am not sure whether the rest of GeneMark-ET depends on this info. I asked Alex Lomsadze, the original author of GeneMark-ET, whether this would cause any issues. I'll post his answer here when I get it.

Were you able to run the rest of BRAKER (if you proceeded with the run)?

Best, Tomas

tomasbruna, Oct 12 '21 14:10

Hi Tomas,

Thank you for following up. Hopefully Alex will get back to you. I ran BRAKER in two modes and neither output was ideal. The first run, with intron locations supplied as a hints file, crashed with a message that a dna.fa file was missing. The second run, with the BAM file from an RNA-seq run as hints, gave poor final results: I ran BUSCO on the inferred amino acid sequences and got a score of about 0.5%. Unfortunately, no errors were thrown along the way; this overflow is the first problem I found when trying to work out what had gone wrong.

All the best,

Tom

tbrown91, Oct 13 '21 09:10

Hi Tom,

Can you share the braker.log and also a file named GeneMark-ET.stdout, if you have it?

I'd like to take a look, maybe I'll see something else which is suspicious... I'll also share the files with Alex.

Best, Tomas

tomasbruna, Oct 13 '21 15:10

braker.upload.log GeneMark-ET.upload.stdout.txt

Hi Tomas,

This is from a run that crashed due to a timeout error, so the braker log is incomplete. The GeneMark-ET step finished, though, so its stdout file should be complete. I hope it is useful to you.

Thanks,

Tom

tbrown91, Oct 14 '21 08:10

Hi Tom,

Alex told me that he fixed the overflow error. Can you download the newest GeneMark version from http://exon.gatech.edu/GeneMark/license_download.cgi and test whether it helped?

Best, Tomas

tomasbruna, Oct 22 '21 20:10

Hi Tomas,

Thank you for the fix! It looks like the genome size is correct now, but the A/C/G/T/Other content doesn't look correct. Do you know if this is important? The input sequence is soft-masked, so the count of acgt=0 below is a bit concerning. I will let the run finish (it will take a few days) and have a look at the output then:

LABEL data/dna.fna
SEQUENCE_SIZE 20300469834
SEQUENCE_ACGT 6041540776
NT_A 1751020995
NT_C 1270295426
NT_G 1271612733
NT_T 1748611622
NT_N 0
NT_X 14258929058
SEQUENCE_atcg 0
NT_a 0
NT_c 0
NT_g 0
NT_t 0
NT_n 0
NT_x 0
SEQUENCE_other 14258929058
NT_X 14258929058
GC 42.1
RECORDS 1069

Many thanks again,

Tom

tbrown91, Oct 23 '21 10:10
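One way to sanity-check the corrected report above is to parse it and confirm that the totals add up, then count upper- and lower-case bases in the FASTA directly. The sketch below is an illustration only: `parse_stats` and `count_case` are hypothetical helpers, and the idea that the NT_X / SEQUENCE_other count corresponds to lower-case (soft-masked) bases is an untested assumption, not something the thread confirms.

```python
from collections import Counter

REPORT = (
    "LABEL data/dna.fna SEQUENCE_SIZE 20300469834 SEQUENCE_ACGT 6041540776 "
    "NT_A 1751020995 NT_C 1270295426 NT_G 1271612733 NT_T 1748611622 NT_N 0 "
    "NT_X 14258929058 SEQUENCE_atcg 0 NT_a 0 NT_c 0 NT_g 0 NT_t 0 NT_n 0 NT_x 0 "
    "SEQUENCE_other 14258929058 NT_X 14258929058 GC 42.1 RECORDS 1069"
)


def parse_stats(text: str) -> dict:
    """Parse whitespace-separated KEY VALUE pairs (works for one line or many)."""
    tokens = text.split()
    stats = {}
    for key, raw in zip(tokens[0::2], tokens[1::2]):
        try:
            stats[key] = int(raw)
        except ValueError:
            try:
                stats[key] = float(raw)
            except ValueError:
                stats[key] = raw  # e.g. the LABEL path
    return stats


def count_case(fasta_lines) -> Counter:
    """Count upper-case ACGT, lower-case acgt, and everything else in FASTA lines."""
    counts = Counter(upper=0, lower=0, other=0)
    for line in fasta_lines:
        if line.startswith(">"):  # skip header lines
            continue
        seq = line.strip()
        upper = sum(seq.count(base) for base in "ACGT")
        lower = sum(seq.count(base) for base in "acgt")
        counts["upper"] += upper
        counts["lower"] += lower
        counts["other"] += len(seq) - upper - lower
    return counts


stats = parse_stats(REPORT)
per_base = stats["NT_A"] + stats["NT_C"] + stats["NT_G"] + stats["NT_T"]
print(per_base == stats["SEQUENCE_ACGT"])                                     # True
print(stats["SEQUENCE_ACGT"] + stats["SEQUENCE_other"] == stats["SEQUENCE_SIZE"])  # True
print(f"other fraction: {stats['SEQUENCE_other'] / stats['SEQUENCE_SIZE']:.1%}")
```

The totals are internally consistent, and roughly 70% of the genome falls into the "other" bucket. To test the soft-masking assumption, `count_case` can be run over the real genome, for example `with open("genome.fa") as fh: print(count_case(fh))` (the filename is a placeholder); if the lower-case count matches SEQUENCE_other, the "other" bases are simply the masked ones. On a 20 Gb genome this pure-Python pass will be slow.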

Hi Tomas,

The run finished and the results were not ideal. Below are the BUSCO scores based on the augustus.hints.codingseq file:

C:0.2%[S:0.2%,D:0.0%],F:0.9%,M:98.9%

I have attached the braker log from the finished run in case it is useful.

Thanks,

Tom

braker.log

tbrown91, Oct 27 '21 14:10