BRAKER
Large genome - potential overflow error
Hi,
I am working on a very large genome (~20Gb) and am trying to train an augustus model using transcript data. I wonder if there is an issue in the GeneMark-ET step as it gives a negative genome size in the file GeneMark-ET/info/data.general:
LABEL data/dna.fna SEQUENCE_SIZE -1174366646 SEQUENCE_ACGT 1973249640 NT_A 1814604624 NT_C 1320099751 NT_G 1321270732 NT_T 1812241829 NT_N 0 NT_X 1147351010 SEQUENCE_atcg 0 NT_a 0 NT_c 0 NT_g 0 NT_t 0 NT_n 0 NT_x 0 SEQUENCE_other 1147351010 NT_X 1147351010 GC -83.8 RECORDS 1069
I appreciate this may not be an inherent braker issue, but would appreciate any guidance on how to get around this. The input sequence is made up of lower- and upper-case nucleotides. I'm not sure where the "other" records are coming from.
Many thanks,
Tom
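The negative values in the line above are consistent with 32-bit signed integer wraparound. The following is a minimal sketch of that arithmetic, not GeneMark-ET's actual code: it assumes the counters are stored as signed 32-bit integers (the thread does not confirm this), and `wrap_int32` is a hypothetical helper. Using only the per-base counts reported in data.general, it reproduces the reported SEQUENCE_ACGT, SEQUENCE_SIZE, and GC values. The per-base counts themselves may also be affected by the overflow, so this only checks that the reported totals are internally consistent with a wraparound.

```python
def wrap_int32(n: int) -> int:
    """Interpret the low 32 bits of n as a signed two's-complement integer.

    Hypothetical helper: assumes the counters are signed 32-bit values.
    """
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


# Per-base counts copied from the data.general line above.
nt_a, nt_c, nt_g, nt_t = 1814604624, 1320099751, 1321270732, 1812241829
other = 1147351010  # SEQUENCE_other / NT_X

acgt = nt_a + nt_c + nt_g + nt_t          # exact sum, larger than 2**32
acgt_wrapped = wrap_int32(acgt)           # value a 32-bit counter would hold
size_wrapped = wrap_int32(acgt + other)   # wrapped total sequence size
gc_wrapped = 100.0 * wrap_int32(nt_c + nt_g) / acgt_wrapped

print(acgt_wrapped)           # 1973249640, the reported SEQUENCE_ACGT
print(size_wrapped)           # -1174366646, the reported SEQUENCE_SIZE
print(round(gc_wrapped, 1))   # -83.8, the reported GC
```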
Hi Tom,
I am not sure whether the rest of GeneMark-ET depends on this info. I asked Alex Lomsadze, the original author of GeneMark-ET, whether this would cause any issues. I'll post his answer here when I get it.
Were you able to run the rest of BRAKER (if you proceeded with the run)?
Best, Tomas
Hi Tomas,
Thank you for following up. Hopefully Alex will get back to you. I ran BRAKER in two modes and neither output was ideal. The first run, with intron locations supplied as a hints file, crashed saying a dna.fa file was missing. The second run, with the BAM file from an RNA-seq run as hints, gave poor final results: I ran BUSCO on the inferred amino acid sequences and got a score of about 0.5%. Unfortunately, no errors were thrown along the way; this overflow is the first problem I ran into when trying to see what had gone wrong.
All the best,
Tom
Hi Tom,
Can you share the braker.log and also a file named GeneMark-ET.stdout, if you have it?
I'd like to take a look, maybe I'll see something else which is suspicious... I'll also share the files with Alex.
Best, Tomas
braker.upload.log GeneMark-ET.upload.stdout.txt
Hi Tomas,
This is a run that crashed due to a timeout error, so the braker log is incomplete. The GeneMark-ET step finished, though, so the GeneMark-ET stdout should be complete. I hope it is somehow useful to you.
Thanks,
Tom
Hi Tom,
Alex told me that he fixed the overflow error. Can you download the newest GeneMark version from http://exon.gatech.edu/GeneMark/license_download.cgi and test whether it helped?
Best, Tomas
Hi Tomas,
Thank you for the fix! It looks like the genome size is correct now, but the A/C/G/T/Other content doesn't look correct. Do you know if this is important? The input sequence is soft-masked, so the count of acgt=0 below is a bit concerning. I will let the run finish (it will take a few days) and have a look at the output then:
LABEL data/dna.fna SEQUENCE_SIZE 20300469834 SEQUENCE_ACGT 6041540776 NT_A 1751020995 NT_C 1270295426 NT_G 1271612733 NT_T 1748611622 NT_N 0 NT_X 14258929058 SEQUENCE_atcg 0 NT_a 0 NT_c 0 NT_g 0 NT_t 0 NT_n 0 NT_x 0 SEQUENCE_other 14258929058 NT_X 14258929058 GC 42.1 RECORDS 1069
Many thanks again,
Tom
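The totals in the line above are at least internally consistent: SEQUENCE_ACGT plus SEQUENCE_other equals SEQUENCE_SIZE, and (NT_C + NT_G) / SEQUENCE_ACGT gives the reported GC of 42.1. One possible reading, which the thread does not confirm, is that the soft-masked lower-case bases are being counted under "other". A way to check independently is to tally the character classes in the FASTA directly. The sketch below is a minimal, assumption-laden example: `tally_fasta` is a hypothetical helper, the class names are illustrative rather than GeneMark-ET's own definitions, and pure Python will be slow on a 20 Gb genome.

```python
import io
from collections import Counter


def tally_fasta(handle):
    """Count upper-case ACGT, lower-case acgt, N/n, and all other characters.

    `handle` is any iterable of text lines in FASTA format.
    Returns (number_of_records, Counter_of_class_totals).
    """
    totals = Counter()
    records = 0
    for line in handle:
        if line.startswith(">"):
            records += 1  # header line, not sequence
            continue
        seq = line.strip()
        upper = sum(seq.count(base) for base in "ACGT")
        lower = sum(seq.count(base) for base in "acgt")
        n_count = seq.count("N") + seq.count("n")
        totals["ACGT"] += upper
        totals["acgt"] += lower
        totals["Nn"] += n_count
        totals["other"] += len(seq) - upper - lower - n_count
    totals["size"] = sum(totals.values())
    return records, totals


# Tiny in-memory example; for a real genome, pass an open file handle instead.
demo = io.StringIO(">chr1\nACGTacgtNN\n>chr2\nacgtXX\n")
records, totals = tally_fasta(demo)
print(records, dict(totals))
```

If the lower-case total from such a tally is close to the SEQUENCE_other value reported by GeneMark-ET, that would support the reading that soft-masked bases are ending up in the "other" class.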
Hi Tomas,
The run finished and the results were not ideal. Below are the BUSCO scores based on the augustus.hints.codingseq file:
C:0.2%[S:0.2%,D:0.0%],F:0.9%,M:98.9%
I have attached the braker log from the finished run in case it is useful.
Thanks,
Tom