Error during BRAKER3 RNASeq+Proteins
Hello,
I've tried to annotate genomes with both RNASeq data and a protein database. For certain genomes it worked, but for others it failed with errors similar to those described in #582 and #577 (see logs below). However, the cause seems to be different, because I have already applied the solutions proposed in those issues:
- RNASeq alignments were produced with HISAT2 2.1.2 using the --dta option
- Proteins consist of OrthoDB v11 (as prepared by T. Bruna) concatenated with proteins from close relative species, with no descriptions in the headers and no "*" or "." characters in the sequences (see the command sketch below). This database was successfully used for annotating other genomes without RNASeq data.
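For reference, the preprocessing looked roughly like this (a minimal sketch; the index name, read files, and FASTA names below are placeholders, not my actual paths):

```bash
# Align RNASeq reads with HISAT2 using --dta, which reports alignments
# tailored for downstream transcript assembly, as needed by GeneMark-ETP
hisat2-build genome.softmasked.fasta genome_idx
hisat2 --dta -x genome_idx -1 reads_1.fq.gz -2 reads_2.fq.gz -S rnaseq.sam
samtools sort -o rnaseq.bam rnaseq.sam

# Strip descriptions from the relatives' FASTA headers (keep only the ID)
# and remove any '*' or '.' characters from the sequence lines,
# then concatenate with the OrthoDB v11 proteins
awk '/^>/ {print $1; next} {gsub(/[*.]/, ""); print}' relatives_raw.fa > relatives_clean.fa
cat odb11_proteins.fa relatives_clean.fa > proteins.fa
```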
I am using the singularity container pulled on 28/03/2023.
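I built the image along the lines of the BRAKER README (the tag shown is the current one, which may differ from what was available at that date):

```bash
# Build the BRAKER3 Singularity image from the Docker Hub container
# (teambraker/braker3; the exact tag available in March 2023 may have differed)
singularity build braker3.sif docker://teambraker/braker3:latest
```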
Do you have any idea what's going on and how to solve this issue?
Thank you!
Here are the different logs:
GeneMark-ETP.stderr

```
FASTA index file /home/ICE/jkeller/Eidmes/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
error on open file /opt/ETP/bin/gmes/build_mod.pl: ATG.mat
sed: can't read output.mod: No such file or directory
13-Apr-23 17:50:48 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
13-Apr-23 17:50:48 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /opt/ETP/bin/gmes/gmes_petap.pl --seq /home/ICE/jkeller/Eidmes/GeneMark-ETP/proteins.fa/penalty/contigs0jjh2vrp.fasta --soft_mask 1000 --max_mask 40000 --predict_with /home/ICE/jkeller/Eidmes/GeneMark-ETP/proteins.fa/model/output.mod --cores 8 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
  File "/opt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
    main()
  File "/opt/ETP/bin/printRnaAlternatives.py", line 289, in main
    candidates = loadIntrons(args.genemark)
  File "/opt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
    for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.
error on open file /opt/ETP/bin/gmes/build_mod.pl: ATG.mat
sed: can't read output.mod: No such file or directory
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /opt/ETP/bin/format_back.pl line 14.
Died at /opt/ETP/bin/format_back.pl line 14.
```

End of GeneMark-ETP.stdout

```
from file training.list parsed IDs: 5832
not found in input: 0
done
error: all sequences not of same length
on line 878: TCCTATGGTT
30630 27048 3582 88.31 p_hints_nonhc.gtf
75249 27048 48201 35.94 r_hints_nonhc.gtf
number of transcripts in file: 6638 /home/ICE/jkeller/Eidmes/GeneMark-ETP/proteins.fa/genemark_supported.gtf
number of genes in set: 5832
removed partial: 0
genes found for training: 5832
error: all sequences not of same length
on line 2185: TCCTATGGTT
```

get_etp_hints.stderr

```
Died at /opt/ETP/bin/format_back.pl line 14.
```
@MarioStanke @alexlomsadze this seems to be another GeneMark-ETP issue.
Hi @jeankeller, have you figured out what happened? I have the same situation as you: I used OrthoDB v11 and it worked for one genome, but when I tried to predict another genome with the same database, it showed the same errors as yours. I also made sure that I used the --dta option in my HISAT2 alignment. Any suggestions? Thanks.
Hi @scintilla9,
Unfortunately, no. I tried modifying some parameters (such as the minimal contig length) and different BRAKER3 versions, but nothing improved the situation. We contacted the GeneMark developers 3 months ago but have received no answer yet. A similar issue on GitHub suggested increasing the amount of RNASeq data, but in my case I have samples where a single RNASeq dataset succeeded while another failed, and conversely samples with ~100 RNASeq datasets that failed with the same error... We ended up removing the RNASeq data and relying on protein homology alone to annotate those genomes. What is weird is that BRAKER "successfully" finished and produced the final output files despite the failure of GeneMark (@KatharinaHoff, maybe something to add to the BRAKER log?). We also resequenced some libraries to check whether adding more RNASeq data could solve the problem; we'll have the answer next week. Sorry not to have better news.
Hi @jeankeller
Thanks for sharing your experiences. That helps a lot.
In my case, the failed one actually contains more RNASeq data than the successful one (judging by the read counts and the mapping rate; their genome sizes are both about 1 Gb). I think for now I will try to annotate the genome with RNASeq data and protein homology separately, then combine the results with TSEBRA (rough sketch below). And of course, this assumes both runs finish successfully. (Hi @KatharinaHoff, does this sound reasonable to you?)
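Something along these lines is what I have in mind (just a sketch; the genome, read, and protein file names and the working directories are placeholders, and I'd start from the TSEBRA default config):

```bash
# Two separate BRAKER runs on the same genome, one per evidence type
braker.pl --genome=genome.fa --bam=rnaseq.bam --workingdir=braker_rna
braker.pl --genome=genome.fa --prot_seq=proteins.fa --workingdir=braker_prot

# Combine the two gene sets with TSEBRA, passing both hint files as evidence
tsebra.py -g braker_rna/augustus.hints.gtf,braker_prot/augustus.hints.gtf \
          -e braker_rna/hintsfile.gff,braker_prot/hintsfile.gff \
          -c TSEBRA/config/default.cfg \
          -o braker_combined.gtf
```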
Thanks.
Hi, @jeankeller
I ended up succeeding with BRAKER3 by decreasing the number of scaffolds in the genome. With about 1.4K scaffolds (scaffold length > 10,000 bp), BRAKER3 with RNASeq+protein data finished without the error above, but with 5.2K scaffolds (scaffold length > 5,000 bp), the same error occurred.
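In case it's useful, the length filtering can be done with e.g. seqkit (one option among many; file names are placeholders):

```bash
# Keep only scaffolds of at least 10 kb before running BRAKER3
# (-m sets the minimum sequence length to keep)
seqkit seq -m 10000 genome.fa > genome.min10kb.fa
```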
Hope this helps.
Hi @scintilla9
Thanks for the update. I also considered this, but it turns out that some genomes with more than 10K contigs were successfully annotated with protein+RNASeq data, while some with fewer than 1K scaffolds failed. Maybe it's related to the distribution of the RNASeq data along the contigs... We sent multiple example files to the GeneMark dev team but haven't received any answer in months.
Best, Jean
Hi @scintilla9, the GeneMark team found the issue and fixed it (see #648). GeneMark has to be updated to version 1.02; I don't know when the fix will be uploaded to the singularity container.
Best, Jean
Hi @jeankeller, thanks for the information, will keep an eye on that.