BRAKER2 failing to collect RNA-seq hints (augustus.tmp1.gff, then augustus.hints.gff) during 8-CPU operation

Open SchwarzEM opened this issue 3 years ago • 14 comments

I am trying to run BRAKER2 with 8 CPUs on a moderately large and complex nematode genome (150 Mb) for which I have RNA-seq data (in an indexed and sorted BAM file). Try as I may, I can only get BRAKER2 to work up to the point that it creates optimized species-specific AUGUSTUS parameters, splits the genome 8 ways, and generates many different RNA-seq hints. But it then fails at the step where it needs to collect all of those separate hints into a single file ("augustus.hints.gff") that will support genome-wide gene prediction by AUGUSTUS.

All of the software I am running was locally compiled from source code. In particular, BRAKER2 was downloaded from github rather than from bioconda.

For AUGUSTUS, I encountered a strange bug with etraining compiled from the most recent github version, but did not have this error with AUGUSTUS 3.3.3. Since BRAKER2 depends heavily on auxiliary AUGUSTUS Perl scripts that have been changed since 3.3.3, I set up a hybrid AUGUSTUS where the binaries were compiled from 3.3.3 but the Perl scripts were copied from the very latest github version (downloaded yesterday).

Here are the final messages in braker.log that BRAKER2 emitted immediately before it failed and stopped running:

# Sat Jul 11 01:22:56 2020: Joining AUGUSTUS predictions in directory /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus_tmp

# Sat Jul 11 01:22:56 2020: Concatenating AUGUSTUS output files in /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus_tmp
perl /home/ems/src/augustus-3.3.3/scripts/join_aug_pred.pl < /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus.tmp1.gff > /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus.hints.gff

Another error message was sent to STDERR. It is:

sh: 1: cannot open /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus.tmp1.gff: No such file

ERROR in file /home/ems/src/BRAKER2_2020.07.09/scripts/braker.pl at line 9251

Failed to execute perl /home/ems/src/augustus-3.3.3/scripts/join_aug_pred.pl \ 
 < /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus.tmp1.gff \
 > /home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/augustus.hints.gff

As the STDERR message indicates, there is no augustus.tmp1.gff file present in the working directory, so whichever Perl script in BRAKER2 or AUGUSTUS was supposed to collect the individual hints files clearly did not function. I cannot easily tell from the various log and error files which specific Perl script that was supposed to be. Note that, although augustus.tmp1.gff is missing, BRAKER2 did successfully create 121 *.hints files (e.g., 72.001.Raxei_Eur_v1.29.fa.1..1252500.hints) in the directory augustus_tmp; these files contain a total of 244,215 lines of text. There was clearly raw material from which an augustus.tmp1.gff might have been generated! But it was not.
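
For anyone checking the same thing on their own run, here is a minimal sketch of the kind of tally described above. The function name summarize_chunks is hypothetical (it is not part of BRAKER2 or AUGUSTUS), and which extensions are worth checking in augustus_tmp is an assumption; pass whichever ones your run produces.

```shell
# Hypothetical helper, not part of BRAKER2 or AUGUSTUS.
# Usage: summarize_chunks DIR EXT [EXT...]
# For each extension, report how many files exist directly inside DIR
# and how many of those are empty (zero bytes).
summarize_chunks() {
    dir=$1
    shift
    for ext in "$@"; do
        total=$(find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l | tr -d ' ')
        empty=$(find "$dir" -maxdepth 1 -type f -name "*.$ext" -size 0 | wc -l | tr -d ' ')
        printf '%s\tfiles=%s\tempty=%s\n' "$ext" "$total" "$empty"
    done
}
```

Something like `summarize_chunks augustus_tmp hints gff` would then show at a glance whether the per-chunk files exist and whether any of them came out empty.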

So, my questions:

Has anybody encountered this particular bug before?

Can anybody describe what particular script or program of BRAKER2 (or AUGUSTUS) is supposed to be gathering up the individual hints files, and where I should be able to find a working copy?

SchwarzEM avatar Jul 11 '20 18:07 SchwarzEM

It may be helpful to describe the setup and command lines that I used to run BRAKER2, so here they are:

# Set up and get into workspace:
mkdir $HOME/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01 ;
cd $HOME/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01 ;
# Set up local and completely writeable AUGUSTUS config directory:
rsync -av $HOME/src/augustus-3.3.3/config . ;
mv -i config local_augustus_config ;
chmod a+w -R local_augustus_config ;
# Start batch job:
sbatch job_farm_raxei_braker2_rna-seq_only_2020.07.10.01.sh ;

Contents of job_farm_raxei_braker2_rna-seq_only_2020.07.10.01.sh:

#!/bin/bash -login
#SBATCH --nodes=1
#SBATCH --partition=bmm
#SBATCH --time=072:00:00
#SBATCH --cpus-per-task=8
#SBATCH --job-name=job_farm_raxei_braker2_rna-seq_only_2020.07.10.01.sh
#SBATCH --mem=256gb
#SBATCH --mail-type=ALL
#SBATCH [email protected]

# Begin in working directory of job:
cd $HOME/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01 ;

# Set up biopython environment with Python3, which part of BRAKER2 would require:
. $HOME/anaconda2/etc/profile.d/conda.sh ;
conda activate biopython_1.77 ;

# Also set up Boost libraries, which were used to do source compilation of AUGUSTUS et al.
module load boost/1.71.0 ;

# Now, set up a whole bunch of shell variables to direct BRAKER2 properly:
GENEMARK_PATH=/home/ems/src/gmes_linux_64/ ;
export GENEMARK_PATH ;

# Note that this particular config directory is local:
AUGUSTUS_CONFIG_PATH=/home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/local_augustus_config/ ;
export AUGUSTUS_CONFIG_PATH ;

# These two AUGUSTUS directories are from source code:
AUGUSTUS_BIN_PATH=/home/ems/src/augustus-3.3.3/bin/ ;
export AUGUSTUS_BIN_PATH ;

# Note that these scripts were copied from a github version on 7/10/2020:
AUGUSTUS_SCRIPTS_PATH=/home/ems/src/augustus-3.3.3/scripts/ ;
export AUGUSTUS_SCRIPTS_PATH ;

# Set up a lot more paths; each of the following were downloaded from github and compiled from source code.
BAMTOOLS_PATH=/home/ems/src/bamtools_2020.07.09/bin/ ;
export BAMTOOLS_PATH ;
DIAMOND_PATH=/home/ems/src/diamond-linux64/bin/ ;
export DIAMOND_PATH ;
SAMTOOLS_PATH=/home/ems/src/samtools_2020.07.09/bin/ ;
export SAMTOOLS_PATH ;
CDBTOOLS_PATH=/home/ems/src/cdbfasta_2020.07.09/bin/ ;
export CDBTOOLS_PATH ;
ALIGNMENT_TOOL_PATH=/home/ems/src/ProtHint_2020.07.09/bin/ ;
export ALIGNMENT_TOOL_PATH ;
MAKEHUB_PATH=/home/ems/src/MakeHub_2020.07.09/ ;
export MAKEHUB_PATH ;

# Add BRAKER2's scripts to the PATH:
PATH=/home/ems/src/BRAKER2_2020.07.09/scripts:$PATH ;
export PATH ;

# Now start the actual BRAKER2 run:
braker.pl --cores 8 --crf --gff3 --species=Raxei_rnaseq_2020.07.10.01 \
--softmasking \
--workingdir=/home/ems/rhabditella/prelim_preds_2020.07/braker2/rna_pred_01/ \
--augustus_args="--strand=both --genemodel=partial --noInFrameStop=true --singlestrand=false \
--maxtracks=3 --alternatives-from-sampling=true --alternatives-from-evidence=true \
--minexonintronprob=0.1 --minmeanexonintronprob=0.4 --uniqueGeneId=true --protein=on \
--introns=on --start=on --stop=on --cds=on --codingseq=on --UTR=off --progress=true \
--gff3=on --outfile=raxei_rnaseq_2020.07.10.01.aug" \
--bam=/home/ems/rhabditella/prelim_preds_2020.07/braker2/evidence/raxei_rnaseq_all_khmer.hisat2.Raxei_Eur_v1.sorted.bam \
--genome=/home/ems/rhabditella/assemblies/Raxei_Eur_v1.smask.no_comms.fa  ;

# Close things down:
module unload boost/1.71.0 ;
conda deactivate ;

SchwarzEM avatar Jul 11 '20 18:07 SchwarzEM

One other comment:

Doing everything as before, but with only one CPU instead of eight CPUs (braker.pl --cores 1 instead of braker.pl --cores 8), I was able to get BRAKER2 to complete a successful RNA-seq-guided run on my genome!

So, the problem here is not that I have set up the software so poorly that it cannot run at all. The problem seems to be that, no matter how hard I try to get it right, I cannot get BRAKER2 to handle multi-CPU operations; as far as I can tell, the problem arises when BRAKER2 needs to produce a single aggregated hint file (augustus.tmp1.gff) from a great many smaller, parallel-generated hint files.

SchwarzEM avatar Jul 12 '20 22:07 SchwarzEM

Hello @SchwarzEM,

thank you for this report.

We've recently made some changes to BRAKER2 code, including the section related to multi-CPU operations.

If you have time in the future, could you try the new code to see if the problem persists? Also, could you try to run the test1.sh in the example folder to see if you get the same error?

Thanks! Tomas

tomasbruna avatar Aug 07 '20 20:08 tomasbruna

Hi @tomasbruna,

Thank you for updating me on the bugfixes!

I will be using BRAKER2 again in the near future, which should give me a convenient opportunity to see if multi-CPU operation works in my hands with the newest BRAKER2 code.

Best,

--Erich

SchwarzEM avatar Aug 08 '20 00:08 SchwarzEM

To anyone who had the same error and is looking through the Issues for a solution -

I was getting the same error but it was because my .hints files were not being generated correctly. I had a semicolon in my scaffold names which was messing up the string-parsing in some parts. Once I took the semicolon out of all of my scaffold names, it ran just fine on 8, 24, and 48 cores.
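
A minimal sketch of how one might look for such names before starting a run. Both function names are hypothetical, and the "safe" character set (letters, digits, underscore, dot, hyphen) is an assumption chosen for illustration; the only character actually reported as a problem here is the semicolon.

```shell
# Hypothetical helpers, not part of BRAKER2.

# Print every sequence name (first word of a FASTA header) that contains a
# character outside letters, digits, underscore, dot, and hyphen.
risky_scaffold_names() {
    awk '/^>/ { name = substr($1, 2); if (name ~ /[^A-Za-z0-9_.-]/) print name }' "$@"
}

# Replace semicolons with underscores in header lines only;
# sequence lines are passed through unchanged.
sanitize_fasta_names() {
    sed '/^>/ s/;/_/g' "$@"
}
```

Note that renaming scaffolds in the genome file means the BAM file must use the same new names, so the reads would need to be re-aligned (or the BAM header rewritten) to match.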

I also could only get it to run using STAR for alignment; when I used bowtie2 I kept on getting the problem addressed in the first bullet of the common problems section of the documentation, even after removing all non-alphanumerical characters from my scaffold names and shortening them.

jdavidpeery avatar Jun 17 '21 21:06 jdavidpeery

So, I've never put a semicolon in a scaffold name in my entire life.

I have, however, often put in underscores. Underscores are considered generally acceptable in UNIX-type names as a way to space out text blocks and make them more readable.

Does BRAKER2 have the same string-parsing problem with underscores that it has with semicolons? Because that would explain why I kept having this problem, despite no semicolons.

Also, if BRAKER2 is intended to work only with STAR as an RNA-seq read mapper, and is not intended to work with many other mainstream RNA-seq read mappers such as bowtie2, this fact should at least be made more prominent in its documentation. (If inability to work with bowtie2 is not meant to be a feature, then hopefully such inability will be debugged in the not too remote future...)

SchwarzEM avatar Jun 17 '21 21:06 SchwarzEM

So from this post - https://github.com/Gaius-Augustus/BRAKER/issues/44 it looks like BRAKER2 only works with "mappers that produced spliced alignments ... (Bowtie is not made for this.)". I agree that it should be more prominent in the documentation that BRAKER2 will not work with mappers that don't produce spliced alignments, such as bowtie2. However, that post also indicates that it shouldn't be STAR-specific, and should work with other aligners such as GSNAP, Hisat2 etc.
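
One rough way to see whether a given BAM file contains spliced alignments at all is to count records whose CIGAR string includes an N operation (a skipped region of the reference, which is how RNA-seq aligners represent introns). This is only a sketch; count_spliced is a hypothetical name, and it reads SAM text on standard input so it can be fed from samtools view.

```shell
# Hypothetical helper, not part of BRAKER2 or samtools.
# Reads SAM text on stdin, skips @ header lines, and reports how many
# alignment records have an N operation in the CIGAR string (column 6).
count_spliced() {
    awk -F'\t' '
        !/^@/ { total++; if ($6 ~ /N/) spliced++ }
        END   { printf "%d spliced of %d\n", spliced, total }
    '
}
```

For example, `samtools view reads.bam | count_spliced` (where reads.bam stands for your own alignment file). A count of zero spliced records would suggest that the mapper did not produce spliced alignments.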

As far as underscores in the scaffold names go, I'm not sure, hopefully the developers have a better answer.

jdavidpeery avatar Jun 18 '21 15:06 jdavidpeery

Underscores in scaffold names should not cause any problems in BRAKER.

KatharinaHoff avatar Jun 19 '21 12:06 KatharinaHoff

Hello developers,

I experience the same problem as @SchwarzEM. I ran all the braker2 tests and all of them ran successfully. I am also able to run braker2 with "default" settings (using genome, rnaseq data and also proteomes), however when I am trying to run braker2 with different Augustus options like this:

--augustus_args="--strand=both --genemodel=partial --noInFrameStop=true \
--maxtracks=3 --alternatives-from-sampling=true --alternatives-from-evidence=true \
--minexonintronprob=0.1 --minmeanexonintronprob=0.4 --uniqueGeneId=true --protein=on \
--introns=on --start=on --stop=on --cds=on --codingseq=on --UTR=off --progress=true \
--gff3=on --outfile=augustus.hints.09.10.2021.aug"

I get the same error in the same step. Braker2 just freezes and ends immediately during this step and there is no error or warning message anywhere (slurm manager, log, braker2 log).

Concatenating AUGUSTUS output files in /ocean/projects/mcb190015p/ilikvlad/necator/necator_gene_predict_demo_08.03.2021/tmp/braker/augustus_tmp
	perl /ocean/projects/mcb190015p/ilikvlad/anaconda3/envs/augustus_3.4.0/bin/join_aug_pred.pl < /ocean/projects/mcb190015p/ilikvlad/necator/necator_gene_predict_demo_08.03.2021/tmp/braker/augustus.tmp1.gff > /ocean/projects/mcb190015p/ilikvlad/necator/necator_gene_predict_demo_08.03.2021/tmp/braker/augustus.hints.gff

I wonder, could it be a problem of memory usage or a multi-threading issue, as I usually use many cores (32 or 48) for running braker2 and the script just can't handle multi-threading? I tried to run it using only 1 CPU as @SchwarzEM did; however, I can't reach this step, as the job times out before it reaches this phase.

Any advice on this problem?

Best,

Vladislav

ilikvlad avatar Sep 27 '21 19:09 ilikvlad

Hi Vladislav,

I've successfully run BRAKER with modified AUGUSTUS settings before (which were somewhat similar to yours).

I wonder, could it be a problem of memory usage or a multi-threading issue

This could be an issue because AUGUSTUS used much more memory during my modified runs (due to --alternatives-from-sampling=true) than in the default BRAKER mode.

I tried to run it using only 1 CPU as @SchwarzEM did; however, I can't reach this step, as the job times out before it reaches this phase.

You could try your modified run with the example data. The example input is small, it should get to the problematic stage within hours even with just 1 CPU.

Also, did you try running BRAKER without specifying --outfile=augustus.hints.09.10.2021.aug in the --augustus_args? I wonder whether that might be causing any issues.

Best, Tomas

tomasbruna avatar Oct 12 '21 15:10 tomasbruna

Hello Tomáš,

thank you for your response and suggestions on this issue. I ran a bunch of tests, trying multiple braker2 runs with different augustus settings, and the results are as follows:

You could try your modified run with the example data. The example input is small, it should get to the problematic stage within hours even with just 1 CPU.

When I ran it using the example data, braker2 was able to finish the run with all my augustus options; however, it did not provide a *.gff3 file, only a *.gtf file, and after close inspection I could not find any AUGUSTUS predictions in the table at all, which suggests that braker2 was not able to handle the augustus output and merge it with the genemark output properly. Also, braker.log contains evidence of braker2 deleting the *.gff3 file because it was empty (I attached the braker.log for your closer inspection).

Also, did you try running BRAKER without specifying --outfile=augustus.hints.09.10.2021.aug in the --augustus_args? I wonder whether that might be causing any issues.

I did some of these runs, and again braker2 run was able to finish properly, however without *.gff3 output (deleting empty *.gff3 file) and with genemark predictions only.

This could be an issue because AUGUSTUS used much more memory during my modified runs (due to --alternatives-from-sampling=true) than in the default BRAKER mode.

You were right: after specifying --alternatives-from-sampling=false and not specifying --outfile=augustus.hints.09.10.2021.aug, braker2 had a successful run even for our huge RNAseq dataset. However, as in the previous cases, neither *.gff3 output nor augustus predictions were generated by the run.

To summarize, I was able to make progress in running braker2 even with our data and to complete whole runs; however, it (or I) still cannot handle different augustus options or merge the augustus output with the genemark output into a *.gff3 file.

Any advice? My braker.log is attached below.

Many thanks, Best,

Vladislav

braker.log

ilikvlad avatar Oct 22 '21 18:10 ilikvlad

Hi Vladislav,

the absence of a .gff3 file is not a concern. We are working on making this better, but in the meantime, you can always convert the gtf output to gff3 with GenomeTools, like this:

gt gtf_to_gff3 <(grep -P "\tCDS\t|\texon\t" braker.gtf ) > braker.gff

On the other hand, the absence of any AUGUSTUS results in braker.gtf is definitely a big concern. Is there an output file called augustus.hints.gtf with just AUGUSTUS predictions? If not, I will try to run the same example you did to figure out what's going on (to determine which additional AUGUSTUS option is not compatible with BRAKER).
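
A quick way to check which predictors contributed to a GTF file is to tally its second column (the source field). The sketch below assumes a standard nine-column, tab-separated GTF; count_by_source is a hypothetical name, and the exact source labels in your file may differ from what you expect.

```shell
# Hypothetical helper, not part of BRAKER2.
# Tally feature lines by the source field (column 2) of one or more
# GTF/GFF files, ignoring comment lines and lines with fewer than 9 columns.
count_by_source() {
    awk -F'\t' '
        !/^#/ && NF >= 9 { n[$2]++ }
        END { for (s in n) printf "%s\t%d\n", s, n[s] }
    ' "$@" | sort
}
```

Running `count_by_source braker.gtf` and seeing no AUGUSTUS source at all would confirm that only the GeneMark predictions made it into the final file.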

Best, Tomas

tomasbruna avatar Oct 22 '21 20:10 tomasbruna

Hi Tomas,

I did some additional braker2 runs with our/example datasets and I think I finally found where the problem is.

  1. run: example dataset + augustus options + 1CPU only - completed run with desired output
  2. run: example dataset + augustus options + multiple CPUs - not completed run, no *.gff3 output, no augustus predictions
  3. run: our dataset (RNAseq downsampled to 20 %) + augustus options + 1CPU only - not completed - ran out of memory
  4. run: our dataset (RNAseq downsampled to 5 %) + augustus options + 1CPU only - not completed - timed-out, however it seemed it was running properly, just when using 1CPU, it takes a lot of time, and on the cluster we use, max. is 24 hours.
  5. run: our dataset + default options + multiple cores - completed run with desired output (however no advanced augustus predictions)
  6. run: our dataset + genomic mode + multiple cores - completed run with desired output

So it seems that braker2 runs properly with only 1 CPU and is not able to handle multiple CPUs with advanced augustus parameters. However, we can't use 1 CPU for our data: we have quite a large RNAseq dataset, and even when it is downsampled to a small fraction, the *.bam file is much bigger than the example dataset and the run either runs out of memory or times out.

Any advice on this?

Best,

Vladislav

ilikvlad avatar Nov 20 '21 13:11 ilikvlad

The AUGUSTUS options that you are using mainly play a role in the prediction step, not for training. You could run a standard training without the parameters in question. This will produce species-specific parameters. Then you split your large genome into chunks, e.g. bundles of scaffolds, or along chromosomes. Run BRAKER without training using the previously trained parameters on your sequence chunks.
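
The splitting step could be sketched as follows. This is only an illustration: split_fasta is a hypothetical name, and sequences are dealt out round-robin, which keeps each scaffold whole but does not balance the chunks by size.

```shell
# Hypothetical helper, not part of BRAKER2.
# Usage: split_fasta GENOME N PREFIX
# Distributes the sequences of GENOME round-robin into
# PREFIX.1.fa ... PREFIX.N.fa, never splitting a scaffold.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/      { out = prefix "." (count % n + 1) ".fa"; count++ }
        out != "" { print > out }
    ' "$1"
}
```

For example, `split_fasta genome.fa 8 chunk` would write chunk.1.fa through chunk.8.fa. Each chunk can then be given to braker.pl together with the already trained species; the BRAKER README describes options such as --skipAllTraining and --useexisting for reusing existing parameters, but check which options your BRAKER version provides.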

When merging results, you need to be careful: every run will have a "g1.t1", i.e. you have to increment the gene names while merging results.
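
As a simpler alternative to incrementing the numbers, each run's IDs can be given a distinct prefix before the files are concatenated. The sketch below is not an official BRAKER tool; prefix_gene_ids is a hypothetical name, and it assumes AUGUSTUS-style IDs of the form g<number> and g<number>.t<number> in column 9. IDs in any other style are left unchanged and would need separate handling.

```shell
# Hypothetical helper, not part of BRAKER2 or AUGUSTUS.
# Usage: prefix_gene_ids TAG FILE...
# Rewrites AUGUSTUS-style IDs in column 9 only, e.g. g1.t1 -> TAG_g1.t1.
# TAG should be plain letters/digits (no "&" or backslashes).
prefix_gene_ids() {
    tag=$1
    shift
    awk -F'\t' -v OFS='\t' -v tag="$tag" '
        /^#/ || NF < 9 { print; next }                   # pass comments through
        { gsub(/g[0-9]+/, tag "_&", $9); print }         # "&" = the matched ID
    ' "$@"
}
```

For instance, `prefix_gene_ids chunk1 run1/augustus.hints.gtf > merged.gtf` followed by `prefix_gene_ids chunk2 run2/augustus.hints.gtf >> merged.gtf` (file paths here are placeholders) would keep the two runs' g1.t1 entries distinct.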

KatharinaHoff avatar Jan 07 '22 12:01 KatharinaHoff