
differing results from identical BRAKER runs

Open krabapple opened this issue 3 years ago • 1 comments

I am wondering if this is normal BRAKER2 (v 2.1.4) behavior.

I had reason to run it multiple times using the same inputs and parameters. It was run in the mode that uses RNA-Seq and protein evidence from the same organism at both stages (i.e., --prot_seq= --prg=gth --bam= --gth2traingenes --softmasking).
The genome is a small-ish haploid eukaryotic genome (180 Mb), likely to be intron-sparse judging from pre-BRAKER data. Transposable elements were soft-masked before input. The protein input set was possibly too meager to be useful (533 proteins).

I found that the number of gene products predicted in augustus.hints.gtf increased in the first three BRAKER2 runs but decreased on the 4th, so I stopped there.

augustus_hints1.aa:23223
augustus_hints2.aa:23757
augustus_hints3.aa:21885
augustus_hints4.aa:19656

Across runs, these results include gene models duplicated exactly in at least two runs (n=19517), alternate gene models for the same gene start-stop coordinates, and genes found in only one run ('unique'). Based on BLASTP results, it is likely that some of these 'unique' genes should be considered real.
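For anyone wanting to reproduce this kind of comparison, here is a minimal Python sketch of how the three classes could be derived from the per-run GTF files. It is not how the numbers above were produced. It assumes that CDS lines carry a `transcript_id "..."` attribute in column 9, and the helper names `cds_signatures` and `classify` are made up for illustration. "alternate" here simply means a run-specific model that shares its outer CDS start and end with a different model.

```python
import re
from collections import defaultdict

# Assumption: CDS lines carry an attribute such as transcript_id "g1.t1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def cds_signatures(gtf_lines):
    """Return one (seqname, strand, CDS intervals) signature per transcript."""
    cds = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))
        cds[key].append((int(fields[3]), int(fields[4])))
    # Drop the transcript ID: IDs are not comparable between runs.
    return {(seq, strand, tuple(sorted(ivs))) for (seq, strand, _), ivs in cds.items()}


def classify(runs):
    """Label each signature 'shared', 'alternate' or 'unique'.

    runs maps a run name to the set returned by cds_signatures().
    """
    support = defaultdict(set)
    for name, signatures in runs.items():
        for sig in signatures:
            support[sig].add(name)

    # Group distinct models by their outer CDS coordinates.
    by_span = defaultdict(set)
    for seq, strand, ivs in support:
        by_span[(seq, strand, ivs[0][0], ivs[-1][1])].add((seq, strand, ivs))

    labels = {}
    for sig, names in support.items():
        seq, strand, ivs = sig
        span = (seq, strand, ivs[0][0], ivs[-1][1])
        if len(names) >= 2:
            labels[sig] = "shared"      # identical model in two or more runs
        elif len(by_span[span]) > 1:
            labels[sig] = "alternate"   # same outer coordinates, different structure
        else:
            labels[sig] = "unique"      # seen in exactly one run
    return labels
```

Each run's file can be passed in as an open file handle, for example `runs["run1"] = cds_signatures(open(path_to_run1_gtf))`, where the path is whatever location the run wrote its augustus.hints.gtf to.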

Is this normal?

And if so, should one run BRAKER multiple times, to try to 'exhaust' the true gene space of the genome?

krabapple, Apr 05 '21 15:04

Hello,

this is normal. In the 2.1.4 version, there is some stochasticity in the training of AUGUSTUS. Different training runs (with the same inputs) can result in slightly different statistical models which in turn yield different predictions.

This training randomness was removed in the last release (v2.1.6) -- current BRAKER always outputs the same prediction when identical inputs are used.
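A quick way to check this on your own data is to compare the prediction files from two or more runs. The Python sketch below hashes each file while skipping lines that start with `#`, on the assumption (not verified here) that comment lines might carry run-specific details such as paths or timestamps. The function names are made up for illustration.

```python
import hashlib


def content_digest(path):
    """SHA-256 of a file's contents, ignoring lines that start with '#'."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"#"):
                continue  # assumption: comments may differ between runs
            digest.update(line)
    return digest.hexdigest()


def runs_identical(paths):
    """True if all prediction files have the same non-comment content."""
    return len({content_digest(path) for path in paths}) == 1
```

Note that this is a byte-level comparison, so it would also report a difference if the same gene models were written in a different order or with different gene IDs.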

> And if so, should one run BRAKER multiple times, to try to 'exhaust' the true gene space of the genome?

If you want to maximize prediction sensitivity, you could do that. The fluctuations will most likely be in regions in which there is little or no external evidence available (RNA-Seq or proteins), so these unique predictions are in general less reliable.
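If you do pool several runs, one rough way to triage the run-specific predictions is to check whether each one overlaps any external evidence at all. The Python sketch below assumes you have already extracted evidence intervals as `(seqname, start, end)` tuples, for instance from the hints file of your run. That parsing step, and the names `build_index` and `has_evidence`, are illustrative only and not part of BRAKER.

```python
from bisect import bisect_right


def build_index(hints):
    """Index evidence intervals per sequence for fast overlap queries.

    hints: iterable of (seqname, start, end), 1-based inclusive coordinates.
    """
    per_seq = {}
    for seq, start, end in hints:
        per_seq.setdefault(seq, []).append((start, end))

    index = {}
    for seq, intervals in per_seq.items():
        intervals.sort()
        starts = [start for start, _ in intervals]
        # max_ends[i] is the largest end among the first i + 1 intervals.
        max_ends, running = [], 0
        for _, end in intervals:
            running = max(running, end)
            max_ends.append(running)
        index[seq] = (starts, max_ends)
    return index


def has_evidence(index, seq, start, end):
    """True if [start, end] overlaps at least one indexed evidence interval."""
    if seq not in index:
        return False
    starts, max_ends = index[seq]
    # Count intervals that begin at or before the end of the query.
    count = bisect_right(starts, end)
    # One of them overlaps if any reaches back to the query start.
    return count > 0 and max_ends[count - 1] >= start
```

A prediction seen in only one run and with no overlapping evidence would then be the least trustworthy category. Overlap alone is a weak criterion, though, since it does not check that the hint actually agrees with the predicted exon-intron structure.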

Best, Tomas

tomasbruna, Apr 05 '21 17:04