BRAKER icon indicating copy to clipboard operation
BRAKER copied to clipboard

the effect of the --species parameter

Open krabapple opened this issue 5 years ago • 17 comments

In an older issue request (https://github.com/Gaius-Augustus/BRAKER/issues/349) i asked about getting different BRAKER output when running the same BRAKER script >1 times. From the reply I learned that older versions of BRAKER (like what I used) have stochasticity in them that makes this happen, but v2.1.6 eliminated it.

Using v2.16 I tested this and was dismayed at first to find that two 'identical' runs once again did NOT yield the same number of augustus.hints.aa sequences. But when I looked closer I realized I was actually changing one crucial thing all along.

When I try to run the same script twice, I get a warning that the 'species' parameter either needs to be respecified by using a new name, or re-used by employing the --useexisting parameter. Previously I had been simply re-specifying the --species file (e.g. changing it from 'TvApr21' to 'TvApr22' -- my species does not exist in the list supplied in augustus/config so I have to name it) . Sure enough, when I used --useexisting instead, the number of hints was identical from run to run.

My question here is what is actually supplied by the 'species' parameter, and how/why does it change from run to run (unless you use --useexisting) sufficiently to alter the output when everything else is the same?

[These BRAKER runs have all been in 'additional pipeline A or C' modes, as per the user manual, i.e., they use RNA-seq and closely-related protein evidence.]

krabapple avatar Apr 23 '21 17:04 krabapple

Hello,

I understand the confusion because identical runs in which only the "--species" name changed should indeed lead to identical results, that was the point of the fix in the new v2.1.6 version.

Can you share the braker.log files from the two runs that lead to different results? Maybe I'll be able to see what's going on from those.

Also, I wonder whether this problem exists only in the additional pipelines. Did you have a chance to run one of the "regular" pipelines twice to see whether you are also getting different results?

Thanks, Tomas

tomasbruna avatar Apr 23 '21 23:04 tomasbruna

I have not run a 'regular' pipeline with v 2.1.6 yet. I have only run additional pipeline A (several times) and , just once , additional pipeline C (I haven't checked its output yet).

Here are two braker2 logs that used additional pipeline A. The second one (Apr21_3) differs from the first (Apr21_1) in three respects : adding --gff3 (which has no effect, I verified this in other runs) , different output directory (but this too has no effect as verified in other runs), and changing --species .

brakerlogs.zip

count of augustus.hints.aa sequences: braker2_Apr21_1/augustus.hints.aa:44547 (braker2_Apr21_2/augustus.hints.aa:44547 <--useexisting employed in this run, along with --gff3. If you need braker.log, let me know) braker2_Apr21_3/augustus.hints.aa:42475

krabapple avatar Apr 24 '21 03:04 krabapple

Thanks for sharing the files. From the logs, it looks like the difference occurs around the point of removing non-redundant genes from the training set. Until then, the runs seem to be identical.

If you get a chance, can you share the following files from both runs?

  • traingenes.good.gtf
  • traingenes.good.fa
  • traingenes.good.nr.fa

You will need to run BRAKER with the --nocleanup option, otherwise, these files get deleted after BRAKER finishes. To get the files fast, you can use the option --geneMarkGtf=/Previous/run/GeneMark-ET/genemark.gtf to supply an existing GeneMark prediction from a previous run. Also, there is no need to wait until the AUGUSTUS training finishes, you can stop the process once you see those files were generated. Please also include the new braker logs, so that I can verify the difference occurred again in these new runs.

Thanks for helping to debug this.

Best, Tomas

tomasbruna avatar Apr 24 '21 04:04 tomasbruna

I ran the script for Aug21_1 again, with --useexisting and --nocleanup. None of these files were in output when it finished:

traingenes.good.gtf traingenes.good.fa traingenes.good.nr.fa

The only files with remotely similar names were these:

$ find . -name "*train*"
./GeneMark-ET/info/training.trace
./GeneMark-ET/info/training.general
./GeneMark-ET/data/training.fna
./GeneMark-ET/data/et_training.gff
./GeneMark-ET/data/training


Here is this braker.log

braker.log

krabapple avatar Apr 25 '21 01:04 krabapple

Hello,

the AUGUSTUS training is skipped with --useexisting flag, the parameter set specified by --useexisting is used instead. That's also why the results do not differ -- whatever the stochasticity is, it happens during AUGUSTUS training.

Please try a new run, with a new --species name.

Tomas

tomasbruna avatar Apr 26 '21 15:04 tomasbruna

OK. I started fresh with two 'debug' runs. The only differences between them were:

  1. in debug1, the --species=Apr26; in debug2 --species=Apr26_2
  2. the working_dirs were different

All of these other parameters were identical (they are as per additional pipeline A, as before, with --nocleanup added): --genome --prot_seq --prg --bam --softmasking --cores --gff3 --nocleanup

I have included the braker.log files along with these requested files: traingenes.good.gtf traingenes.good.fa traingenes.good.nr.fa

traingenes.zip

The respective # of augustus.hints.aa sequences are

> $ grep -c ">"  braker2_Apr26_debug*/augustus.hints.aa
braker2_Apr26_debug1/augustus.hints.aa:43448
braker2_Apr26_debug2/augustus.hints.aa:42735

krabapple avatar Apr 26 '21 21:04 krabapple

Thanks, @krabapple, I think we identified the point where the stochasticity happens, it seems to be during the translation of the training set right before making the non-redundant protein set.

I will look into this when I get a chance, hopefully by the end of this week.

tomasbruna avatar Apr 27 '21 00:04 tomasbruna

That's good news. But can you explain one more thing? The first BRAKER2 v2.1.6 run I did using this genome and RNA and protein input set, in additional pipeline A mode, output >44,500 protein sequences. Five subsequent runs using the same parameters, but changing --species=, not only fluctuated (the issue you are investigating) but also never approached 44,500 again. Is this just chance or is something else causing it?

Run Species AA hints
braker2_Apr21_1 --species=Apr21 44547
braker2_Apr21_3 --species=Apr21_3 42475
braker2_Apr21_4 --species=Apr21_1 43396
braker2_Apr22_1 --species=Apr22 43167
braker2_Apr26_debug1 --species=Apr26 43448
braker2_Apr26_debug2 --species=Apr26_2 42735

`

krabapple avatar Apr 27 '21 17:04 krabapple

I think this is just a chance. I would suspect that the fluctuating predictions are mostly the ones predicted purely ab initio, i.e. without support by external RNA or protein evidence.

tomasbruna avatar May 02 '21 17:05 tomasbruna

So that means I should expect different output if I simply change the --species name ?

Do you have any insight why repeating the run (changing only the --species name) never produced as many augustus.aa.hints as the first one? Chance?

On Sun, May 2, 2021, 1:21 PM Tomáš Brůna @.***> wrote:

I think this is just a chance. I would suspect that the fluctuating predictions are mostly the ones predicted purely ab initio, i.e. without support by external RNA or protein evidence.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/359#issuecomment-830841929, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7U5UK47D6LUMU2LTOTB3TLWCX7ANCNFSM43PAR3DQ .

krabapple avatar May 02 '21 19:05 krabapple

Do you have any insight why repeating the run (changing only the --species name) never produced as many augustus.aa.hints as the first one? Chance?

Yes, my best explanation is chance.

So that means I should expect different output if I simply change the --species name ?

Using a different species name starts a new AUGUSTUS training. The training involves splitting the full training gene set into smaller training subsets. This splitting should be deterministic, but it clearly is not in your case, and this changes the way AUGUSTUS model is trained each time.

tomasbruna avatar May 02 '21 19:05 tomasbruna

Is it confirmed to be deterministic in other cases where only --species is changed?

On Sun, May 2, 2021, 3:29 PM Tomáš Brůna @.***> wrote:

Do you have any insight why repeating the run (changing only the --species name) never produced as many augustus.aa.hints as the first one? Chance?

Yes, my best explanation is chance.

So that means I should expect different output if I simply change the --species name ?

Using a different species name starts a new AUGUSTUS training. The training involves splitting the full training gene set into smaller training subsets. This splitting should be deterministic, but it clearly is not in your case, and this changes the way AUGUSTUS model is trained each time.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/359#issuecomment-830859276, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7U5XQXVU2D5DLZEXE3XDTLWRX3ANCNFSM43PAR3DQ .

krabapple avatar May 02 '21 19:05 krabapple

I think it depends on a specific system setup. For example, the stochasticity I fixed recently was occurring only on some Linux systems.

This is probably the case here too because I never observed such fluctuations on our system during BRAKER testing, i.e. BRAKER is deterministic for me when only --species is changed.

Thanks to your help, I now know where in the code this happens and I'll be trying to reproduce the randomness on other machines, to fix it.

tomasbruna avatar May 02 '21 19:05 tomasbruna

Which output file contains information telling me which predictions are purely ab initio, and which were supported by evidence?

krabapple avatar May 02 '21 23:05 krabapple

This information is not yet in any of the outputs, but there is a way to get it with a script.

The script is called selectSupportedSubsets.py and it's located in a separate branch here https://github.com/Gaius-Augustus/BRAKER/tree/report/scripts/predictionAnalysis. To run it, you will need to have the predictionAnalysis.py file saved in the same folder.

The help message (selectSupportedSubsets.py --help) explains what the different supported sets mean.

tomasbruna avatar May 03 '21 20:05 tomasbruna

What system(s) do you use that did NOT show the stochastic behavior?

krabapple avatar May 05 '21 22:05 krabapple

I am mostly testing BRAKER2 on Red Hat 6.10 and it is stable there.

tomasbruna avatar May 06 '21 03:05 tomasbruna