the effect of the --species parameter
In an older issue request (https://github.com/Gaius-Augustus/BRAKER/issues/349) i asked about getting different BRAKER output when running the same BRAKER script >1 times. From the reply I learned that older versions of BRAKER (like what I used) have stochasticity in them that makes this happen, but v2.1.6 eliminated it.
Using v2.16 I tested this and was dismayed at first to find that two 'identical' runs once again did NOT yield the same number of augustus.hints.aa sequences. But when I looked closer I realized I was actually changing one crucial thing all along.
When I try to run the same script twice, I get a warning that the 'species' parameter either needs to be respecified by using a new name, or re-used by employing the --useexisting parameter. Previously I had been simply re-specifying the --species file (e.g. changing it from 'TvApr21' to 'TvApr22' -- my species does not exist in the list supplied in augustus/config so I have to name it) . Sure enough, when I used --useexisting instead, the number of hints was identical from run to run.
My question here is what is actually supplied by the 'species' parameter, and how/why does it change from run to run (unless you use --useexisting) sufficiently to alter the output when everything else is the same?
[These BRAKER runs have all been in 'additional pipeline A or C' modes, as per the user manual, i.e., they use RNA-seq and closely-related protein evidence.]
Hello,
I understand the confusion because identical runs in which only the "--species" name changed should indeed lead to identical results, that was the point of the fix in the new v2.1.6 version.
Can you share the braker.log files from the two runs that lead to different results? Maybe I'll be able to see what's going on from those.
Also, I wonder whether this problem exists only in the additional pipelines. Did you have a chance to run one of the "regular" pipelines twice to see whether you are also getting different results?
Thanks, Tomas
I have not run a 'regular' pipeline with v 2.1.6 yet. I have only run additional pipeline A (several times) and , just once , additional pipeline C (I haven't checked its output yet).
Here are two braker2 logs that used additional pipeline A. The second one (Apr21_3) differs from the first (Apr21_1) in three respects : adding --gff3 (which has no effect, I verified this in other runs) , different output directory (but this too has no effect as verified in other runs), and changing --species .
count of augustus.hints.aa sequences: braker2_Apr21_1/augustus.hints.aa:44547 (braker2_Apr21_2/augustus.hints.aa:44547 <--useexisting employed in this run, along with --gff3. If you need braker.log, let me know) braker2_Apr21_3/augustus.hints.aa:42475
Thanks for sharing the files. From the logs, it looks like the difference occurs around the point of removing non-redundant genes from the training set. Until then, the runs seem to be identical.
If you get a chance, can you share the following files from both runs?
- traingenes.good.gtf
- traingenes.good.fa
- traingenes.good.nr.fa
You will need to run BRAKER with the --nocleanup option, otherwise, these files get deleted after BRAKER finishes. To get the files fast, you can use the option --geneMarkGtf=/Previous/run/GeneMark-ET/genemark.gtf to supply an existing GeneMark prediction from a previous run. Also, there is no need to wait until the AUGUSTUS training finishes, you can stop the process once you see those files were generated. Please also include the new braker logs, so that I can verify the difference occurred again in these new runs.
Thanks for helping to debug this.
Best, Tomas
I ran the script for Aug21_1 again, with --useexisting and --nocleanup. None of these files were in output when it finished:
traingenes.good.gtf traingenes.good.fa traingenes.good.nr.fa
The only files with remotely similar names were these:
$ find . -name "*train*"
./GeneMark-ET/info/training.trace
./GeneMark-ET/info/training.general
./GeneMark-ET/data/training.fna
./GeneMark-ET/data/et_training.gff
./GeneMark-ET/data/training
Here is this braker.log
Hello,
the AUGUSTUS training is skipped with --useexisting flag, the parameter set specified by --useexisting is used instead. That's also why the results do not differ -- whatever the stochasticity is, it happens during AUGUSTUS training.
Please try a new run, with a new --species name.
Tomas
OK. I started fresh with two 'debug' runs. The only differences between them were:
- in debug1, the --species=Apr26; in debug2 --species=Apr26_2
- the working_dirs were different
All of these other parameters were identical (they are as per additional pipeline A, as before, with --nocleanup added): --genome --prot_seq --prg --bam --softmasking --cores --gff3 --nocleanup
I have included the braker.log files along with these requested files: traingenes.good.gtf traingenes.good.fa traingenes.good.nr.fa
The respective # of augustus.hints.aa sequences are
> $ grep -c ">" braker2_Apr26_debug*/augustus.hints.aa
braker2_Apr26_debug1/augustus.hints.aa:43448
braker2_Apr26_debug2/augustus.hints.aa:42735
Thanks, @krabapple, I think we identified the point where the stochasticity happens, it seems to be during the translation of the training set right before making the non-redundant protein set.
I will look into this when I get a chance, hopefully by the end of this week.
That's good news. But can you explain one more thing? The first BRAKER2 v2.1.6 run I did using this genome and RNA and protein input set, in additional pipeline A mode, output >44,500 protein sequences. Five subsequent runs using the same parameters, but changing --species=, not only fluctuated (the issue you are investigating) but also never approached 44,500 again. Is this just chance or is something else causing it?
| Run | Species | AA hints |
|---|---|---|
| braker2_Apr21_1 | --species=Apr21 | 44547 |
| braker2_Apr21_3 | --species=Apr21_3 | 42475 |
| braker2_Apr21_4 | --species=Apr21_1 | 43396 |
| braker2_Apr22_1 | --species=Apr22 | 43167 |
| braker2_Apr26_debug1 | --species=Apr26 | 43448 |
| braker2_Apr26_debug2 | --species=Apr26_2 | 42735 |
`
I think this is just a chance. I would suspect that the fluctuating predictions are mostly the ones predicted purely ab initio, i.e. without support by external RNA or protein evidence.
So that means I should expect different output if I simply change the --species name ?
Do you have any insight why repeating the run (changing only the --species name) never produced as many augustus.aa.hints as the first one? Chance?
On Sun, May 2, 2021, 1:21 PM Tomáš Brůna @.***> wrote:
I think this is just a chance. I would suspect that the fluctuating predictions are mostly the ones predicted purely ab initio, i.e. without support by external RNA or protein evidence.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/359#issuecomment-830841929, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7U5UK47D6LUMU2LTOTB3TLWCX7ANCNFSM43PAR3DQ .
Do you have any insight why repeating the run (changing only the --species name) never produced as many augustus.aa.hints as the first one? Chance?
Yes, my best explanation is chance.
So that means I should expect different output if I simply change the --species name ?
Using a different species name starts a new AUGUSTUS training. The training involves splitting the full training gene set into smaller training subsets. This splitting should be deterministic, but it clearly is not in your case, and this changes the way AUGUSTUS model is trained each time.
Is it confirmed to be deterministic in other cases where only --species is changed?
On Sun, May 2, 2021, 3:29 PM Tomáš Brůna @.***> wrote:
Do you have any insight why repeating the run (changing only the --species name) never produced as many augustus.aa.hints as the first one? Chance?
Yes, my best explanation is chance.
So that means I should expect different output if I simply change the --species name ?
Using a different species name starts a new AUGUSTUS training. The training involves splitting the full training gene set into smaller training subsets. This splitting should be deterministic, but it clearly is not in your case, and this changes the way AUGUSTUS model is trained each time.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/359#issuecomment-830859276, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7U5XQXVU2D5DLZEXE3XDTLWRX3ANCNFSM43PAR3DQ .
I think it depends on a specific system setup. For example, the stochasticity I fixed recently was occurring only on some Linux systems.
This is probably the case here too because I never observed such fluctuations on our system during BRAKER testing, i.e. BRAKER is deterministic for me when only --species is changed.
Thanks to your help, I now know where in the code this happens and I'll be trying to reproduce the randomness on other machines, to fix it.
Which output file contains information telling me which predictions are purely ab initio, and which were supported by evidence?
This information is not yet in any of the outputs, but there is a way to get it with a script.
The script is called selectSupportedSubsets.py and it's located in a separate branch here https://github.com/Gaius-Augustus/BRAKER/tree/report/scripts/predictionAnalysis. To run it, you will need to have the predictionAnalysis.py file saved in the same folder.
The help message (selectSupportedSubsets.py --help) explains what the different supported sets mean.
What system(s) do you use that did NOT show the stochastic behavior?
I am mostly testing BRAKER2 on Red Hat 6.10 and it is stable there.