beast2vs1 test for EBSP fails
The posterior check for testEBSP.xml fails in TreePriorTest. At some point it must have passed. When did this change, and why? This might be a symptom of a bug introduced at some point. Finding the commit that made the test fail would be a good starting point.
Hi Remco, after investigating, it seems the expectations for EBSP were simply wrong. I vaguely remember the test was still in development and may have been excluded from the test suite in the previous Ant setup; I cannot figure out when it became active. I apologise for this past mistake.
So I ran testEBSP.xml using BEAST 1.8.4 and corrected all the expectations in the EBSP test using the attached log. But there still seems to be a gap between the two posteriors:
Expected posterior delta mean: |-3841.2851 - (-3762.515419489472)| <= delta stdErr: 2 * (8.559 + 3.0511764457087116)
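As I read the failure message, the acceptance criterion is that the absolute difference of the two posterior means must be within twice the sum of their standard errors. A minimal sketch of that check (the method name is mine, not TreePriorTest's actual API):

```java
// Hedged sketch of the beast2vs1 acceptance criterion as read from the
// failure message above; withinTolerance is a hypothetical name, not
// part of the actual test framework.
static boolean withinTolerance(double mean1, double stdErr1,
                               double mean2, double stdErr2) {
    return Math.abs(mean1 - mean2) <= 2.0 * (stdErr1 + stdErr2);
}
```

Plugging in the numbers above: |-3841.2851 - (-3762.5154)| ≈ 78.77, while 2 * (8.559 + 3.0512) ≈ 23.22, so the check fails by a wide margin.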
Two logs are attached:
- BEAST 1: b1.tar.gz (testEBSP.xml)
- BEAST 2: b2.tar.gz (testEBSP.xml)
There are 5 more parameters in other tests that did not pass, as attached: results.tar.gz
There were a few problems running BEAST 1:
- I could not use BEAST 1.10.* on my Mac: it forces the use of BEAGLE 3, but throws an exception when using it.
- I had to downgrade to 1.8.4 and run without BEAGLE.
- testEBSP.xml in BEAST 1 seems difficult to converge: after I increased the chain length to 4 times the original, the ESS of the posterior was still 85 in the attached log. I am running a 100M-state analysis; it should hopefully finish tonight.
I also fixed the parameter names.
I updated the two XMLs to use the HCV dataset. There have been some significant changes in BEAST 1; it looks unfamiliar to me now. I tried my best to make both XMLs comparable. I used `ploidy="2.0"` in beast1 and set `factor=2` in beast2. I could not get the Dirichlet prior working in BEAST 1.8, so I used a Uniform prior on the frequencies in both XMLs. Then I manually compared the prior and likelihood sections of both XMLs to make them similar.
The updated results and XMLs are attached. Please use 20% burn-in:

| posterior | beast2 | beast1 |
| --- | --- | --- |
| mean | -6867.1698 | -6897.0506 |
| stderr of mean | 1.3096 | 2.5939 |

|-6867.1698 - (-6897.0506)| = 29.8808 > (1.3096 + 2.5939) * 2 = 7.807
I just found that BEAST 2 by default does not use "Intervals Middle" but BEAST 1 does, which made the two models above different. I will re-run the tests tomorrow.
The new logs from long runs are attached. Both are supposed to use the same priors and models.
The tree stats are the same, but the coalescent likelihoods and popSizes are different, which may be caused by differences in the model implementations?
@walterxie it looks like the first population size and the population mean are consistent between the runs, but the subsequent population sizes are double in b1 compared to b2. Maybe the `factor=2` has something to do with that?
Has there been a time in the past when the test passed? Or did all commits since the beginning of time contain this faulty test? If there was a time when this test passed, then it must be possible to find when it stopped passing and investigate the changes in the commit that caused the change of behaviour.
@rbouckaert I may be wrong, but according to Joseph's tutorial, the `factor` looks equivalent to `ploidy` in BEAST 1. I will go through the code to see whether I can confirm this.
I suspect the EBSP model implementations in BEAST 1 and 2 may differ. Not only the popSizes but also the indicators are very different. I have tried to make the priors as similar as I could in the above tests; the posteriors of the tree stats and substitution model parameters look very similar.
This test was incomplete and should have been excluded from the Ant tests. I am not sure when it started being used.
I would change the `ploidy` to 1 to check whether that fixes the difference.
OK. I also found the EBSP analyser in both packages, but they require trees to be logged. I will try these two changes.
I have completely redesigned the tests as Alexei advised:
- use the HCV data set.
- use `factor=1` and `ploidy=1`. After reading both code bases, I can confirm they are the same parameter, used to scale the popSize (see the sketch after this list).
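To illustrate why I believe they are the same parameter: as far as I can tell from both code bases, the value enters the coalescent only as a constant multiplier on the population size. A toy sketch (my own illustration, not code from either BEAST):

```java
// Toy illustration (not BEAST code): ploidy in BEAST 1 and factor in
// BEAST 2 both appear to act as a constant scaling p of the population
// size, so the coalescent rate for k lineages becomes:
static double coalescentRate(int k, double popSize, double p) {
    return k * (k - 1) / 2.0 / (p * popSize);
}
```

With `factor=1` and `ploidy=1` the scaling drops out entirely, so any remaining difference must come from the EBSP machinery itself.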
I ran the analysers in both BEASTs, so the new logs, XMLs and plots are attached below.
I also combined the two plots into one, where the x-axis is "time before present", blue is beast1 and red is beast2: b1vs2.pdf
I think there are differences between BEAST 1 and 2, but I am not sure why they only affect the popSize near the present. I am also not sure whether the wider HPD interval near the present (beast1) is correct, or the narrower one (beast2).
Here is Joseph's plot https://bmcecolevol.biomedcentral.com/articles/10.1186/1471-2148-8-289/figures/7
I found that the difference between the BEAST 1 and BEAST 2 plots was caused by the `mixedDistributionLikelihood`, which exists only in the BEAST 1 XML's prior section:
```xml
<mixedDistributionLikelihood>
    <distribution0>
        <exponentialDistributionModel idref="demographic.populationMeanDist"/>
    </distribution0>
    <distribution1>
        <exponentialDistributionModel idref="demographic.populationMeanDist"/>
    </distribution1>
    <data>
        <parameter idref="demographic.popSize"/>
    </data>
    <indicators>
        <parameter idref="demographic.indicators"/>
    </indicators>
</mixedDistributionLikelihood>
```
BEAST 2 used:

```xml
<prior id="popSizePrior.alltrees" name="distribution" x="@popSizes.alltrees">
    <Exponential id="popPriorDist.EBSP" mean="@populationMean.alltrees" name="distr"/>
</prior>
```
I will try to make these two equivalent before continuing the comparison.
The fragment indicates it is possible to assign a different prior to `popSize` entries based on the `indicators`, but since `distribution0` and `distribution1` refer to the same distribution, it should still produce the same prior as the one in BEAST 2 (assuming an exponential with mean=populationMean is used). If there is a difference, it must be something else.
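To make the reduction explicit, here is a minimal sketch of what I understand `mixedDistributionLikelihood` to compute, i.e. a per-entry choice between the two densities driven by the indicators (the names and signature here are illustrative only, not the actual BEAST 1 implementation):

```java
import java.util.function.DoubleUnaryOperator;

// Illustrative sketch of a mixed-distribution likelihood: each popSize
// entry is scored under distribution0 or distribution1 depending on its
// indicator. Not the actual BEAST 1 code.
static double logMixedLikelihood(double[] popSizes, boolean[] indicators,
                                 DoubleUnaryOperator logDensity0,
                                 DoubleUnaryOperator logDensity1) {
    double logP = 0.0;
    for (int i = 0; i < popSizes.length; i++) {
        logP += indicators[i] ? logDensity1.applyAsDouble(popSizes[i])
                              : logDensity0.applyAsDouble(popSizes[i]);
    }
    return logP;
}
```

Since `distribution0` and `distribution1` point at the same exponential here, the indicator choice is a no-op and the sum reduces to the plain exponential prior BEAST 2 uses.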
The plot b1vs2.pdf shows only a very small difference in the uncertainty intervals at the beginning; the means are a very good match. Perhaps this has something to do with burn-in?
Yes, you are correct about the `mixedDistributionLikelihood`. I also double-checked the Java code and XMLs for this and the other priors, except the coalescent. They look the same, so the only part still unchecked is the EBSP coalescent.
If you look at the BEAST 1 log, the posterior mean of the popSize mean is about 8.8, and all 63 popSize means are between 7.4 and 9.2. But in the BEAST 2 log only the mean of the 1st popSize is around 8.8; the remaining 62 means are around 4. The BEAST 1 result looks more reasonable.
If the 1st popSize starts from the tips (the present), I suspect there may be a bug in the BEAST 2 EBSP likelihood calculation that halves the popSizes after the 1st interval towards the root. But if that assumption is true, I do not understand why the plots match when the time is away from the present. I will have a try.
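To show where such a halving would bite, here is a toy sketch of a piecewise-constant coalescent log-likelihood (my own illustration with constant factors omitted, not the actual BEAST 2 code):

```java
// Toy sketch (constants omitted) of a stepwise-coalescent log-likelihood.
// If popSize[i] were wrongly halved for i >= 1, the sampler could
// compensate by doubling the sampled popSizes after the first interval,
// which is the pattern seen in the logs above.
static double logCoalescent(double[] intervalLength, int[] lineageCount,
                            boolean[] endsInCoalescence, double[] popSize) {
    double logL = 0.0;
    for (int i = 0; i < intervalLength.length; i++) {
        double pairs = lineageCount[i] * (lineageCount[i] - 1) / 2.0;
        logL -= pairs * intervalLength[i] / popSize[i];  // waiting time term
        if (endsInCoalescence[i]) {
            logL -= Math.log(popSize[i]);                // coalescence density
        }
    }
    return logL;
}
```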
I ran another comparison test with new settings: stepwise, and not using mid points. BEAST 2's plot seems wrong, but BEAST 1's looks fine. The XMLs and logs are shared in the Dropbox.
Interestingly, values on the left-hand side of the graph are significantly different, but on the right-hand side they look compatible.
BEAST2 is definitely broken with these settings. The BEAST1 results are the familiar results for this data set.
@alexeid I found another potential problem as well: after I used the BEAST 1 analyser to plot the BEAST 2 log as you advised, the plot shape looked the same as BEAST 1's, but the mean and HPD intervals were different. I think the similar shape cannot confirm that the bug is only in BEAST 2's `beast.app.tools.EBSPAnalyser`.
In addition, their plotting mechanisms are totally different; I am not sure why compatibility was not considered at all. The difference is:
- The BEAST 1 analyser uses the original log and takes "popSize" and "indicators" as its input.
- BEAST 2 has to write an extra log, "EBSP.log" by default, which appears to contain the 1st popSize and the coalescent intervals; the BEAST 2 analyser takes this special log as its input. BEAST 1 does not produce such a log, so I cannot test the BEAST 1 run with the BEAST 2 analyser; otherwise that would be another good comparison.
Also, considering the difference and the strange pattern in the posterior distributions of the popSizes and indicators, I assume the bug could still be somewhere in `CompoundPopulationFunction`, perhaps.
I added two new XMLs, `examples/beast2vs1/beast1/testEBSP-b1.xml` and `examples/beast2vs1/testEBSP-b2.xml`, to replicate these problems. The previous two XMLs were renamed to `testEBSP-old.xml`.