best icon indicating copy to clipboard operation
best copied to clipboard

Sigma ≠ standard deviation ⇒ introduce Model v2

Open treszkai opened this issue 5 years ago • 0 comments

This PR fixes a bug in the model specification, which is present in the original publication and the R package as well.

The standard deviation of a t-distribution is not σ. Instead, variance = nu / (nu - 2) * sigma ** 2 if nu > 2, and variance = ∞ if 1 < nu <= 2. Therefore, any posterior sample nu <= 2 assigns a non-zero probability that the variance is ∞, which in real life is never the case.

Proof that SD is underestimated by the R package (Python output is essentially the same):

> require(BEST)
Loading required package: BEST
Loading required package: HDInterval
> N=1000
> testDataDrug = rt(N, df=4)  # sigma = 1
> sd(testDataDrug)
[1] 1.417539
> BESTout = BESTmcmc(testDataDrug)
Waiting for parallel processing to complete...done.
> mean(BESTout$sigma)
[1] 1.006396
> mean(sqrt(BESTout$nu / (BESTout$nu - 2)) * BESTout$sigma)
[1] 1.47081

Samples of nu close to 2 are also problematic, as they blow up the SD estimation. Therefore I changed the sampling of nu as such:

  • originally: ν ~ Exponential(1/29) + 1
  • now: ν ~ Exponential(1/27.5) + 2.5

( t(df=2.5) is very similar to t(df=2) when it's not extremely far from the mean.)


The other change concerns the sampling of σ.

  • Originally, σ was uniformly distributed between σ^ / 1000 and σ^ * 1000 (where σ^ is the sample standard deviation), meaning the prior probability of σ > σ^ was 1000 times that of σ < σ^. This caused an overestimation of σ at small sample sizes.
  • Now, log(σ) is distributed uniformly between log( σ^ / 1000 ) and log( σ^ * 1000 ).

As a consequence, reduced the displayed limits of the std.devs in plot_posterior and plot_all to 99.5% of the samples.

(This is part 3 of 4 pull requests that build on top of one another.)

treszkai avatar Sep 06 '19 17:09 treszkai