GaussianMixtures.jl
Invalid weights under repeated data
If you have repeated data, you can end up with NaNs in the mixture weights.
using GaussianMixtures

data = Array{Float64}(undef, 110, 1)  # one variable
data[1:100] = randn(100)
data[101:110] = randn(10) .+ 10.0     # well-separated second cluster
model = GaussianMixtures.GMM(2, data)
println(model)

data[101:110] = fill(10.0, 10)        # ten identical observations
model = GaussianMixtures.GMM(2, data)
println(model)
Results in
GaussianMixtures.GMM{Float64,Array{Float64,2}}(2,1,[0.09090909090969347,0.9090909090903065],[9.463489092287428; 0.19884123133198583],[0.9811041259003161; 0.9898456887418752],[,,,,,,,,,,,,],110)
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
GaussianMixtures.GMM{Float64,Array{Float64,2}}(2,1,[NaN,NaN],[0.0; 0.0],[1.0; 1.0],[,,,,,,,,,,,,,,,,,,,,,,],110)
It looks like `varfloor` is a parameter in `em!`, but it is not exposed through `GMM`.
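Until `varfloor` is user-settable, one possible workaround (my assumption here, not a GaussianMixtures.jl feature) is to break the exact ties with a tiny jitter, so no single component can end up with zero sample variance:

```julia
using GaussianMixtures

# Hypothetical workaround, not part of the GaussianMixtures.jl API:
# perturb exactly-repeated observations so no component can collapse
# to zero variance during EM.
data = randn(110, 1)
data[101:110] .= 10.0                        # ten identical observations
jittered = data .+ 1e-8 .* randn(size(data))
model = GMM(2, jittered)                     # should now train without NaNs
```

The jitter magnitude (1e-8 here) should be small relative to the data scale so the fitted means are essentially unchanged.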
One problem is that, in `em!`, the line `tooSmall = any(gmm.Σ .< varfloor, 2)` will not find the offending `NaN` values. Also, it looks like `N` and `F` from `stats()` are all `NaN` as well, so the mean is also `NaN`.
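The reason the floor check misses is that `NaN` compares false against everything, so a plain `.<` test never flags a collapsed component. A sketch (not the package's source) of a `NaN`-aware check:

```julia
# NaN < x is false for every x, so `.<` alone cannot catch it.
varfloor = 1e-3
Σ = [0.5, NaN]                                   # second component collapsed

plain  = any(Σ .< varfloor)                      # false — NaN slips through
robust = any(x -> isnan(x) || x < varfloor, Σ)   # true  — NaN is caught
println((plain, robust))                         # (false, true)
```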
Thanks. Yes, it is likely that the repeated data ends up in a Gaussian of its own, which leads to vanishing variance. The relevant code could be revised on that point; I've never really been charmed by the logic in the code there.
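The collapse mechanism can be seen directly: ten identical points have exactly zero sample variance, and evaluating the Gaussian log-density with a zero variance yields `NaN`, which then propagates through the next E-step. A minimal sketch of that failure mode:

```julia
# A component that captures only the repeated points has zero sample
# variance; the Gaussian log-density with Σ = 0 evaluates to NaN
# (log(0) = -Inf and 0/0 = NaN).
points = fill(10.0, 10)
μ  = sum(points) / length(points)
σ² = sum(abs2, points .- μ) / length(points)      # exactly 0.0
logp = -0.5 * (log(2π * σ²) + (points[1] - μ)^2 / σ²)
println((σ², logp))                               # (0.0, NaN)
```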