
Invalid weights under repeated data

tawheeler opened this issue 8 years ago · 1 comment

If you have repeated data, you can end up with NaNs in your weights.

using GaussianMixtures

data = Array{Float64}(undef, 110, 1)  # one variable
data[1:100] = randn(100)
data[101:110] = randn(10) .+ 10.0     # a second, well-separated cluster

model = GaussianMixtures.GMM(2, data)
println(model)

data[101:110] = fill(10.0, 10)        # replace the second cluster with ten identical values
model = GaussianMixtures.GMM(2, data)
println(model)

Results in

GaussianMixtures.GMM{Float64,Array{Float64,2}}(2,1,[0.09090909090969347,0.9090909090903065],[9.463489092287428
 0.19884123133198583],[0.9811041259003161
 0.9898456887418752],[,,,,,,,,,,,,],110)
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
WARNING: 4 pathological elements normalized
GaussianMixtures.GMM{Float64,Array{Float64,2}}(2,1,[NaN,NaN],[0.0
 0.0],[1.0
 1.0],[,,,,,,,,,,,,,,,,,,,,,,],110)

It looks like varfloor is a parameter of em!, but it is not exposed through GMM. One problem is that, in em!, the check tooSmall = any(gmm.Σ .< varfloor, 2) will not flag the offending NaN values, because every comparison with NaN is false. Also, it looks like N and F from stats() are all NaN as well, so the means end up as NaN too.

tawheeler · Sep 13 '16
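A NaN-aware version of that floor check could look roughly like this (a minimal sketch in current Julia syntax, assuming gmm.Σ holds the per-component diagonal variances and varfloor is the value passed to em!; this is not the package's actual fix):

# Sketch: flag components whose variance is below the floor *or* NaN.
# A plain `Σ .< varfloor` misses NaNs, since comparisons with NaN are always false;
# testing "not at least varfloor" catches both tiny and NaN variances.
function pathological_components(Σ::AbstractMatrix{<:Real}, varfloor::Real)
    vec(any(x -> !(x >= varfloor), Σ; dims=2))
end

Components flagged this way could then have their variances reset to the floor, or be re-seeded, before the next EM iteration.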

Thanks. Yes, it is likely that the repeated data ends up in a Gaussian of its own, which leads to a vanishing variance. I suppose the relevant code could be revised to handle that; I've never really been charmed by the logic in that part of the code.

davidavdav · Sep 14 '16
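The vanishing-variance mechanism is easy to see in isolation (a sketch of the arithmetic only, not the package's code): once a component is responsible solely for the ten identical points, its maximum-likelihood variance is exactly zero, and evaluating the Gaussian log-density with a zero variance yields NaN, which then propagates through the posteriors into N, F, and the weights.

x = fill(10.0, 10)                      # the repeated data points
μ = sum(x) / length(x)                  # component mean: 10.0
σ² = sum((x .- μ) .^ 2) / length(x)     # ML variance: 0.0
# log N(x | μ, σ²) at x = 10.0: log(2πσ²) is -Inf and (x - μ)²/σ² is 0/0 = NaN,
# so the whole expression evaluates to NaN.
logp = -0.5 * (log(2π * σ²) + (10.0 - μ)^2 / σ²)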