
Fitting two simple Gaussians

zecevicp opened this issue · 12 comments

Hi, I have another issue. I'm trying to fit two Gaussian components in 2D to samples drawn from a mixture of two simple Gaussians. The example is pretty basic, and the expectation is that pygmmis finds the two Gaussian centers easily. However, that is not the case.

Please check the minimal example below. There are two Gaussian components centered at (-1.5, 0) and (1.5, 0). pygmmis fits a single Gaussian at (-0.34, 0).

Is this expected? Am I doing something wrong?

Thanks!

import numpy as np
import pygmmis

# 2D Gaussian on a grid: amplitude, center (mx, my), widths (sx, sy), constant offset
def gaus2d(xy, amplitude=1, mx=0, my=0, sx=1, sy=1, offset=0):
    x, y = xy
    g = offset + amplitude / (2. * np.pi * sx * sy) * np.exp(-((x - mx)**2. / (2. * sx**2.) + (y - my)**2. / (2. * sy**2.)))
    return g.ravel()

# evaluate the sum of the two Gaussians on a 201x201 grid over [-5, 5]^2
NPOINTS = 201
x = np.linspace(-5, 5, NPOINTS)
y = np.linspace(-5, 5, NPOINTS)
x, y = np.meshgrid(x, y)
z = gaus2d((x, y), 15, -1.5, 0, 1, 1, 0)
g2dx2 = z.reshape(NPOINTS, NPOINTS)
z = gaus2d((x, y), 10, 1.5, 0, 1, 1, 0)
g2dx2 += z.reshape(NPOINTS, NPOINTS)

# rescale an image to the [0, 1] range
def normalize(img):
    positive = img - np.min(img)
    return positive / np.max(positive)

# draw samples by repeating each grid point in proportion to the normalized density
lkimghist = normalize(g2dx2) * 100
xmin, xmax, ymin, ymax = (x.min(), x.max(), y.min(), y.max())
samples = []
for i in range(NPOINTS):
    for j in range(NPOINTS):
        for n in range(int(lkimghist[i, j])):
            samples.append([x[i,j], y[i,j]])
samples = np.array(samples)
gmm = pygmmis.GMM(K=2, D=2)
mins = np.min(samples, axis=0)
maxs = np.max(samples, axis=0)
# selection function: True for samples inside the bounding box of the data
def selcallback(smpls):
    return np.all(np.all([smpls <= maxs, smpls >= mins], axis=0), axis=1)
pygmmis.fit(gmm, samples, sel_callback=selcallback, maxiter=50)
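# (illustrative addition, not in the original) the fitted parameters can be
# inspected on the GMM object; pygmmis stores the component weights, means,
# and covariances as gmm.amp, gmm.mean, and gmm.covar
print(gmm.mean)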

zecevicp commented Feb 17, 2021

Yes, that can happen. The likelihood for mixture models is multi-modal, so the initialization of the GMM plays an important role. You are currently using the default random initialization, and you probably got a bad draw. You can choose different initializations with the option init_method, including kmeans (which initializes the mixtures from the K-means clustering result).
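A minimal sketch for the example above (init_method is a keyword argument of pygmmis.fit; samples and selcallback are from your snippet):

import pygmmis

gmm = pygmmis.GMM(K=2, D=2)
# 'kmeans' seeds the components from a K-means clustering of the data
pygmmis.fit(gmm, samples, init_method='kmeans', sel_callback=selcallback, maxiter=50)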

pmelchior commented Feb 22, 2021

Yes, using kmeans initialization does help in 2D. But when I try it in 7D (the case I'm really interested in), I get the following error:

~/anaconda3/envs/multiproc/lib/python3.8/site-packages/pygmmis.py in fit(gmm, data, covar, R, init_method, w, cutoff, sel_callback, oversampling, covar_callback, background, tol, miniter, maxiter, frozen, split_n_merge, rng)
    663 
    664     try:
--> 665         log_L, N, N2 = _EM(gmm, log_p, U, T_inv, log_S, H, data_, covar=covar_, R=R, sel_callback=sel_callback, oversampling=oversampling, covar_callback=covar_callback, w=w, pool=pool, chunksize=chunksize, cutoff=cutoff, background=background, p_bg=p_bg, changeable=changeable, miniter=miniter, maxiter=maxiter, tol=tol, rng=rng)
    666     except Exception:
    667         # cleanup

~/anaconda3/envs/multiproc/lib/python3.8/site-packages/pygmmis.py in _EM(gmm, log_p, U, T_inv, log_S, H, data, covar, R, sel_callback, oversampling, covar_callback, background, p_bg, w, pool, chunksize, cutoff, miniter, maxiter, tol, prefix, changeable, rng)
    771 
    772     while it < maxiter: # limit loop in case of slow convergence
--> 773         log_L_, N, N2_, N0_ = _EMstep(gmm, log_p, U, T_inv, log_S, H, N0, data, covar=covar, R=R, sel_callback=sel_callback, omega=omega, oversampling=oversampling, covar_callback=covar_callback, background=background, p_bg=p_bg , w=w, pool=pool, chunksize=chunksize, cutoff=cutoff_nd, tol=tol, changeable=changeable, it=it, rng=rng)
    774 
    775         # check if component has moved by more than sigma/2

~/anaconda3/envs/multiproc/lib/python3.8/site-packages/pygmmis.py in _EMstep(gmm, log_p, U, T_inv, log_S, H, N0, data, covar, R, sel_callback, omega, oversampling, covar_callback, background, p_bg, w, pool, chunksize, cutoff, tol, changeable, it, rng)
    846         # create fake data with same mechanism as the original data,
    847         # but invert selection to get the missing part
--> 848         data2, covar2, N0, omega2 = draw(gmm, len(data)*oversampling, sel_callback=sel_callback, orig_size=N0*oversampling, invert_sel=True, covar_callback=covar_callback, background=background, rng=rng)
    849         #data2 = createShared(data2)
    850         N0 = N0/oversampling

~/anaconda3/envs/multiproc/lib/python3.8/site-packages/pygmmis.py in draw(gmm, obs_size, sel_callback, invert_sel, orig_size, covar_callback, background, rng)
   1183     # TODO: may want to decide whether to add noise before selection or after
   1184     # Here we do noise, then selection, but this is not fundamental
-> 1185     data, covar = _drawGMM_BG(gmm, orig_size, covar_callback=covar_callback, background=background, rng=rng)
   1186 
   1187     # apply selection

~/anaconda3/envs/multiproc/lib/python3.8/site-packages/pygmmis.py in _drawGMM_BG(gmm, size, covar_callback, background, rng)
   1112     # draw sample from model, or from background+model
   1113     if background is None:
-> 1114         data2 = gmm.draw(int(np.round(size)), rng=rng)
   1115     else:
   1116         # model is GMM + Background

~/anaconda3/envs/multiproc/lib/python3.8/site-packages/pygmmis.py in draw(self, size, rng)
    250         """
    251         # draw indices for components given amplitudes, need to make sure: sum=1
--> 252         ind = rng.choice(self.K, size=size, p=self.amp/self.amp.sum())
    253         N = np.bincount(ind, minlength=self.K)
    254 

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: probabilities contain NaN

zecevicp commented Feb 23, 2021

It looks like you got some NaNs in your parameters. Hard to tell from this trace, but I suspect that the k-means initialization created them. Check the result of the k-means init directly by calling initFromKMeans.
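Something like this sketch (assuming initFromKMeans takes the GMM and the data array, and that the component parameters are exposed as gmm.amp, gmm.mean, gmm.covar; data stands for your 7D sample array):

import numpy as np
import pygmmis

gmm = pygmmis.GMM(K=20, D=7)
# run only the k-means initialization, then check the parameters for non-finite values
pygmmis.initFromKMeans(gmm, data)
for name, arr in (('amp', gmm.amp), ('mean', gmm.mean), ('covar', gmm.covar)):
    print(name, 'all finite:', np.all(np.isfinite(arr)))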

pmelchior commented Feb 27, 2021

initFromKMeans seems to work OK. However, when the GMM object is initialized from k-means, for some reason the A2 array contains all NaNs after an M step. _Msums seems to be the culprit, but I wasn't able to debug it.

zecevicp commented Mar 4, 2021

Peter, could you please take a look at this repo: https://github.com/zecevicp/pygmmis_test ?

I placed a notebook there with a minimal example to reproduce the error and the accompanying data.

zecevicp commented Mar 18, 2021

The traceback in the notebook shows that the error arises in the logsumexp function (line 156 in pygmmis.py). This suggests a numerical overflow or underflow. I'm guessing that with K=20 in 7D, some of the components have incredibly small probabilities for some data points. This can be rectified by setting the cutoff parameter to something like 3 or 5, which lets the EM neglect samples that are more than 3 or 5 sigma away from the center of a component.
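In code, that's the cutoff keyword of pygmmis.fit (a sketch; data again stands for the 7D sample array):

gmm = pygmmis.GMM(K=20, D=7)
# drop samples more than 5 sigma from a component's center from that component's E-step
pygmmis.fit(gmm, data, init_method='kmeans', cutoff=5)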

pmelchior commented Mar 19, 2021

Unfortunately, a cutoff of 3 or 5 produces the same exception (sometimes with the message "Singular matrix"). Again, this only happens when k-means clustering is the init method...

zecevicp commented Mar 19, 2021

I'm surprised, but it's conceivable that the k-means initialization isn't great in higher-dimensional settings. Can you check whether there are extreme values in the component parameters when you run only the k-means initialization?

Also, can you try the default initialization to see if this problem is limited to k-means?

pmelchior commented Mar 19, 2021

The k-means initialization looks OK to me. But somehow the fit gets lost after a few iterations (for some reason the A2 array contains all NaNs after the M step).

Default initialization brings me back to the original problem stated at the beginning of this issue.

zecevicp commented Mar 19, 2021

I remain confused. The original problem arose in 2D, the k-means problem in 7D. These are widely different settings.

Are you saying that even in the 7D data set, the default initialization gets you 20 degenerate Gaussians?

pmelchior commented Mar 21, 2021

I'll try to clarify...

I'm interested in the 7D case. The 2D case was to see how pygmmis handles a case where scikit-learn implementations fail. It handles it well, IF k-means clustering initialization is used.

The procedure doesn't fail in the same way when using "random" initialization for the 7D dataset, but I'm not satisfied with the results (marginalized fits compared to histograms show that the results are off, similar to what I was seeing in 2D).

So k-means init seems to be my only hope of using pygmmis, but it fails in 7D with the above error. I was hoping you could see what is wrong from the stack trace (the notebook I posted above reproduces the error).

zecevicp commented Mar 21, 2021

You can choose any of the existing initializations (options are 'minmax', 'random', 'kmeans', 'none') or a custom one. If the 7D histogram shows you the location and/or size of specific features, you could turn that into an initialization.
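A sketch of a custom initialization, assuming the parameters can be set directly on the GMM (gmm.amp, gmm.mean, gmm.covar) and that init_method='none' leaves them untouched; centers, widths, and data are hypothetical arrays you would supply:

import numpy as np
import pygmmis

K, D = 20, 7
gmm = pygmmis.GMM(K=K, D=D)
gmm.amp = np.ones(K) / K    # equal weights, must sum to 1
gmm.mean = centers          # (K, D) array of hand-picked feature locations
gmm.covar = np.array([w**2 * np.eye(D) for w in widths])  # (K,) widths -> isotropic covariances
pygmmis.fit(gmm, data, init_method='none')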

pmelchior commented Mar 23, 2021