GaussianMixtures.jl

Weighted data matrix

Open rgiordan opened this issue 9 years ago • 11 comments

Is there any support in GaussianMixtures for weighted rows in a data matrix? For example, if I have a dataset with many repeated observations, can I pass in a matrix of distinct points and a vector of multiplicities?
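To make the request concrete, here is a minimal sketch (plain Python, not GaussianMixtures.jl code) of why distinct points plus multiplicities carry the same information as the expanded dataset. The helper `weighted_mean_var` and the sample values are hypothetical and only illustrate the idea for 1-D data.

```python
from collections import Counter


def weighted_mean_var(points, weights):
    """Weighted mean and maximum-likelihood (biased) variance of 1-D points."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(points, weights)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(points, weights)) / total
    return mean, var


# Hypothetical dataset with many repeated observations.
data = [1.0, 1.0, 1.0, 2.0, 2.0, 5.0]

# Collapse it into distinct points and their multiplicities.
counts = Counter(data)
distinct = list(counts.keys())
multiplicity = [float(c) for c in counts.values()]

# Statistics of the full data (unit weights) and of the collapsed form.
full = weighted_mean_var(data, [1.0] * len(data))
collapsed = weighted_mean_var(distinct, multiplicity)
```

Both calls give the same mean and variance, so a fitting routine that accepts weights could work on the smaller collapsed matrix.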

rgiordan avatar May 25 '15 21:05 rgiordan

I suppose this is possible. It would probably require a fair bit of rewriting, and thinking of how to keep the interface clean.

davidavdav avatar May 29 '15 18:05 davidavdav

It might be a nice feature request.

In the meantime I got something working by hand using the output of gmmposterior (which was very helpful) so I'm good to go.

rgiordan avatar May 29 '15 21:05 rgiordan

I'm also interested in training using weighted datapoints. Could you share your code for that? Thanks.

eford avatar Jun 13 '15 17:06 eford

I would think we have to add weight support to the stats() functions in stats.jl. I haven't looked at the math yet, but I suspect it will boil down to a broadcasting multiply of γ with the (normalized) weights.
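A rough sketch of that idea, for illustration only: this is plain Python for a 1-D mixture, not the stats() code in stats.jl, and the names `gaussian_pdf` and `weighted_em_step` are hypothetical. It assumes the weights act like (possibly fractional) multiplicities, so each posterior γ is multiplied by its point's weight before the zeroth-, first- and second-order statistics are accumulated; dividing by the sum of the weights plays the role of normalization.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def weighted_em_step(xs, ws, pis, means, variances):
    """One EM iteration for a 1-D Gaussian mixture with per-point weights ws."""
    k = len(pis)
    n = [0.0] * k  # zeroth-order statistics
    f = [0.0] * k  # first-order statistics
    s = [0.0] * k  # second-order statistics
    for x, w in zip(xs, ws):
        # E-step: unnormalized posteriors of each component for this point.
        lik = [pis[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(lik)
        for j in range(k):
            g = w * lik[j] / total  # gamma scaled by the point's weight
            n[j] += g
            f[j] += g * x
            s[j] += g * x * x
    # M-step: re-estimate the parameters from the weighted statistics.
    wsum = sum(ws)
    new_pis = [n[j] / wsum for j in range(k)]
    new_means = [f[j] / n[j] for j in range(k)]
    new_vars = [s[j] / n[j] - new_means[j] ** 2 for j in range(k)]
    return new_pis, new_means, new_vars
```

With integer weights, one step on the distinct points should match one step on the expanded dataset, which gives a simple check for any weighted implementation.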

We could add a parameter weights everywhere, but I wouldn't find that a particularly elegant interface. A nicer solution might be to include a possible weight vector in the Data type, since the weights really belong to the data. Is this indeed the use case, that the weights are fixed with the data points?

davidavdav avatar Jun 14 '15 18:06 davidavdav

Yes. For my application, I'm using importance sampling, so each data point has an associated weight.

Adding a weights parameter seems like the natural way to do it to me.

If you want to group the data and weights, then I think it would be better to use a structure of arrays rather than an array of structures. For some applications, there could be multiple sets of weights for one set of data, e.g., different weights for different choices of priors, or different temperatures when using tempering/annealing. I'd propose that those applications are probably best handled by multiple function calls. As long as swapping out the weights can be done efficiently, I don't see a problem.

Using a structure of arrays also makes it easier and more efficient to combine different packages/libraries.
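A small sketch of the structure-of-arrays layout, in plain Python rather than Julia and with purely hypothetical names (`WeightedData`, `with_weights`, `weighted_mean`): the points are stored once, and a different weight vector can be attached without copying them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WeightedData:
    """Structure of arrays: one array of points, one parallel array of weights."""
    points: List[float]
    weights: List[float]

    def with_weights(self, weights):
        """Return a view of the same points paired with another weight vector."""
        if len(weights) != len(self.points):
            raise ValueError("weights must have one entry per data point")
        # The points list is shared, not copied; only the weights change.
        return WeightedData(self.points, list(weights))


def weighted_mean(d):
    """Example consumer that only needs the two parallel arrays."""
    return sum(w * x for x, w in zip(d.points, d.weights)) / sum(d.weights)


# Hypothetical data with unit weights, then a second set of weights
# (e.g. for another prior) applied to the same points.
data = WeightedData([0.0, 1.0, 3.0], [1.0, 1.0, 1.0])
alt = data.with_weights([0.2, 0.5, 0.3])
```

Each set of weights then corresponds to a separate function call on a cheap wrapper around the same underlying data.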

eford avatar Jun 14 '15 22:06 eford

In the meantime, if you're interested, you can find my hand-rolled version with weights in Celeste.jl/blob/master/src/PSF.jl:fit_psf_gaussians

rgiordan avatar Jun 18 '15 17:06 rgiordan

@rgiordan, thanks. My application was different enough that I ended up writing my own em! replacement that allows for weighted data (and also for training a mixture of t-distributions rather than Gaussians). I've only written what I need for my project (e.g., full covariance matrices, data in memory). If anyone's interested, those additions are at https://github.com/eford/GaussianMixtures.jl in src/eford_extensions.jl.

eford avatar Jun 23 '15 17:06 eford

I think it would be useful for GaussianMixtures.jl to expose a set method for the sigma matrix. In my opinion, that was the only fiddly part of rolling my own model.

rgiordan avatar Jun 23 '15 18:06 rgiordan

What do you mean by "expose a set method"? You can always store your covariances in gmm.Σ, but don't forget to store them as GaussianMixtures.invcovar().

I've gone through your diffs, and it seems to be quite a rewrite of the code. Was there no way to include weighting in the existing code?

davidavdav avatar Jun 23 '15 18:06 davidavdav

There might be a way to incorporate weighting. I tried that at first, but after struggling to understand how your code was working, I decided it would be easier to rewrite the training function. Feel free to add the functionality in a more general way.

eford avatar Jun 24 '15 00:06 eford

Ah, that sounds like the code isn't very transparent, which is not great. I suppose it could do with some cleanup and rewrites here and there.

Anyway, when I add weighting to the original code, we will have independent code that can be used for verification.

davidavdav avatar Jun 24 '15 13:06 davidavdav