ClusterR: [Feature Request] for GMM implement gmm_full covariance type?

Hi,

Currently GMM implements the 'gmm_diag' class of the Armadillo library; however, 'gmm_full' is sensible for many data types. Could this functionality be added? I am aware this will increase computational complexity, but it will most likely still be faster than mclust because it links against the Armadillo library.

Is this something that is difficult to implement? Was it a design choice not to include it? It would be really nice to have the option.

thanks!

Mar 30 '23 15:03 FMKerckhof

@FMKerckhof let me have a look into this. At first glance it seems to require a C++ template that returns either a "gmm_full" or a "gmm_diag" model (I'll have more time on Sunday afternoon to see if this is feasible and make the adjustments). Based on the Armadillo documentation, a few parameters work with either "gmm_full" or "gmm_diag".

[image: diagonal_gmm]

Mar 31 '23 04:03 mlampros

I modified the "GMM()" function, and it now takes an additional parameter, "full_covariance_matrices", which defaults to FALSE so that diagonal covariance matrices are returned by default. If this parameter is TRUE, full covariance matrices are returned. However, the dimensions of the "covariance_matrices" output object differ: with diagonal covariance matrices the output is a matrix (one row per component), whereas with full covariance matrices it is a 3-dimensional array (one slice per component):

library(ClusterR)
data(dietary_survey_IBS)
dat = as.matrix(dietary_survey_IBS[, -ncol(dietary_survey_IBS)])
dat = center_scale(dat)

# diagonal covariance matrices
gmm = GMM(data = dat, 
          gaussian_comps = 3, 
          full_covariance_matrices = FALSE,
          verbose = TRUE)
str(gmm)
# List of 5
# $ call               : language GMM(data = dat, gaussian_comps = 3, verbose = TRUE, full_covariance_matrices = FALSE)
# $ centroids          : num [1:3, 1:42] 0.182 -0.472 0.585 0.527 -0.603 ...
# $ covariance_matrices: num [1:3, 1:42] 0.7439 0.2761 1.4364 1.5807 0.0649 ...
# $ weights            : num [1:3] 0.141 0.5 0.359
# $ Log_likelihood     : num [1:400, 1:3] -61.2 -61.6 -71.6 -72.4 -58.9 ...
# - attr(*, "class")= chr [1:2] "GMMCluster" "Gaussian Mixture Models"

# full covariance matrices
gmm_f = GMM(data = dat, 
            gaussian_comps = 3,
            full_covariance_matrices = TRUE,
            verbose = TRUE)
str(gmm_f)
# List of 5
# $ call               : language GMM(data = dat, gaussian_comps = 3, verbose = TRUE, full_covariance_matrices = TRUE)
# $ centroids          : num [1:3, 1:42] 0.15 -0.472 0.626 0.535 -0.603 ...
# $ covariance_matrices: num [1:42, 1:42, 1:3] 0.7333 -0.0758 0.0868 0.0951 0.0306 ...
# $ weights            : num [1:3] 0.162 0.5 0.338
# $ Log_likelihood     : num [1:400, 1:3] -109.6 -52.2 -175.7 -116.5 -49.5 ...
# - attr(*, "class")= chr [1:2] "GMMCluster" "Gaussian Mixture Models"
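To make the shape difference concrete, here is a minimal base-R sketch that expands the diagonal layout (components x features) into the same 3-dimensional layout (features x features x components) used for full covariance matrices. The helper name `diag_to_full` and the toy values are hypothetical and not part of ClusterR; the sketch only assumes the two layouts shown in the `str()` output above.

```r
# Hypothetical helper (not part of ClusterR): expand a k x d matrix of
# per-component variances into a d x d x k array of diagonal matrices.
diag_to_full <- function(cov_diag) {
  k <- nrow(cov_diag)  # number of Gaussian components
  d <- ncol(cov_diag)  # number of features
  out <- array(0, dim = c(d, d, k))
  for (i in seq_len(k)) {
    # nrow = d keeps the result d x d even when d == 1
    out[, , i] <- diag(cov_diag[i, ], nrow = d)
  }
  out
}

# Toy input: 3 components, 4 features (values are made up)
cov_diag <- matrix(seq(0.5, 6, by = 0.5), nrow = 3, ncol = 4)
cov_full <- diag_to_full(cov_diag)
dim(cov_full)
# [1] 4 4 3
```

Either way, the two layouts differ, so any code that consumes "covariance_matrices" has to handle both.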

Because of this difference in shape, the "predict_GMM()" function needs adjustments (especially the Rcpp function) so that it returns the log-likelihoods, probabilities and clusters for full covariance matrices as well. I wrote this package back in 2017 and haven't reviewed the GMM literature since then. I would also accept a PR for the predict function adjustment. The current changes can be installed using:

remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')
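For anyone attempting the predict adjustment, the following base-R sketch shows the kind of computation a full-covariance prediction needs: per-component Gaussian log-densities via a Cholesky factor, followed by normalized probabilities and hard labels. It is not ClusterR's implementation. The function name `gmm_full_predict`, the names of its outputs, and the decision to keep the mixture weights out of the log-density matrix are all assumptions made for illustration.

```r
# Hypothetical sketch (not ClusterR's code).
# x: n x d data, centroids: k x d, covs: d x d x k array, weights: length k
gmm_full_predict <- function(x, centroids, covs, weights) {
  n <- nrow(x); d <- ncol(x); k <- nrow(centroids)
  log_lik <- matrix(NA_real_, nrow = n, ncol = k)
  for (j in seq_len(k)) {
    R <- chol(matrix(covs[, , j], d, d))      # upper factor: cov = t(R) %*% R
    centered <- sweep(x, 2, centroids[j, ])   # subtract the component mean
    # solve t(R) %*% z = t(centered); column norms give Mahalanobis distances
    z <- backsolve(R, t(centered), transpose = TRUE)
    maha <- colSums(z^2)
    log_det <- 2 * sum(log(diag(R)))
    log_lik[, j] <- -0.5 * (d * log(2 * pi) + log_det + maha)
  }
  # add log-weights, then normalize each row with log-sum-exp
  log_w <- sweep(log_lik, 2, log(weights), "+")
  m <- apply(log_w, 1, max)
  log_norm <- m + log(rowSums(exp(log_w - m)))
  probs <- exp(log_w - log_norm)
  list(log_likelihood = log_lik,
       cluster_proba = probs,
       cluster_labels = max.col(probs, ties.method = "first"))
}

# Toy usage with made-up data: two well-separated components in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 5), ncol = 2))
centroids <- rbind(c(0, 0), c(5, 5))
covs <- array(c(diag(2), diag(2)), dim = c(2, 2, 2))
res <- gmm_full_predict(x, centroids, covs, weights = c(0.5, 0.5))
table(res$cluster_labels)
```

Working through the Cholesky factor avoids inverting each covariance matrix explicitly, and the log-sum-exp step keeps the probabilities stable when the log-densities are very negative.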

Apr 02 '23 12:04 mlampros

Thanks @mlampros, much appreciated! Regarding the PR for the predict_GMM function: while I am a fairly competent R programmer, my Rcpp knowledge is next to none. I will fork and see how far I can get.

Apr 02 '23 16:04 FMKerckhof
