gap_statistic Allow setting random state for reproducibility

Dear Miles,

I have run gap-stat several times on the same dataset. However, the optimal number of clusters that it returns is not always the same. I guess this happens because the reference distribution is randomly generated (the code uses numpy for that). So, for reproducibility, it seems reasonable for the optimalK function to accept a random_state argument.
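A rough sketch of the idea, not the package's actual code: the helper name `sample_reference` and its signature are hypothetical, and I am assuming the reference data is drawn uniformly over the bounding box of the input. Seeding a numpy generator makes the reference sample identical from run to run:

```python
import numpy as np


def sample_reference(X, random_state=None):
    """Draw a uniform reference dataset over the bounding box of X.

    Hypothetical helper for illustration; `random_state` seeds the generator.
    """
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # One uniform draw per entry of X, scaled per feature to [lo, hi).
    return rng.uniform(lo, hi, size=X.shape)


X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [4.0, 4.0]])

ref_a = sample_reference(X, random_state=42)
ref_b = sample_reference(X, random_state=42)  # same seed, same sample
ref_c = sample_reference(X)                   # unseeded, differs between runs

print(np.array_equal(ref_a, ref_b))
```

With the same seed the two reference samples match exactly, so any statistic computed from them would match as well.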

If you agree, I could try to change the code accordingly, with your guidance and help.

Thanks!

Jan 21 '23 12:01 psads-git

Or even better: let the user select the number of Monte Carlo (“bootstrap”) samples. The reason is given in the documentation of the R function clusGap:

The main result $Tab[,"gap"] of course is from bootstrapping aka Monte Carlo simulation and hence random, or equivalently, depending on the initial random seed (see set.seed()). On the other hand, in our experience, using B = 500 gives quite precise results such that the gap plot is basically unchanged after an another run.
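To illustrate the point about B, here is a simplified sketch under my own assumptions (not clusGap's or gap-stat's code). It uses only the k = 1 dispersion, i.e. the sum of squared distances to the centroid, and uniform reference data over the bounding box; the function names and the `n_refs` parameter are made up for this example. It compares how much the averaged reference log-dispersion varies between runs for a small and a large number of reference samples:

```python
import numpy as np


def mean_ref_log_dispersion(X, n_refs, seed):
    """Average log-dispersion (k = 1) over n_refs uniform reference datasets."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    logs = []
    for _ in range(n_refs):
        ref = rng.uniform(lo, hi, size=X.shape)
        # k = 1 dispersion: sum of squared distances to the single centroid.
        logs.append(np.log(((ref - ref.mean(axis=0)) ** 2).sum()))
    return float(np.mean(logs))


def run_to_run_spread(X, n_refs, n_runs=20):
    """Standard deviation of the averaged value across differently seeded runs."""
    vals = [mean_ref_log_dispersion(X, n_refs, seed) for seed in range(n_runs)]
    return float(np.std(vals))


X = np.random.default_rng(0).normal(size=(50, 2))

spread_small = run_to_run_spread(X, n_refs=10)
spread_large = run_to_run_spread(X, n_refs=500)
print(spread_small, spread_large)
```

The spread across runs should shrink roughly like one over the square root of the number of reference samples, which is why a large B makes the result nearly independent of the seed.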

Jan 21 '23 12:01 psads-git

Hi!

I suppose one could use the clusterer param to supply their own callable which takes a random state? But anyway, I'm open to this addition and have no strong opinions on how it ought to be done. So please feel free to open another PR and we'll see how it goes. :+1:
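Something along these lines, perhaps. This is an untested sketch that assumes the callable takes `(X, k)` plus optional keyword arguments and returns `(centroids, labels)`; the function `seeded_kmeans` is made up for illustration and is a bare-bones Lloyd's k-means, not the clustering the package ships with:

```python
import numpy as np


def seeded_kmeans(X, k, random_state=0, n_iter=20):
    """Plain Lloyd's k-means with a seeded initialisation.

    Hypothetical callable: takes (X, k, **kwargs), returns (centroids, labels).
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    # Pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Pairwise distances, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    # Recompute labels so they agree with the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

c1, l1 = seeded_kmeans(X, 2, random_state=7)
c2, l2 = seeded_kmeans(X, 2, random_state=7)
print(np.array_equal(c1, c2), np.array_equal(l1, l2))
```

Note that this only pins down the clustering step. If the reference distribution is drawn from an unseeded generator, as described above, the result could still vary between runs.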

Jan 21 '23 20:01 milesgranger

Dear Miles,

One can run R from inside Python via the rpy2 package. On the same dataset, the R package NbClust consistently returns the same optimal number of clusters and the same value for the gap statistic:

# R code
library(NbClust)

res <- NbClust(data_normalized, distance = "euclidean",
               min.nc = 2, max.nc = 10, method = "kmeans", index = "gap")

print(res$Best.nc)

So, I have to study the way they do that.

Have a nice Sunday!

Paulo

Jan 22 '23 12:01 psads-git

Added this functionality in #61.

Jul 05 '23 19:07 lebedov