Allow setting random state for reproducibility
Dear Miles,
I have run gap-stat several times on the same dataset. However, the optimal number of clusters that gap-stat returns is not always the same. I guess this happens because the reference distribution is randomly generated (the code uses numpy for that). So, for reproducibility, it seems reasonable for the optimalK function to accept a random_state argument.
If you agree, maybe I would be able to change the code accordingly, with your directions and help.
Thanks!
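A minimal sketch of what a random_state argument could control, assuming the reference data is drawn uniformly inside the bounding box of the input. The helper name make_reference is hypothetical and is not part of gap-stat; how the package actually draws its reference data may differ.

```python
import numpy as np


def make_reference(X, random_state=None):
    """Draw one uniform reference dataset inside the bounding box of X.

    Hypothetical helper for illustration only, not gap-stat's API.
    """
    # A dedicated generator keeps the draw independent of numpy's global state.
    rng = np.random.default_rng(random_state)
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=X.shape)


X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [2.0, 4.0]])

ref_a = make_reference(X, random_state=42)
ref_b = make_reference(X, random_state=42)  # same seed, same reference data
ref_c = make_reference(X, random_state=7)   # different seed, different data
```

With the seed fixed, every quantity computed from the reference data is repeatable, provided the clustering step itself is also deterministic or seeded.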
Or even better: let the user select the number of Monte Carlo (“bootstrap”) samples. The reason is given in the documentation of the R function clusGap:
The main result $Tab[,"gap"] of course is from bootstrapping aka Monte Carlo simulation and hence random, or equivalently, depending on the initial random seed (see set.seed()). On the other hand, in our experience, using B = 500 gives quite precise results such that the gap plot is basically unchanged after an another run.
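The effect of B described in the quote can be illustrated with a small simulation. This is a sketch under simplifying assumptions: it looks only at the reference term for k = 1 (so no clustering is needed), it draws uniform reference data in the bounding box of the input, and all function names are hypothetical rather than taken from gap-stat or clusGap.

```python
import numpy as np


def log_dispersion_k1(X):
    """log W_1: log of the sum of squared distances to the overall mean."""
    return np.log(((X - X.mean(axis=0)) ** 2).sum())


def reference_term(X, B, rng):
    """Average log W_1 over B uniform reference datasets."""
    low, high = X.min(axis=0), X.max(axis=0)
    draws = [
        log_dispersion_k1(rng.uniform(low, high, size=X.shape))
        for _ in range(B)
    ]
    return float(np.mean(draws))


def spread_over_runs(X, B, n_runs=20, seed=1):
    """Standard deviation of the reference term across repeated runs."""
    rng = np.random.default_rng(seed)
    return float(np.std([reference_term(X, B, rng) for _ in range(n_runs)]))


X = np.random.default_rng(0).normal(size=(100, 2))

spread_small_B = spread_over_runs(X, B=10)
spread_large_B = spread_over_runs(X, B=500)
print(spread_small_B, spread_large_B)
```

Because the reference term is an average of B independent draws, its run-to-run spread shrinks roughly like 1/sqrt(B), which is why a large B makes repeated runs agree even without a fixed seed.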
Hi!
I suppose one could use the clusterer param to pass their own callable which takes a random state? But anyway, I'm open to this addition and have no strong opinions on how it ought to be done. So please feel free to open another PR and we'll see how it goes. :+1:
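The callable idea could look roughly like the sketch below. It assumes the clusterer is called as clusterer(X, k) and returns a (centers, labels) pair; that signature is an assumption here and should be checked against the package. The tiny k-means is a stand-in written for illustration, not the clustering gap-stat uses.

```python
from functools import partial

import numpy as np


def seeded_kmeans(X, k, random_state=None, n_iter=20):
    """Tiny Lloyd's k-means returning (centers, labels). Illustrative only."""
    rng = np.random.default_rng(random_state)
    # Initialise the centers from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final assignment so the labels match the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


# Bind the seed once; the result has the assumed (X, k) call signature.
clusterer = partial(seeded_kmeans, random_state=42)

X = np.random.default_rng(0).normal(size=(60, 2))
centers_a, labels_a = clusterer(X, 3)
centers_b, labels_b = clusterer(X, 3)  # identical to the first call
```

Note that this only fixes the clustering step. The randomly generated reference data would still vary between runs unless it is seeded as well.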
Dear Miles,
One can run R from inside Python via the rpy2 package. On the same dataset, the R package NbClust consistently returns the same optimal number of clusters and the same value of the gap statistic:
# R code
library(NbClust)
res <- NbClust(data_normalized, distance = "euclidean",
               min.nc = 2, max.nc = 10, method = "kmeans", index = "gap")
print(res$Best.nc)
So, I have to study the way they do that.
Have a nice Sunday!
Paulo
Added this functionality in #61.