kmodes
Determining the optimal number of clusters
Hi, I've been using kmodes (https://www.rdocumentation.org/packages/klaR/versions/0.6-12/topics/kmodes) from klaR, an R package, to cluster my data set. I wanted to try kmodes in Python to see if I get similar results. However, I don't see how to determine the optimal number of clusters in the Python version of kmodes.
In the klaR package, I can use the $withindiff component to get the within-cluster simple-matching distance for each cluster. This lets me calculate the sum of errors for k = 2, 3, 4, ..., etc. and select the optimal number of clusters based on the largest difference in the sum of errors between successive k values.
In the kmodes for Python, how do you determine the optimal k?
Simply by running the clustering for multiple k values, as there is currently no wrapper that does this for you automatically.
It would be nice to combine this with the silhouette plot mentioned here
PRs are welcome. :)
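In the meantime, here is a minimal sketch of that manual approach, assuming `data` holds your label-encoded categorical records: fit KModes over a range of k values and compare the final cost_, the Python analogue of summing klaR's $withindiff.

from kmodes.kmodes import KModes

# Fit KModes for a range of k and record the final clustering cost,
# i.e. the sum of simple-matching dissimilarities of all points to
# their cluster modes.
costs = {}
for k in range(2, 8):
    km = KModes(n_clusters=k, init='Huang', n_init=5)
    km.fit(data)  # `data`: label-encoded categorical matrix (assumption)
    costs[k] = km.cost_

# Pick the "elbow": the k where the cost drop levels off.
for k in sorted(costs):
    print(k, costs[k])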
And how do you determine the optimal k for k-prototypes?
I am working on clustering mixed categorical and numerical attributes. When I stumbled across your k-prototypes implementation, I wanted to apply it to my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g., how to determine the optimal k).
But since a silhouette plot was mentioned as doing the trick, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to determine the intra- and inter-cluster distances in the silhouette analysis.
Do you think that would work?
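For illustration, one way that idea could look. This is only a sketch under assumptions: `Xnum` holds the numerical columns, `Xcat` the label-encoded categorical columns, `labels` the cluster assignments, and `gamma` the numerical/categorical trade-off weight (e.g. kp.gamma from a fitted model). It builds a precomputed k-prototypes-style dissimilarity matrix, squared Euclidean plus gamma times simple matching, and feeds it to scikit-learn's silhouette.

from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

# Pairwise squared Euclidean distances on the numerical part.
num_dist = cdist(Xnum, Xnum, metric='sqeuclidean')
# Pairwise simple-matching dissimilarity (number of mismatching
# categorical attributes); cdist's 'hamming' returns a fraction,
# so scale by the number of categorical columns.
cat_dist = cdist(Xcat, Xcat, metric='hamming') * Xcat.shape[1]
# Combine the two parts the same way the k-prototypes cost does.
mixed = num_dist + gamma * cat_dist

print(silhouette_score(mixed, labels, metric='precomputed'))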
Hi @dexdimas, @nicodv, all,
I am also working with k-prototypes and trying to find the optimal k value. Can you please share your experience/approach for finding the optimal k when using k-prototypes? It would be great if you could share some code and links.
Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 of which have about 10,000 categories each, and the rest about 10-12), 11 numerical columns, and 10 binary columns, with a data size of 80 million records.
PS: I am trying to find patterns and outliers, specifically outliers that do not fit in with the normal clusters. I am using health care data.
Thank you in advance; any help is appreciated.
Hi @nicodv,
I'm working on an implementation of the silhouette score that uses dissimilarity (between each pair of elements of the array) as the distance metric and gives the optimal number of clusters, k. What other metric would you consider a good basis for the silhouette score calculation?
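For the pure-categorical case, a minimal sketch of that idea (assuming `X_encoded` is a label-encoded categorical matrix and `km` an already-fitted KModes model) is to pass the simple-matching (Hamming) dissimilarity straight to scikit-learn:

from sklearn.metrics import silhouette_score

# 'hamming' gives the fraction of mismatching attributes per pair,
# i.e. simple-matching dissimilarity up to a constant factor.
score = silhouette_score(X_encoded, km.labels_, metric='hamming')
print(score)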
I use the silhouette for the numerical variables and continue using the cost for all variables, with a small change here in kprototypes.py and this piece of code in the implementation:
import time
from kmodes.kprototypes import KPrototypes
from sklearn import metrics

lista = []
for nc in range(20, 23):
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1,
                     random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # Silhouette on the numerical columns only; kp.cost_ covers everything.
    # kp.best comes from the small change to kprototypes.py linked above.
    lista.append([nc,
                  'Silhouette Coefficient: %0.3f' % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_),
                  'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start),
                  'best run: %d' % (list(kp.best.keys())[0] + 1)])
That way you can get at least a partial result.
Hello,
How do you calculate the silhouette score for k-prototypes if I have a silhouette score for the categorical data (Hamming) and a silhouette score for the numerical data (Euclidean)? Should I take a weighted average of the two coefficients according to the gamma value?
How would this weighted average be calculated?
It could be done this way:
(silhouette_category * kp.gamma) + (silhouette_numeric * (1 - kp.gamma))
thanks
@matiasscorsetti gamma is not in [0, 1] (a proportionality coefficient) but in [0, +∞).
From reading the R implementation of "silhouette_kproto" (line 1134 on RDocumentation; gamma is called lambda there), it seems to me they weight the two silhouette values as follows: (silhouette_category * gamma) + silhouette_numeric.
But I may be wrong...
Any ideas, @nicodv?
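Purely as an illustration of the two weightings discussed above (the helper and the numbers are hypothetical, not part of either library):

def combined_silhouette(sil_num, sil_cat, gamma):
    # Convex combination proposed earlier; only meaningful if gamma
    # were restricted to [0, 1], which it is not in k-prototypes.
    convex = sil_cat * gamma + sil_num * (1 - gamma)
    # Weighting as read from klaR's silhouette_kproto (where gamma is
    # called lambda): the categorical silhouette is scaled by gamma.
    klar_style = sil_cat * gamma + sil_num
    return convex, klar_style

# With gamma > 1 the convex form gives the numerical part a negative weight:
print(combined_silhouette(sil_num=0.4, sil_cat=0.3, gamma=2.0))  # ≈ (0.2, 1.0)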