kmodes
Determining the optimal number of clusters
Hi, I've been using kmodes (https://www.rdocumentation.org/packages/klaR/versions/0.6-12/topics/kmodes) from klaR, an R package, to cluster my data set. I wanted to try kmodes in Python to see if I get similar results. However, I don't see how to determine the optimal number of clusters in the Python version of kmodes.
In the klaR package, I can use the $withindiff component to get the within-cluster simple-matching distance for each cluster. This lets me calculate the sum of errors for k = 2, 3, 4, ..., etc. and select the optimal number of clusters based on the largest difference in the sum of errors between successive k values.
In the kmodes for Python, how do you determine the optimal k?
Simply by running the clustering for multiple k values, as there is currently no wrapper that does this for you automatically.
It would be nice to combine this with the silhouette plot mentioned here
PRs are welcome. :)
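In the meantime, here is a minimal sketch of that manual approach, assuming `data` holds your label-encoded categorical records: fit KModes over a range of k values and compare the final cost_, the Python analogue of summing klaR's $withindiff.

from kmodes.kmodes import KModes

# Fit KModes for a range of k and record the final clustering cost,
# i.e. the sum of simple-matching dissimilarities of all points to
# their cluster modes.
costs = {}
for k in range(2, 8):
    km = KModes(n_clusters=k, init='Huang', n_init=5)
    km.fit(data)  # `data`: label-encoded categorical matrix (assumption)
    costs[k] = km.cost_

# Pick the "elbow": the k where the cost drop levels off.
for k in sorted(costs):
    print(k, costs[k])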
And how do you determine the optimal k for k-prototypes?
I am working on clustering mixed categorical and numerical attributes. When I stumbled across your k-prototypes implementation, I wanted to apply it to my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g., how to determine the optimal k).
But since a silhouette plot was mentioned as doing the trick, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to determine the intra- and inter-cluster distances in the silhouette analysis.
Do you think that would work?
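For illustration, one way that idea could look. This is only a sketch under assumptions: `Xnum` holds the numerical columns, `Xcat` the label-encoded categorical columns, `labels` the cluster assignments, and `gamma` the numerical/categorical trade-off weight (e.g. kp.gamma from a fitted model). It builds a precomputed k-prototypes-style dissimilarity matrix, squared Euclidean plus gamma times simple matching, and feeds it to scikit-learn's silhouette.

from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

# Pairwise squared Euclidean distances on the numerical part.
num_dist = cdist(Xnum, Xnum, metric='sqeuclidean')
# Pairwise simple-matching dissimilarity (number of mismatching
# categorical attributes); cdist's 'hamming' returns a fraction,
# so scale by the number of categorical columns.
cat_dist = cdist(Xcat, Xcat, metric='hamming') * Xcat.shape[1]
# Combine the two parts the same way the k-prototypes cost does.
mixed = num_dist + gamma * cat_dist

print(silhouette_score(mixed, labels, metric='precomputed'))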
Hi @dexdimas, @nicodv, all,
I am also working with k-prototypes and trying to find the optimal k value. Can you please share your experience/approach for finding the optimal k when using k-prototypes? It would be great if you could share some code and links.
Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 of which have about 10,000 categories each, and the rest about 10-12), 11 numerical columns, and 10 binary columns, with a data size of 80 million records.
PS: I am trying to find patterns and outliers, specifically outliers that do not fit in with the normal clusters. I am using health care data.
Thank you in advance; any help is appreciated.
Hi @nicodv,
I'm working on an implementation of the silhouette score that uses dissimilarity (between each pair of elements of the array) as the distance metric and gives the optimal number of clusters, k. What other metric would you consider a good basis for the silhouette score calculation?
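For the pure-categorical case, a minimal sketch of that idea (assuming `X_encoded` is a label-encoded categorical matrix and `km` an already-fitted KModes model) is to pass the simple-matching (Hamming) dissimilarity straight to scikit-learn:

from sklearn.metrics import silhouette_score

# 'hamming' gives the fraction of mismatching attributes per pair,
# i.e. simple-matching dissimilarity up to a constant factor.
score = silhouette_score(X_encoded, km.labels_, metric='hamming')
print(score)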
I use the silhouette for the numerical variables and continue using the cost for all variables, with a small change here in kprototypes.py and this piece of code in the implementation:
import time
from kmodes.kprototypes import KPrototypes
from sklearn import metrics

lista = []
for nc in range(20, 23):
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1,
                     random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # Silhouette on the numerical columns only; kp.cost_ covers everything.
    # kp.best comes from the small change to kprototypes.py linked above.
    lista.append([nc,
                  'Silhouette Coefficient: %0.3f' % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_),
                  'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start),
                  'best run: %d' % (list(kp.best.keys())[0] + 1)])
That way you can get at least a partial result.
Hello,
How do you calculate the silhouette score for k-prototypes if I have a silhouette score for the categorical data (Hamming) and a silhouette score for the numerical data (Euclidean)? Should I take a weighted average of the two coefficients according to the gamma value?
How would this weighted average be calculated?
It could be done this way:
(silhouette_category * kp.gamma) + (silhouette_numeric * (1 - kp.gamma))
thanks
@matiasscorsetti gamma is not in [0, 1] (a proportionality coefficient) but in [0, +∞).
From reading the R implementation of "silhouette_kproto" (line 1134 on RDocumentation; gamma is called lambda there), it seems to me they weight the two silhouette values as follows: (silhouette_category * gamma) + silhouette_numeric.
But I may be wrong...
Any ideas, @nicodv?
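Purely as an illustration of the two weightings discussed above (the helper and the numbers are hypothetical, not part of either library):

def combined_silhouette(sil_num, sil_cat, gamma):
    # Convex combination proposed earlier; only meaningful if gamma
    # were restricted to [0, 1], which it is not in k-prototypes.
    convex = sil_cat * gamma + sil_num * (1 - gamma)
    # Weighting as read from klaR's silhouette_kproto (where gamma is
    # called lambda): the categorical silhouette is scaled by gamma.
    klar_style = sil_cat * gamma + sil_num
    return convex, klar_style

# With gamma > 1 the convex form gives the numerical part a negative weight:
print(combined_silhouette(sil_num=0.4, sil_cat=0.3, gamma=2.0))  # ≈ (0.2, 1.0)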