
Why are the medoids outside of the clusters?

Open dinani65 opened this issue 3 years ago • 7 comments

Hi, I have used sklearn_extra to cluster my data based on cosine similarity. The data consists of 100-dimensional vectors. After clustering, I reduce the dimensionality to visualize the result. I am a bit confused about 'kmedoids.cluster_centers_': does it return the medoids of the clusters? Should the medoids lie approximately in the middle of the clusters? When I visualize the clusters, the points in 'kmedoids.cluster_centers_' fall outside the clusters.

[figure: 2-D visualization of the clusters, with the cluster centers lying outside the clusters]
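For reference, a minimal sketch of this kind of setup (the data below is synthetic; the real vectors and the number of clusters are not shown here). In KMedoids, cluster_centers_ are actual data points (the medoids), unlike the averaged centroids of KMeans:

import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 100))          # stand-in for the 100-dimensional vectors

kmedoids = KMedoids(n_clusters=5, metric="cosine", random_state=0).fit(X)

print(kmedoids.cluster_centers_.shape)   # (5, 100): the medoids, i.e. actual rows of X
print(kmedoids.medoid_indices_)          # their row indices in X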

dinani65 avatar Jan 04 '21 10:01 dinani65

Visualizing 100-dimensional vectors is a very tricky question. I don't know how you reduced the dimension, but the curse of dimensionality makes the geometry unintuitive; informally, we often say that in very high dimension all points are far from one another. Even if a point is in the center of a cluster in the original space, its projection is not necessarily in the center of the projected cluster. The fact that you use the cosine distance does not help, because the cosine distance is not as intuitive as the Euclidean one either.

Example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_distances

d = 100
N = 100
np.random.seed(42)

X = np.random.normal(size=[N, d])

y = X[:, 0] > 0  # say the two clusters are the two half-spaces along the first coordinate

D1 = cosine_distances(X[y], X[y])        # distance matrix within cluster 1
center1 = np.argmin(np.sum(D1, axis=1))  # index of the point minimizing the total distance (the medoid)

D2 = cosine_distances(X[~y], X[~y])      # distance matrix within cluster 2
center2 = np.argmin(np.sum(D2, axis=1))  # index of the point minimizing the total distance (the medoid)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[center1][0]], [X[center1][1]], c="lime")
plt.scatter([X[center2][0]], [X[center2][1]], c="lime")

I obtain the following figure; the medoids are in green:

[figure: 2-D scatter of the two clusters, with the computed medoids shown in green away from the visual centers]

EDIT: Note that it could also be a convergence problem; KMedoids is not very stable in high dimension, but then no clustering algorithm is really stable in high dimension. It could also be a bug, but we cannot conclude that from the information given so far.
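One rough way to check the convergence/stability point (a sketch on synthetic data, assuming scikit-learn-extra's KMedoids and its inertia_ and medoid_indices_ attributes):

import numpy as np
from sklearn_extra.cluster import KMedoids

np.random.seed(42)
X = np.random.normal(size=[200, 100])  # synthetic high-dimensional data

# Refit with several seeds; large variation in the total cost (inertia_) or in the
# selected medoids across restarts points to a convergence/stability issue.
for seed in range(5):
    km = KMedoids(n_clusters=2, metric="cosine", random_state=seed).fit(X)
    print(seed, km.inertia_, np.sort(km.medoid_indices_))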

TimotheeMathieu avatar Jan 04 '21 10:01 TimotheeMathieu

You used cosine similarity, so distance does not matter, only the angle. Depending on what your data looked like before you projected it, these two points may well have had the smallest sum of angles to the rest of their clusters.
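A tiny illustration of that point:

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

a = np.array([[1.0, 1.0]])
b = np.array([[100.0, 100.0]])  # same direction as a, but far away in Euclidean terms
c = np.array([[1.0, -1.0]])     # close to a in Euclidean terms, but orthogonal

print(cosine_distances(a, b))   # ~0: same angle, so "close" under cosine distance
print(cosine_distances(a, c))   # 1: orthogonal, so "far", despite the small Euclidean gap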

kno10 avatar Feb 02 '21 12:02 kno10

For the Euclidean distance, we observe the same phenomenon:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

d = 100
N = 100
np.random.seed(42)

X = np.random.normal(size=[N, d])

y = X[:, 0] > 0  # say the two clusters are the two half-spaces along the first coordinate

D1 = euclidean_distances(X[y], X[y])     # distance matrix within cluster 1
center1 = np.argmin(np.sum(D1, axis=1))  # index of the point minimizing the total distance (the medoid)

D2 = euclidean_distances(X[~y], X[~y])   # distance matrix within cluster 2
center2 = np.argmin(np.sum(D2, axis=1))  # index of the point minimizing the total distance (the medoid)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[center1][0]], [X[center1][1]], c="lime")
plt.scatter([X[center2][0]], [X[center2][1]], c="lime")

[figure: same 2-D scatter with Euclidean distances, medoids in green again away from the visual centers]

TimotheeMathieu avatar Feb 02 '21 12:02 TimotheeMathieu

center1 and center2 are indices into the X[y] and X[~y] subsets in your code, so X[center1] is wrong; you want X[y][center1] and X[~y][center2]. Hence you plot the wrong points.
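Concretely, the plotting lines of the snippet above become:

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[y][center1][0]], [X[y][center1][1]], c="lime")
plt.scatter([X[~y][center2][0]], [X[~y][center2][1]], c="lime")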

kno10 avatar Feb 02 '21 12:02 kno10

Thanks for catching the bug. Here is the result with the bug corrected; the medoids are still not in the middle of the clusters:

[figure: corrected plot, medoids in green, still away from the visual centers of the clusters]

TimotheeMathieu avatar Feb 02 '21 19:02 TimotheeMathieu

Because you only plot 2 of the 100 dimensions. The medoid is not central in each dimension taken individually.
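A rough way to see this, reusing X, y and center1 from the Euclidean snippet above:

medoid = X[y][center1]
cluster = X[y]

# The medoid does minimize the total distance to the cluster over all 100 dimensions...
print(euclidean_distances(cluster, medoid[None]).sum())

# ...but in the two plotted coordinates it is roughly as far from the cluster mean
# as a typical point, so it need not look central in the 2-D view.
print(np.abs(medoid[:2] - cluster[:, :2].mean(axis=0)))                 # medoid's offset
print(np.abs(cluster[:, :2] - cluster[:, :2].mean(axis=0)).mean(axis=0))  # typical offset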

kno10 avatar Feb 02 '21 23:02 kno10

Yes, that was exactly my point, thank you.

TimotheeMathieu avatar Feb 03 '21 07:02 TimotheeMathieu