scikit-learn-extra
Why are the medoids outside of the clusters?
Hi
I have used sklearn_extra for clustering my data based on cosine similarity. The data consists of 100-dimensional vectors.
After clustering, I reduce the dimensionality to visualize the clustering. I am a bit confused about kmedoids.cluster_centers_: is it returning the medoids of the clusters? Should the medoids be approximately in the middle of the clusters?
When I visualize the clusters, the kmedoids.cluster_centers_ are outside of the clusters.
Visualizing 100-dimensional vectors is a very tricky question. I don't know how you reduced the dimension, but the curse of dimensionality makes the geometry unintuitive; informally, we often say that in very high dimension all the points are far from one another. Even if a point is at the center of a cluster, as you say, its projection is not necessarily at the center of the projected cluster. The fact that you use cosine distance does not help, because cosine distance is not as intuitive as Euclidean distance either.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_distances

d = 100
N = 100
np.random.seed(42)
X = np.random.normal(size=(N, d))
y = X[:, 0] > 0  # let us say that the clustering splits the space into two half-spaces
D1 = cosine_distances(X[y], X[y])  # distance matrix within cluster 1
center1 = np.argmin(np.sum(D1, axis=1))  # point that minimizes the inertia (the medoid)
D2 = cosine_distances(X[~y], X[~y])  # distance matrix within cluster 2
center2 = np.argmin(np.sum(D2, axis=1))  # point that minimizes the inertia (the medoid)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[center1][0]], [X[center1][1]], c="lime")
plt.scatter([X[center2][0]], [X[center2][1]], c="lime")
I obtain the following figure; the medoids are in green:
EDIT: Note that it could also be a convergence problem; KMedoids is not very stable in high dimension... no clustering algorithm is really stable in high dimension. It could also be a bug, but we can't conclude that it is a bug from what you have described so far.
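On the first part of your question: yes, cluster_centers_ should be actual data points, the medoids. Here is a minimal sketch to check this on toy data, and to see how much the solution moves with the initialization, assuming the scikit-learn-extra KMedoids API (cluster_centers_, medoid_indices_, inertia_, random_state):

import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 100))  # toy 100-dimensional data

for seed in range(3):
    km = KMedoids(n_clusters=2, metric="cosine", init="random", random_state=seed).fit(X)
    # the returned centers are rows of X, selected by medoid_indices_
    assert np.allclose(km.cluster_centers_, X[km.medoid_indices_])
    # if medoid_indices_ / inertia_ change a lot with the seed, convergence is part of the problem
    print(seed, km.medoid_indices_, km.inertia_)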
You used cosine similarity. Distance does not matter, only the angle. Depending on what your data looked like before you projected it, these two points may well have had the smallest sum of angles.
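A small illustration of that point: rescaling a vector changes its length but not its direction, so cosine_distances does not see it.

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 0.0, 1.0]])
print(cosine_distances(a, b))
print(cosine_distances(10 * a, b))  # identical to the line above: only the angle matters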
For the Euclidean distance, we observe the same phenomenon:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

d = 100
N = 100
np.random.seed(42)
X = np.random.normal(size=(N, d))
y = X[:, 0] > 0  # let us say that the clustering splits the space into two half-spaces
D1 = euclidean_distances(X[y], X[y])  # distance matrix within cluster 1
center1 = np.argmin(np.sum(D1, axis=1))  # point that minimizes the inertia (the medoid)
D2 = euclidean_distances(X[~y], X[~y])  # distance matrix within cluster 2
center2 = np.argmin(np.sum(D2, axis=1))  # point that minimizes the inertia (the medoid)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[center1][0]], [X[center1][1]], c="lime")
plt.scatter([X[center2][0]], [X[center2][1]], c="lime")
center1 and center2 are indices into the X[y] and X[~y] subsets in your code. X[center1] is wrong; you want X[y][center1] and X[~y][center2]. Hence you plot the wrong points.
Thanks for catching the bug. Here is the result with the bug corrected; the medoids are still not in the middle of the clusters:
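For reference, the corrected plotting lines, applying the indexing fix pointed out above (only the medoid indexing changes):

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter([X[y][center1][0]], [X[y][center1][1]], c="lime")
plt.scatter([X[~y][center2][0]], [X[~y][center2][1]], c="lime")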
That is because you only plot 2 of the 100 dimensions. The medoid is not central in each dimension individually.
Yes, that was exactly my point, thank you.