Inconsistency in KMeans clustering result

Open gobear6212 opened this issue 2 years ago • 3 comments

Hi, I was trying to retrain PGP but ran into an issue with scikit-learn's KMeans implementation. Sometimes, when the model computes the Ward distances, it throws a broadcast exception at dists = wts * centroid_dists + np.diag(np.inf * np.ones(len(cluster_counts))) because the shapes of wts and centroid_dists differ.

The root cause seems to be that cluster_lbls and cluster_ctrs are inconsistent, so calling np.unique() on the cluster labels returns the wrong cluster_cnts. In scikit-learn's documentation, I noticed the following:

cluster_centers_ : ndarray of shape (n_clusters, n_features). Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

May I ask how I should handle this exception?

gobear6212, Apr 09 '22 13:04

It looks like K-means returned an empty cluster. This is very strange and has not happened during any of my training runs. Can you consistently reproduce the error? Were any model parameters changed?

nachiket92, Apr 10 '22 18:04
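
For readers hitting the same error, here is a minimal sketch of how an empty cluster produces the shape mismatch described above, together with one possible guard. The label and center arrays are made up for illustration, and consistent_counts_and_centers is a hypothetical helper, not part of PGP or scikit-learn. It assumes that dropping the centers with no assigned points is acceptable, which means the ranking step may see fewer than k clusters.

```python
import numpy as np

def consistent_counts_and_centers(labels, centers):
    """Hypothetical helper: keep only the centers that have at least one point.

    Returns (counts, centers) whose first dimensions match.
    """
    # np.unique only reports ids that actually occur in `labels`,
    # so an empty cluster silently disappears from the counts.
    ids, counts = np.unique(labels, return_counts=True)
    return counts, centers[ids]

# Illustrative data: 4 centers, but no point is assigned to cluster 2.
labels = np.array([0, 0, 1, 3, 3, 3])
centers = np.arange(8, dtype=float).reshape(4, 2)

# Counting the way the issue describes gives 3 counts for 4 centers.
raw_counts = np.unique(labels, return_counts=True)[1]
print(raw_counts.shape, centers.shape)

# After filtering, counts and centers have the same length again.
counts, kept = consistent_counts_and_centers(labels, centers)
print(counts.shape, kept.shape)
```

Using np.bincount(labels, minlength=len(centers)) instead would keep all k centers, but a zero count then needs separate handling in a weight such as n1 * n2 / (n1 + n2), since two empty clusters would give 0 / 0.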

I didn't change the model parameters, but I tried to introduce additional edges (e.g. to the left/right of the lane instead of only the proximal ones). As far as I recall, the exception occurred only once or twice, so I can't reproduce it. I suspected it was related to bad initialization of the clusters, so I removed init='random' from KMeans and let it use the default k-means++ strategy, which seems to work for now. However, I'm not sure whether the same exception will occur again.

gobear6212, Apr 11 '22 09:04
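
Since switching to k-means++ is reported to help without a guarantee, a defensive variant is to check the fitted labels and refit with a different seed if any cluster is empty. The sketch below is one way to do that under stated assumptions: fit_kmeans_nonempty is a hypothetical helper, the n_init, max_iter, max_tries and seed values are placeholders rather than PGP's settings, and the input data is random.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_nonempty(x, n_clusters, max_tries=5, seed=0):
    """Hypothetical helper: refit with a new seed until every cluster has points."""
    for attempt in range(max_tries):
        km = KMeans(
            n_clusters=n_clusters,
            init="k-means++",   # scikit-learn's default initialization
            n_init=1,           # placeholder value
            max_iter=100,       # placeholder value
            random_state=seed + attempt,
        ).fit(x)
        # bincount with minlength reports a 0 for clusters that got no points.
        counts = np.bincount(km.labels_, minlength=n_clusters)
        if np.all(counts > 0):
            return km, counts
    raise RuntimeError("KMeans kept returning an empty cluster")

# Illustrative input: 200 random 2-D points, 10 clusters.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
km, counts = fit_kmeans_nonempty(x, n_clusters=10)
print(counts.shape, km.cluster_centers_.shape)
```

Raising after max_tries makes a persistent empty cluster visible at the clustering step instead of surfacing later as a broadcast error in the distance computation.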

Have you ever met the same error after that? I removed init='random' as you said, but it had no effect. May I ask how I should handle this exception? @gobear6212

qihuihu20, Jul 18 '22 03:07