spherecluster icon indicating copy to clipboard operation
spherecluster copied to clipboard

Initialization is using euclidean distance

Open xinyangz opened this issue 6 years ago • 3 comments

https://github.com/jasonlaska/spherecluster/blob/701b0b1909088a56e353b363b2672580d4fe9d93/spherecluster/spherical_kmeans.py#L44-L48

I might be getting this wrong, but the code here seems to be using initialization function from sklearn. This could cause issue since the kmeans++ initialization in sklearn is based on euclidean distance. It should be replaced with cosine distance.

xinyangz avatar Jan 20 '19 00:01 xinyangz

Thank you, I'll look into it.

jasonlaska avatar Jan 20 '19 00:01 jasonlaska

May I ask what the status of this issue is? I looked into the commits log but to no avail. I am currently using init="random" as a safer option, and would like to switch to k-means++ if this issue is resolved.

Many thanks!

rtrad89 avatar Jul 16 '19 22:07 rtrad89

Hi @rtrad89 or @t103z I'm not sure how much this will actually matter since this is just used for intiailization (and sklearn has chosen not to provide a distance-configurable version of kmeans++).

The code for the initialization is here: https://github.com/scikit-learn/scikit-learn/blob/0.22.X/sklearn/cluster/_k_means.py#L41, do you want to take a stab at a PR to port it for cosine dist?

jasonlaska avatar Dec 27 '19 23:12 jasonlaska