scikit-learn Cosine distance and L1-norm for KMeans

Cosine distance and L1-norm for KMeans

Open mattilyra opened this issue 10 years ago • 7 comments

Neither one of these added metrics has been added to the mini batch kmeans at this point.

Feb 11 '14 17:02 mattilyra

Can you explain why the mean is the same for the cosine distance? With l1 norm this is usually called k medians, right? We are adding k mediods right now, but having k medians might also be interesting.

May 23 '18 19:05 amueller

Wow it took me 4 years to see this :-/

May 23 '18 19:05 amueller

"Can you explain why the mean is the same for the cosine distance?" As it's been 4 year since I did the PR could you provide a little bit more info on what you're referring to. Did you run some data through and discovered that means are the same?

L1-norm should be k-medians, that's correct - there was another PR around the same time as this one, that got merged and then immediately refactored by Gael (?? - I think, but it was 4 years ago).

Jun 04 '18 07:06 mattilyra

We don't have k-medians right now, but I'm pretty sure there's a PR. What I was asking was this: It looks like you're using the euclidean mean to compute the cluster centers for the cosine distance. Is that correct or am I misreading? And if so, why is that the correct way to compute the mean for the cosine distance?

Jun 04 '18 18:06 amueller

Very useful PR indeed as there's no K-median right now. I think amueller's concern is valid, it's not clear if the mean under cosine distance is the same as euclidean mean. I tried and got a closed form solution for the mean, but it's too complex (involving matrix inversion). I am wondering if it is possible to just merge the K-median implementation?

Sep 04 '19 11:09 dk1013

I think it's in scope but would require substantial cleanup and then reviews. So it'll not be a quick thing.

Sep 04 '19 18:09 amueller

Thank you @mattilyra for this work. It makes sense considering K-medians for integration. Unfortunately, the code have significantly changed since you initially submitted this PR. Also the current K-Means implementations are being reconsidered to use a new back-end relying on a broader Cython class hierarchy (see #22587). In this context, it would make sense to first translate the current implementations of K-means under this class hierarchy as to bring modularity to implement K-medians.

Aug 05 '22 14:08 jjerphan

Thank you @mattilyra for this work. It makes sense considering K-medians for integration. Unfortunately, the code have significantly changed since you initially submitted this PR. Also the current K-Means implementations are being reconsidered to use a new back-end relying on a broader Cython class hierarchy (see #22587). In this context, it would make sense to first translate the current implementations of K-means under this class hierarchy as to bring modularity to implement K-medians.

Maybe we should close this and create an issue instead?

Jul 29 '23 08:07 haiatn

Agreed, and @mattilyra might not have the time to finish this now. Closing, can have a fresh PR if we feel like it's needed.

Jul 29 '23 11:07 adrinjalali

scikit-learn scikit-learn copied to clipboard

Cosine distance and L1-norm for KMeans

scikit-learn
scikit-learn copied to clipboard