scikit-learn
scikit-learn copied to clipboard
Cosine distance and L1-norm for KMeans
Neither one of these added metrics has been added to the mini batch kmeans at this point.
Can you explain why the mean is the same for the cosine distance? With l1 norm this is usually called k medians, right? We are adding k mediods right now, but having k medians might also be interesting.
Wow it took me 4 years to see this :-/
"Can you explain why the mean is the same for the cosine distance?" As it's been 4 year since I did the PR could you provide a little bit more info on what you're referring to. Did you run some data through and discovered that means are the same?
L1-norm should be k-medians, that's correct - there was another PR around the same time as this one, that got merged and then immediately refactored by Gael (?? - I think, but it was 4 years ago).
We don't have k-medians right now, but I'm pretty sure there's a PR. What I was asking was this: It looks like you're using the euclidean mean to compute the cluster centers for the cosine distance. Is that correct or am I misreading? And if so, why is that the correct way to compute the mean for the cosine distance?
Very useful PR indeed as there's no K-median right now. I think amueller's concern is valid, it's not clear if the mean under cosine distance is the same as euclidean mean. I tried and got a closed form solution for the mean, but it's too complex (involving matrix inversion). I am wondering if it is possible to just merge the K-median implementation?
I think it's in scope but would require substantial cleanup and then reviews. So it'll not be a quick thing.
Thank you @mattilyra for this work. It makes sense considering K-medians for integration. Unfortunately, the code have significantly changed since you initially submitted this PR. Also the current K-Means implementations are being reconsidered to use a new back-end relying on a broader Cython class hierarchy (see #22587). In this context, it would make sense to first translate the current implementations of K-means under this class hierarchy as to bring modularity to implement K-medians.
Thank you @mattilyra for this work. It makes sense considering K-medians for integration. Unfortunately, the code have significantly changed since you initially submitted this PR. Also the current K-Means implementations are being reconsidered to use a new back-end relying on a broader Cython class hierarchy (see #22587). In this context, it would make sense to first translate the current implementations of K-means under this class hierarchy as to bring modularity to implement K-medians.
Maybe we should close this and create an issue instead?
Agreed, and @mattilyra might not have the time to finish this now. Closing, can have a fresh PR if we feel like it's needed.