scikit-learn icon indicating copy to clipboard operation
scikit-learn copied to clipboard

Add Angular distance, a metric version of cosine distance

Open lmcinnes opened this issue 8 years ago • 17 comments

Many algorithms, such as word2vec result in nearest neighbor computations based on cosine similarity. Unfortunately, since cosine (dis)similarity is not a metric it can't be used with kd-trees and ball-trees. This means that algorithms that make use of these structures (e.g. DBSCAN clustering, fast t-SNE, etc.) can't operate with regard to the "appropriate" (dis)similarity measure. Here we add angular (or arccos) distance which is the natural metric analogue of cosine dissimilarity to the valid metrics used for kd-trees and ball-trees. Credit for this work belongs to @brunoalano who submitted a similar change to hdbscan.

lmcinnes avatar Nov 06 '16 02:11 lmcinnes

Please add a test that this works with the trees through NearestNeighbors

jnothman avatar Nov 06 '16 11:11 jnothman

@lmcinnes Here are a few additional benchmarks in pure Python which could give some approximate idea of how this could perform with respect to other metrics (other similar bechmarks can be found here):

  • when data is not normalized, ArccosDistance could perform ~5x slower than the EuclideanDistance (not because of the acos calculation, as I though, but due to the cost of computing the norm)
  • while cosine distance is not a valid metric in general, I believe it becomes a valid metric when it is restricted to vectors on the unit sphere (i.e. L2 normalized), as it's then equal to EucledianDistance²/2. Assuming that the normalization is done beforehand this ArccosDistance could potentially be faster than EucledianDistance. However in this case, it may be simpler to just use a cosine distance metric (e.g. CosineDistanceSpherical ) as there is no reason to apply acos anymore?

What do you think?

rth avatar Nov 08 '16 12:11 rth

Sorry for the delay; I am travelling and have very limited internet access at times.

I agree that this is slower -- we are potentially recomputing quantities repeatedly. It does have the advantage that it "just works" without a requirement that people preprocess data (do the l2 normalizing). Since both kd-trees and ball-trees rely on the triangle inequality to work correctly, the dot product approach is going to result in very strange results on unnormalized data, and there is little chance to warn the user that this is the case -- at least I don't see an obvious way except threading checks for l2 normalization into the actual trees themselves, and that seems less than ideal. While performance is certainly a worthwhile goal, I prefer robustness for users. I'm certainly open to being convinced otherwise however.

lmcinnes avatar Nov 10 '16 08:11 lmcinnes

Okay, tests are added, and I'm happy to entertain options with regard to the computation approach. Is there anything more I can do at this point?

lmcinnes avatar Nov 14 '16 08:11 lmcinnes

@lmcinnes Sorry for late response. I think what could be left is a) adding unit tests for different algorithm options of NearestNeighbors (currently this metric is tested with ball tree but not with kd-tree) and b) trying ddot for better performance, as suggested by @jnothman. Then waiting until this PR is reviewed by a core developer.

P.S: by the way, the scaling benchmarks of different clustering algorithms in the hdbscan documentation are great!

rth avatar Nov 21 '16 12:11 rth

I'm unfortunately very busy with other things for a while. I'll try to get back to this in a week or two.

lmcinnes avatar Nov 30 '16 03:11 lmcinnes

np

On 30 November 2016 at 14:15, Leland McInnes [email protected] wrote:

I'm unfortunately very busy with other things for a while. I'll try to get back to this in a week or two.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scikit-learn/scikit-learn/pull/7829#issuecomment-263771579, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEz6wRTasYrIiJ7hiI_P7bEPnalApHiks5rDOpRgaJpZM4Kqcxu .

jnothman avatar Nov 30 '16 03:11 jnothman

Out of the box support would be nice. Just checking if there are any updates on this.

UgurKap avatar Jun 05 '20 02:06 UgurKap

I completely forgot about this to be honest, and I think it has been so long it now has a lot of conflicts to resolve. I don't foresee this happening soon, especially since there are other options now with the KNeighborsTransformer API.

lmcinnes avatar Jun 05 '20 04:06 lmcinnes

Still make sense to merge this PR? If yes, I could write the tests requested by @rth

brunoalano avatar Jun 05 '20 04:06 brunoalano

Yeah, actually it seems KNeighborsTransformer is more suited to my needs. So is it more preferable to use KNeighborsTransformer now? I still think it could be a good improvement, even if it does not need to be a priority.

UgurKap avatar Jun 08 '20 15:06 UgurKap

I'm working on finishing this branch.

jgonggrijp avatar Feb 01 '21 11:02 jgonggrijp

take

jgonggrijp avatar Feb 01 '21 11:02 jgonggrijp

As this work was stalled for one year, I just opened a PR intending to carry it on.

ArturoAmorQ avatar Aug 03 '22 14:08 ArturoAmorQ

@ArturoAmorQ are you planning to redo the work or to reuse what I did? Please choose the latter, because I was still planning to finish this.

jgonggrijp avatar Aug 03 '22 19:08 jgonggrijp

@ArturoAmorQ I think that the best approach would be to continue on this PR, since the code it's the same

brunoalano avatar Aug 03 '22 19:08 brunoalano

@brunoalano the code is not the same, I have added new code including tests. See #24106, which I just submitted.

jgonggrijp avatar Aug 04 '22 08:08 jgonggrijp