scikit-learn
Add Angular distance, a metric version of cosine distance
Many algorithms, such as word2vec result in nearest neighbor computations based on cosine similarity. Unfortunately, since cosine (dis)similarity is not a metric it can't be used with kd-trees and ball-trees. This means that algorithms that make use of these structures (e.g. DBSCAN clustering, fast t-SNE, etc.) can't operate with regard to the "appropriate" (dis)similarity measure. Here we add angular (or arccos) distance which is the natural metric analogue of cosine dissimilarity to the valid metrics used for kd-trees and ball-trees. Credit for this work belongs to @brunoalano who submitted a similar change to hdbscan.
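For illustration, a minimal pure-Python sketch of the proposed metric (the function name `angular_distance` is ours; the PR implements this as a compiled `DistanceMetric` subclass):

```python
import numpy as np

def angular_distance(x, y):
    """Arccos of the cosine similarity; a true metric on the unit sphere."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to guard against floating-point drift just outside [-1, 1].
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(angular_distance(x, y))  # pi/2 for orthogonal vectors
```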
Please add a test that this works with the trees through NearestNeighbors
@lmcinnes Here are a few additional benchmarks in pure Python which could give an approximate idea of how this could perform with respect to other metrics (other similar benchmarks can be found here):

- when data is not normalized, `ArccosDistance` could perform ~5x slower than `EuclideanDistance` (not because of the `acos` calculation, as I thought, but due to the cost of computing the norm)
- while cosine distance is not a valid metric in general, I believe it becomes a valid metric when restricted to vectors on the unit sphere (i.e. L2 normalized), as it is then equal to `EuclideanDistance² / 2`. Assuming that the normalization is done beforehand, this `ArccosDistance` could potentially be faster than `EuclideanDistance`. However, in this case it may be simpler to just use a cosine distance metric (e.g. `CosineDistanceSpherical`), as there is no reason to apply `acos` anymore?
What do you think?
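The identity claimed above is easy to check numerically: for L2-normalized vectors, `‖x − y‖² = 2 − 2·x·y`, so the cosine distance `1 − x·y` equals half the squared Euclidean distance. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
# L2-normalize so both vectors lie on the unit sphere.
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

cosine_distance = 1.0 - np.dot(x, y)
half_sq_euclidean = np.linalg.norm(x - y) ** 2 / 2.0
print(np.isclose(cosine_distance, half_sq_euclidean))  # True
```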
Sorry for the delay; I am travelling and have very limited internet access at times.
I agree that this is slower -- we are potentially recomputing quantities repeatedly. It does have the advantage that it "just works" without a requirement that people preprocess data (do the l2 normalizing). Since both kd-trees and ball-trees rely on the triangle inequality to work correctly, the dot product approach is going to result in very strange results on unnormalized data, and there is little chance to warn the user that this is the case -- at least I don't see an obvious way except threading checks for l2 normalization into the actual trees themselves, and that seems less than ideal. While performance is certainly a worthwhile goal, I prefer robustness for users. I'm certainly open to being convinced otherwise however.
Okay, tests are added, and I'm happy to entertain options with regard to the computation approach. Is there anything more I can do at this point?
@lmcinnes Sorry for the late response. I think what is left is a) adding unit tests for the different `algorithm` options of `NearestNeighbors` (currently this metric is tested with the ball tree but not with the kd-tree) and b) trying `ddot` for better performance, as suggested by @jnothman. Then waiting until this PR is reviewed by a core developer.
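In the meantime, the behavior can already be exercised today by passing a Python callable as the metric, which `ball_tree` accepts (kd-trees are restricted to their built-in metrics, which is part of what this PR would change). This is only a sketch of such a test; the callable path is much slower than the compiled metric the PR adds:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def angular(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

X = np.random.default_rng(42).normal(size=(50, 3))
# Ball trees accept a user-supplied metric callable; a registered
# compiled DistanceMetric (as in this PR) would be the fast path.
nn = NearestNeighbors(n_neighbors=3, algorithm="ball_tree", metric=angular)
nn.fit(X)
dist, idx = nn.kneighbors(X[:1])
print(idx)  # each query point's nearest neighbor is itself
```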
P.S.: by the way, the scaling benchmarks of different clustering algorithms in the hdbscan documentation are great!
I'm unfortunately very busy with other things for a while. I'll try to get back to this in a week or two.
np
> On 30 November 2016 at 14:15, Leland McInnes [email protected] wrote:
> I'm unfortunately very busy with other things for a while. I'll try to get back to this in a week or two.
Out-of-the-box support would be nice. Just checking if there are any updates on this.
I completely forgot about this to be honest, and I think it has been so long it now has a lot of conflicts to resolve. I don't foresee this happening soon, especially since there are other options now with the KNeighborsTransformer API.
Does it still make sense to merge this PR? If so, I could write the tests requested by @rth.
Yeah, actually it seems KNeighborsTransformer is more suited to my needs. So is it preferable to use KNeighborsTransformer now? I still think this could be a good improvement, even if it does not need to be a priority.
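The KNeighborsTransformer workaround mentioned here might look roughly like the following sketch: brute force supports `metric="cosine"` directly, and the resulting sparse neighbor graph can then be fed to estimators such as DBSCAN or t-SNE via `metric="precomputed"`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer

X = np.random.default_rng(0).normal(size=(30, 4))
# Cosine is not tree-compatible, so this falls back to brute force,
# but the precomputed graph sidesteps the missing tree metric.
transformer = KNeighborsTransformer(mode="distance", n_neighbors=5, metric="cosine")
graph = transformer.fit_transform(X)
print(graph.shape)  # (30, 30) sparse distance graph
```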
I'm working on finishing this branch.
take
As this work was stalled for one year, I just opened a PR intending to carry it on.
@ArturoAmorQ are you planning to redo the work or to reuse what I did? Please choose the latter, because I was still planning to finish this.
@ArturoAmorQ I think that the best approach would be to continue on this PR, since the code is the same.
@brunoalano the code is not the same, I have added new code including tests. See #24106, which I just submitted.