Jaccard transformation function clarification

Open. Rridley7 opened this issue 1 year ago • 1 comment

I had a quick clarification question about the jaccard distance in this package vs. the scipy spatial version, when considering non-binary data. The version in this package: https://github.com/lmcinnes/umap/blob/5c79fa60ce536405339da227bfd885635b68735d/umap/distances.py#L382

jaccard(np.array([1,5,0,1]),np.array([1,1.45,0,1]))
## 0.0

The version in scipy

scipy.spatial.distance.jaccard(np.array([1,5,0,1]),np.array([1,1.45,0,1]))
## 0.333

When running UMAP, which of these versions is referenced when calculating distances via jaccard?

As an aside, I noticed that a large matrix (~8M x 300) of non-binary data runs much faster than the same matrix first converted to binary values, e.g. with array.astype(bool).astype(int). This is what led me to check for a difference between the two functions.

Rridley7, May 27 '23 04:05

When running UMAP, it will for the most part be either the one you cite first or the corresponding one from pynndescent. For small datasets (the cutoff is a somewhat arbitrary 4096 samples), you may get the scipy version, since in those cases UMAP just uses sklearn's pairwise_distances to compute the full distance matrix.

I'm not sure what is going on with the scipy version for the data you cite. You can provide a weight vector to do weighted Jaccard, but that's a third argument, so I can't see how you can get anything but a Jaccard distance of 0: the two vectors, despite having different values, share exactly the same non-zeros.

lmcinnes, May 28 '23 15:05
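
For readers trying to reconcile the two numbers in this thread, here is a minimal pure-Python sketch of two ways a "Jaccard distance" can treat non-binary input. The first function binarizes each entry (zero vs. non-zero) before comparing, which matches the reasoning above that identical non-zero patterns give a distance of 0. The second is a hypothetical variant that compares raw values at every position where either vector is non-zero. It happens to reproduce the 0.333 reported in the question, but whether any particular scipy release computes it this way is an assumption, not something confirmed in this thread. Both function names are invented for illustration and are not part of umap, pynndescent, or scipy.

```python
def jaccard_binarized(x, y):
    """Jaccard distance on the supports: only zero vs. non-zero matters."""
    union = 0  # positions where at least one vector is non-zero
    both = 0   # positions where both vectors are non-zero
    for a, b in zip(x, y):
        a_set = a != 0
        b_set = b != 0
        union += a_set or b_set
        both += a_set and b_set
    if union == 0:
        return 0.0  # two all-zero vectors are treated as identical
    return (union - both) / union


def jaccard_value_compare(x, y):
    """Hypothetical variant: a position is a mismatch when the raw values differ."""
    union = 0
    mismatched = 0
    for a, b in zip(x, y):
        if a != 0 or b != 0:
            union += 1
            mismatched += a != b
    if union == 0:
        return 0.0
    return mismatched / union


u = [1, 5, 0, 1]
v = [1, 1.45, 0, 1]

print(jaccard_binarized(u, v))                # 0.0
print(round(jaccard_value_compare(u, v), 3))  # 0.333
```

On strictly 0/1 input the two definitions agree, so binarizing the data up front (for example with `array.astype(bool)`) removes any dependence on which of the two behaviours a given implementation follows.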