bbknn

Add cosine distance as valid metric

Open olgabot opened this issue 3 years ago • 4 comments

Hello, when running this tool recently, I got errors saying that "angular" is no longer a valid metric. It seems to have been replaced by "cosine" in both SciPy and scikit-learn, so this PR changes the default metric from "angular" to "cosine".

Allows cosine distance to be set as the metric. scipy.spatial.distance.cosine returns 1 - cosine similarity, which is equivalent to angular distance, and thus is the same thing as setting angular as the metric.

ValueError: Unknown metric angular. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'nan_euclidean', 'haversine'], or 'precomputed', or a callable
Full error message
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-4ea1dfbcbec4> in <module>
      1 sc.external.pp.bbknn(preprocessed, batch_key='species_batch', n_pcs=15, metric='cosine')
      2 
----> 3 sc.tl.umap(preprocessed)
      4 sc.pl.umap(preprocessed, **umap_plot_kws)

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/scanpy/tools/_umap.py in umap(adata, min_dist, spread, n_components, maxiter, alpha, gamma, negative_sample_rate, init_pos, random_state, a, b, copy, method, neighbors_key)
    171             neigh_params.get('metric', 'euclidean'),
    172             neigh_params.get('metric_kwds', {}),
--> 173             verbose=settings.verbosity > 3,
    174         )
    175     elif method == 'rapids':

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
   1037             random_state,
   1038             metric=metric,
-> 1039             metric_kwds=metric_kwds,
   1040         )
   1041         expansion = 10.0 / np.abs(initialisation).max()

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    304             random_state,
    305             metric=metric,
--> 306             metric_kwds=metric_kwds,
    307         )
    308 

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
    191             random_state,
    192             metric=metric,
--> 193             metric_kwds=metric_kwds,
    194         )
    195     else:

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/spectral.py in component_layout(data, n_components, component_labels, dim, random_state, metric, metric_kwds)
    120             else:
    121                 distance_matrix = pairwise_distances(
--> 122                     component_centroids, metric=metric, **metric_kwds
    123                 )
    124 

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, force_all_finite, **kwds)
   1738         raise ValueError("Unknown metric %s. "
   1739                          "Valid metrics are %s, or 'precomputed', or a "
-> 1740                          "callable" % (metric, _VALID_METRICS))
   1741 
   1742     if metric == "precomputed":

ValueError: Unknown metric angular. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'nan_euclidean', 'haversine'], or 'precomputed', or a callable

olgabot, Jan 28 '21 17:01

Let me know if you have any other suggestions! The other PR, https://github.com/Teichlab/bbknn/pull/35, is a formatting PR and may be better to merge first, then I can apply the Black formatting to this code.

olgabot, Jan 28 '21 17:01

scipy.spatial.distance.cosine() is not equivalent to whatever annoy does internally for its "angular" metric. The two return different values for the same pair of vectors:

>>> import numpy as np
>>> import scipy.spatial.distance
>>> from annoy import AnnoyIndex
>>> data = np.random.random((4,5))
>>> data
array([[0.51430605, 0.93320931, 0.50650078, 0.11883229, 0.68305612],
       [0.74908536, 0.70106014, 0.28218083, 0.50149186, 0.92821838],
       [0.92428807, 0.43345467, 0.44748583, 0.23497894, 0.15046774],
       [0.52700989, 0.08315552, 0.49098297, 0.78154297, 0.51581562]])
>>> ckd = AnnoyIndex(data.shape[1],metric='angular')
>>> for i in np.arange(data.shape[0]):
...     ckd.add_item(i,data[i,:])
... 
>>> ckd.build(10)
True
>>> ckd.get_nns_by_vector(data[0,:],5,include_distances=True)
([0, 1, 2, 3], [0.0, 0.41253361105918884, 0.6529256105422974, 0.8446558117866516])
>>> scipy.spatial.distance.cosine(data[0,:], data[1,:])
0.08509201761923013

As per annoy, the distance is 0.412. As per scipy/sklearn, the distance is 0.085.
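
For what it's worth, the two numbers can be reconciled. Annoy's README describes its "angular" metric as the Euclidean distance between unit-normalised vectors, i.e. sqrt(2 * (1 - cos)), whereas SciPy returns 1 - cos directly. Below is a minimal sketch of that relationship using the first two rows of the array above. The helper names are made up for illustration, and it uses plain Python rather than annoy or SciPy so that it runs anywhere.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the quantity scipy.spatial.distance.cosine returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def angular_from_cosine(cos_dist):
    """Euclidean distance between unit-normalised vectors: sqrt(2 * (1 - cos))."""
    return math.sqrt(2.0 * cos_dist)


# First two rows of the random array shown in the session above.
row0 = [0.51430605, 0.93320931, 0.50650078, 0.11883229, 0.68305612]
row1 = [0.74908536, 0.70106014, 0.28218083, 0.50149186, 0.92821838]

cos_dist = cosine_distance(row0, row1)
angular = angular_from_cosine(cos_dist)
print(f"cosine distance:  {cos_dist:.4f}")
print(f"angular distance: {angular:.4f}")
```

This should land close to the 0.085 and 0.412 reported above. Since sqrt(2x) is monotonic, the two metrics rank neighbours identically, but the distance values themselves differ, so they are not interchangeable wherever the raw distances are used.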


Additionally, the error you're encountering is a symptom of something else. UMAP only seems to error out this way on select data. My guess is that it kicks in the spectral component and tries to compute some distances on its own when it deems the input too disjoint. As such, it goes to retrieve the metric, sees angular, and goes up in flames.
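
One way to check whether a given graph is likely to hit this path is to count its connected components. This is a minimal sketch, assuming (from the traceback above, where spectral_layout hands off to multi_component_layout) that the extra distance computation only happens when the graph has more than one component. count_graph_components is a hypothetical helper, and the toy matrix stands in for the real connectivities graph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def count_graph_components(connectivities):
    """Return the number of connected components of an undirected kNN graph."""
    n_components, _labels = connected_components(connectivities, directed=False)
    return n_components


# Toy graph: cells 0-1-2 form one island, cells 3-4 form another.
rows = np.array([0, 1, 1, 2, 3, 4])
cols = np.array([1, 0, 2, 1, 4, 3])
toy_graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5))

n_islands = count_graph_components(toy_graph)
print(f"connected components: {n_islands}")
```

In a scanpy workflow the matrix to pass would presumably be adata.obsp['connectivities'], though where it is stored depends on the scanpy version, so treat that as an assumption. A count above 1 would suggest the spectral initialisation needs the stored metric.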

The easiest fix is to change the default metric to Euclidean, as that's something that everything speaks, including UMAP's spectral stitching thing. However, while this makes things technically run, the stitched-together manifold turns into a clump.

image

A way to avoid this spectral thing kicking in is to increase neighbors_within_batch. This creates a more interconnected graph, but also prioritises local structure over global structure, so there tends to be more batch effect present in the results. Here's the same data, run with annoy's angular, with neighbors_within_batch=10. UMAP didn't do the spectral thing and didn't need the metric, so everything ran. It's less clumpy than the prior one, but even more batched up.

image

I don't remember BBKNN/UMAP acting like this when I was developing the algorithm, and I'm unsure whether this is due to changes UMAP-side or me just being very fortunate with the data I was working with. I'm tempted to try to consult the UMAP folks for assistance on the matter.

ktpolanski, Feb 02 '21 11:02

I'm having the same issue. AFAIK the UMAP fit in scanpy takes the weighted adjacency matrix from neighbors and does not recalculate the distances, so it seems to me this may just be a sanity check on the parameters that scanpy passes to UMAP. To make sc.tl.umap work, I manually set adata.uns['neighbors']['params']['metric'] to 'cosine' after bbknn has finished. Results are usually very consistent. Also, as for neighbors_within_batch, I prefer to scale it to the size of the dataset, usually int(np.sqrt(adata.shape[0])/2/len(batches)).
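
A rough sketch of the two tricks in this comment, written against a plain dict so it runs without scanpy. Both function names are hypothetical, and the floor of 1 in the heuristic is an added assumption so that small datasets never end up with 0 neighbours. Note that relabelling only changes the metric name UMAP gets to see. It does not recompute the graph that bbknn built.

```python
import math


def relabel_neighbors_metric(uns, metric="cosine", neighbors_key="neighbors"):
    """Overwrite the metric name stored in the neighbors params; return the old one."""
    params = uns[neighbors_key]["params"]
    previous = params.get("metric")
    params["metric"] = metric
    return previous


def scaled_neighbors_within_batch(n_cells, n_batches):
    """int(sqrt(n_cells) / 2 / n_batches), floored at 1 (the floor is an assumption)."""
    return max(1, int(math.sqrt(n_cells) / 2 / n_batches))


# Stand-in for adata.uns after bbknn has run with the angular default.
uns = {"neighbors": {"params": {"metric": "angular", "n_neighbors": 6}}}

old_metric = relabel_neighbors_metric(uns)
k_within = scaled_neighbors_within_batch(n_cells=10000, n_batches=5)
print(old_metric, "->", uns["neighbors"]["params"]["metric"])
print("neighbors_within_batch:", k_within)
```

With a real AnnData object, the first call would take adata.uns and the second would take adata.shape[0] and the number of batches.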

dawe, Feb 18 '21 10:02

I've band-aided over the issue by swapping the metric to Euclidean. However, there's something weird afoot as evidenced by the gloopy UMAP I sent earlier.

The package also supports pynndescent now, which is UMAP's knn algorithm of choice.

ktpolanski, Jun 02 '21 17:06