bbknn
Add cosine distance as valid metric
Hello, when running this tool recently, I got errors saying "angular" is no longer a valid metric. It seems to have been replaced with "cosine" in both SciPy and scikit-learn, so this PR updates the default metric to be "cosine" instead of "angular".
Allows cosine distance to be set as the metric. scipy.spatial.distance.cosine returns 1 - cosine similarity
which is equivalent to angular distance, and thus is the same thing as setting angular
as the metric
ValueError: Unknown metric angular. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'nan_euclidean', 'haversine'], or 'precomputed', or a callable
Full error message
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-4ea1dfbcbec4> in <module>
1 sc.external.pp.bbknn(preprocessed, batch_key='species_batch', n_pcs=15, metric='cosine')
2
----> 3 sc.tl.umap(preprocessed)
4 sc.pl.umap(preprocessed, **umap_plot_kws)
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/scanpy/tools/_umap.py in umap(adata, min_dist, spread, n_components, maxiter, alpha, gamma, negative_sample_rate, init_pos, random_state, a, b, copy, method, neighbors_key)
171 neigh_params.get('metric', 'euclidean'),
172 neigh_params.get('metric_kwds', {}),
--> 173 verbose=settings.verbosity > 3,
174 )
175 elif method == 'rapids':
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
1037 random_state,
1038 metric=metric,
-> 1039 metric_kwds=metric_kwds,
1040 )
1041 expansion = 10.0 / np.abs(initialisation).max()
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
304 random_state,
305 metric=metric,
--> 306 metric_kwds=metric_kwds,
307 )
308
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
191 random_state,
192 metric=metric,
--> 193 metric_kwds=metric_kwds,
194 )
195 else:
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/umap/spectral.py in component_layout(data, n_components, component_labels, dim, random_state, metric, metric_kwds)
120 else:
121 distance_matrix = pairwise_distances(
--> 122 component_centroids, metric=metric, **metric_kwds
123 )
124
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/miniconda3/envs/tabula-microcebus-jan2021/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, force_all_finite, **kwds)
1738 raise ValueError("Unknown metric %s. "
1739 "Valid metrics are %s, or 'precomputed', or a "
-> 1740 "callable" % (metric, _VALID_METRICS))
1741
1742 if metric == "precomputed":
ValueError: Unknown metric angular. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'nan_euclidean', 'haversine'], or 'precomputed', or a callable
Let me know if you have any other suggestions! The other PR, https://github.com/Teichlab/bbknn/pull/35, is a formatting PR and may be better to merge first, then I can apply the Black formatting to this code.
scipy.spatial.distance.cosine() is not equivalent to whatever annoy does internally.
>>> import numpy as np
>>> import scipy.spatial.distance
>>> from annoy import AnnoyIndex
>>> data = np.random.random((4,5))
>>> data
array([[0.51430605, 0.93320931, 0.50650078, 0.11883229, 0.68305612],
[0.74908536, 0.70106014, 0.28218083, 0.50149186, 0.92821838],
[0.92428807, 0.43345467, 0.44748583, 0.23497894, 0.15046774],
[0.52700989, 0.08315552, 0.49098297, 0.78154297, 0.51581562]])
>>> ckd = AnnoyIndex(data.shape[1],metric='angular')
>>> for i in np.arange(data.shape[0]):
... ckd.add_item(i,data[i,:])
...
>>> ckd.build(10)
True
>>> ckd.get_nns_by_vector(data[0,:],5,include_distances=True)
([0, 1, 2, 3], [0.0, 0.41253361105918884, 0.6529256105422974, 0.8446558117866516])
>>> scipy.spatial.distance.cosine(data[0,:], data[1,:])
0.08509201761923013
As per annoy, the distance is 0.412. As per scipy/sklearn, the distance is 0.085.
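For reference, the two numbers above appear to be related by a fixed transform. Annoy's documentation describes its "angular" metric as the Euclidean distance between normalized vectors, i.e. sqrt(2 * (1 - cos(u, v))), whereas scipy returns 1 - cos(u, v) directly. The sketch below recomputes both values from rows 0 and 1 of the matrix printed above, using only the standard library. The helper names are illustrative and not part of bbknn, annoy, or scipy.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, the quantity scipy.spatial.distance.cosine returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def annoy_angular_from_cosine(cos_dist):
    """Annoy's documented 'angular' distance: sqrt(2 * (1 - cos(u, v)))."""
    return math.sqrt(2.0 * cos_dist)

# Rows 0 and 1 of the random matrix shown in the session above
# (values as printed, so rounded to 8 decimals).
row0 = [0.51430605, 0.93320931, 0.50650078, 0.11883229, 0.68305612]
row1 = [0.74908536, 0.70106014, 0.28218083, 0.50149186, 0.92821838]

cos_dist = cosine_distance(row0, row1)
angular = annoy_angular_from_cosine(cos_dist)

print(round(cos_dist, 4))  # close to the 0.085 reported by scipy above
print(round(angular, 4))   # close to the 0.412 reported by annoy above
```

Because sqrt(2x) is monotonic, the neighbour ordering is the same under both metrics, but the distance values themselves differ, so anything derived from the raw distances would not be interchangeable.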
Additionally, the issue you're encountering is the manifestation of something else happening. UMAP only seems to error out this way on select data. My guess is that it kicks in the spectral component and tries to run some distance stuff on its own when it deems the input too disjoint. As such, it goes to retrieve the metric, sees angular, and goes up in flames.
The easiest fix is to change the default metric to Euclidean, as that's something that everything speaks, including UMAP's spectral stitching thing. However, while this makes things technically run, the stitched together manifold turns into a clump.
A way to avoid this spectral thing kicking in is increasing neighbors_within_batch. This creates a more interconnected graph, but also prioritises local structure over global structure. There tends to be more batch effect present in the results. Here's the same data, run with annoy's angular, with neighbors_within_batch=10. UMAP didn't do the spectral thing, didn't need the metric, everything ran, it's less clumpy than the prior one, but even more batched up.
I don't remember BBKNN/UMAP acting like this when I was developing the algorithm, and I'm unsure whether this is due to changes UMAP-side or me just being very fortunate with the data I was working with. I'm tempted to try to consult the UMAP folks for assistance on the matter.
I'm having the same issue. AFAIK UMAP fit in scanpy takes the weighted adjacency matrix from neighbors and does not recalculate the distances, hence it seems to me it may be a sanity check on parameters as passed to UMAP by scanpy. To make sc.tl.umap work I manually set adata.uns['neighbors']['params']['metric'] to cosine after bbknn has finished. Results are usually very much consistent. Also, as for the neighbors_within_batch, I prefer to scale it to the size of the dataset, usually int(np.sqrt(adata.shape[0])/2/len(batches)).
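A minimal sketch of the two tweaks described in this comment, written so it runs without scanpy installed. The function names are hypothetical, and a SimpleNamespace stands in for an AnnData object whose .uns has the layout mentioned above. The max(1, ...) floor is an added assumption, since the bare formula rounds down to 0 for small datasets with many batches.

```python
import math
from types import SimpleNamespace

def patch_neighbors_metric(adata, metric="cosine"):
    """Overwrite the metric recorded in adata.uns after bbknn has run.

    Only the stored parameter changes; the neighbour graph itself is untouched.
    """
    adata.uns["neighbors"]["params"]["metric"] = metric
    return adata

def scaled_neighbors_within_batch(n_cells, n_batches):
    """Heuristic from the comment above: int(sqrt(n_cells) / 2 / n_batches).

    The floor of 1 is an assumption added here, not part of the original formula.
    """
    return max(1, int(math.sqrt(n_cells) / 2 / n_batches))

# Stand-in for an AnnData object that bbknn has already processed.
mock_adata = SimpleNamespace(
    uns={"neighbors": {"params": {"metric": "angular"}}}
)

patch_neighbors_metric(mock_adata)
print(mock_adata.uns["neighbors"]["params"]["metric"])  # cosine
print(scaled_neighbors_within_batch(10000, 5))          # 10
```

With a real AnnData object, the same assignment would be made after the bbknn call and before sc.tl.umap, and the heuristic's result would be passed as neighbors_within_batch.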
I've band-aided over the issue by swapping the metric to Euclidean. However, there's something weird afoot as evidenced by the gloopy UMAP I sent earlier.
The package also supports pynndescent now, which is UMAP's knn algorithm of choice.