ValueError: Precomputed metric requires shape (n_queries, n_indexed)
I just wanted to bring this error message to your attention. I believe it is a little misleading, because the algorithm works for n_neighbors=15 but not for n_neighbors=3. Do you know what in the backend prevents it from working for n_neighbors=3 and throws the shape error instead?
umap.__version__
0.3.7
import numpy as np
from umap import UMAP

# Shape?
print(X.shape)
# (5843, 5843)

# Symmetric?
def check_symmetric(a, tol=1e-8):
    return np.allclose(a, a.T, atol=tol)

print(check_symmetric(X))
# True

# Nulls? (X is a pandas DataFrame)
print(np.any(X.isnull()))
# False

# Diagonal?
print(np.unique(np.diagonal(X.values)))
# [0.]

# UMAP Precomputed
model = UMAP(n_neighbors=3, metric="precomputed")
embeddings = model.fit_transform(X)
Error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-40-44805956fe15> in <module>
18 # UMAP Precomputed
19 model = UMAP(n_neighbors=3, metric="precomputed")
---> 20 embeddings = model.fit_transform(X)
~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)
1564 Embedding of the training data in low-dimensional space.
1565 """
-> 1566 self.fit(X, y)
1567 return self.embedding_
1568
~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)
1536 self.metric,
1537 self._metric_kwds,
-> 1538 self.verbose,
1539 )
1540
~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, verbose)
941 random_state,
942 metric=metric,
--> 943 metric_kwds=metric_kwds,
944 )
945 expansion = 10.0 / initialisation.max()
~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
238 random_state,
239 metric=metric,
--> 240 metric_kwds=metric_kwds,
241 )
242
~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
120 dim,
121 metric=metric,
--> 122 metric_kwds=metric_kwds,
123 )
124 else:
~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in component_layout(data, n_components, component_labels, dim, metric, metric_kwds)
51
52 distance_matrix = pairwise_distances(
---> 53 component_centroids, metric=metric, **metric_kwds
54 )
55 affinity_matrix = np.exp(-distance_matrix ** 2)
~/anaconda/envs/µ_env/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1381
1382 if metric == "precomputed":
-> 1383 X, _ = check_pairwise_arrays(X, Y, precomputed=True)
1384 return X
1385 elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
~/anaconda/envs/µ_env/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
118 "(n_queries, n_indexed). Got (%d, %d) "
119 "for %d indexed." %
--> 120 (X.shape[0], X.shape[1], Y.shape[0]))
121 elif X.shape[1] != Y.shape[1]:
122 raise ValueError("Incompatible dimension for X and Y matrices: "
ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (291, 5843) for 291 indexed.
Ah, that's the multi-component spectral initialisation failing, because it doesn't support pre-computed metrics. I'm on vacation at the moment, but I can make a better error message when I get back.
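Since the failing path is only taken when the k-NN graph splits into multiple connected components (the `(291, 5843)` in the error suggests 291 components), one way to check whether a given `n_neighbors` will hit it is to count the components of the k-NN graph built from the distance matrix. A minimal sketch, assuming scikit-learn and SciPy are available (`n_knn_components` is a hypothetical helper, not part of UMAP):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def n_knn_components(D, n_neighbors):
    """Count connected components of the symmetrised k-NN graph
    built from a precomputed distance matrix D."""
    g = kneighbors_graph(D, n_neighbors=n_neighbors, metric="precomputed")
    # symmetrise: an edge in either direction counts as a connection
    return connected_components(g + g.T, directed=False)[0]
```

If this returns more than 1 for your chosen `n_neighbors`, the multi-component spectral initialisation is triggered, which explains why n_neighbors=15 works while n_neighbors=3 does not.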
It has been a while, but this seems to be the cause: https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
That section is about SpectralClustering, but the same goes for SpectralEmbedding, which is used by UMAP: both expect an affinity/similarity matrix, not a distance matrix.
This could probably be solved by using the conversion suggested in the link:
similarity = np.exp(-beta * distance / distance.std())
And then passing similarity to SpectralEmbedding within UMAP.
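A minimal sketch of that conversion, assuming the `beta` scaling from the scikit-learn docs and feeding the result to `SpectralEmbedding` with `affinity="precomputed"` (the helper name is my own, not UMAP's):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def distance_to_affinity(distance, beta=1.0):
    """Convert a distance matrix into a similarity/affinity matrix,
    following the recipe in the scikit-learn spectral clustering docs."""
    return np.exp(-beta * distance / distance.std())

# usage, with D an (n, n) precomputed distance matrix:
# A = distance_to_affinity(D)
# emb = SpectralEmbedding(n_components=2, affinity="precomputed").fit_transform(A)
```

Note the diagonal of the affinity matrix becomes 1 (distance 0 maps to exp(0)), and all entries lie in (0, 1], which is what the spectral routines expect.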
I also came across this problem. I calculated 3 distance matrices with 3 different (custom) metrics, and only one failed. I am not sure whether this makes the other two results wrong, but looking at the solution of sleighsoft, they probably are? Yet the results do not look that wrong, which is kind of a dangerous thing. As a temporary workaround I now use init='random', which seems to work.
Hi, sorry, is this issue being looked into? If not, could you suggest methods to recreate the original data points when only a distance matrix is available? The number of dimensions N(dim) is unknown in my case, but I assume it is possible to find a perfect embedding when selecting N(dim)=N(samples).
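One standard technique for this (not something UMAP provides) is classical MDS: double-center the squared distance matrix to recover a Gram matrix, then eigendecompose it. When the distance matrix is exactly Euclidean, this reproduces the pairwise distances perfectly, and choosing n_dim = n_samples is always enough. A sketch using only NumPy:

```python
import numpy as np

def classical_mds(D, n_dim=None):
    """Recover coordinates from a (Euclidean) distance matrix D via
    classical MDS: double-center -0.5 * D**2, then eigendecompose."""
    n = D.shape[0]
    if n_dim is None:
        n_dim = n
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered points
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_dim]     # keep the largest n_dim
    w, V = w[idx], V[:, idx]
    w = np.clip(w, 0, None)               # drop tiny negative eigenvalues
    return V * np.sqrt(w)                 # coordinates, one row per point
```

The recovered points match the originals only up to rotation, reflection, and translation, which is the best any method can do from distances alone. If D is not Euclidean, the negative eigenvalues that get clipped measure how far it deviates.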
Hello,
Is this issue being looked into at all? With the new HDBSCAN algorithm being implemented in scikit-learn and its impending medoid/centroid features, I would hope somebody would help solve this issue.