
ValueError: Precomputed metric requires shape (n_queries, n_indexed)

Open jolespin opened this issue 6 years ago • 5 comments

I wanted to bring this error message to your attention. I think it is a little misleading, because the algorithm works for n_neighbors=15 but not for n_neighbors=3. Do you know what in the backend prevents it from working for n_neighbors=3 and raises the shape message?

umap.__version__
0.3.7

# Shape?
print(X.shape)
(5843, 5843)

# Symmetric?
def check_symmetric(a, tol=1e-8):
    return np.allclose(a, a.T, atol=tol)
print(check_symmetric(X))
True

# Nulls?
print(np.any(X.isnull()))
False

# Diagonal?
print(np.unique(np.diagonal(X.values)))
[0.]

# UMAP Precomputed
model = UMAP(n_neighbors=3, metric="precomputed")
embeddings = model.fit_transform(X)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-44805956fe15> in <module>
     18 # UMAP Precomputed
     19 model = UMAP(n_neighbors=3, metric="precomputed")
---> 20 embeddings = model.fit_transform(X)

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)
   1564             Embedding of the training data in low-dimensional space.
   1565         """
-> 1566         self.fit(X, y)
   1567         return self.embedding_
   1568 

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)
   1536             self.metric,
   1537             self._metric_kwds,
-> 1538             self.verbose,
   1539         )
   1540 

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, verbose)
    941             random_state,
    942             metric=metric,
--> 943             metric_kwds=metric_kwds,
    944         )
    945         expansion = 10.0 / initialisation.max()

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    238             random_state,
    239             metric=metric,
--> 240             metric_kwds=metric_kwds,
    241         )
    242 

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
    120             dim,
    121             metric=metric,
--> 122             metric_kwds=metric_kwds,
    123         )
    124     else:

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in component_layout(data, n_components, component_labels, dim, metric, metric_kwds)
     51 
     52     distance_matrix = pairwise_distances(
---> 53         component_centroids, metric=metric, **metric_kwds
     54     )
     55     affinity_matrix = np.exp(-distance_matrix ** 2)

~/anaconda/envs/µ_env/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1381 
   1382     if metric == "precomputed":
-> 1383         X, _ = check_pairwise_arrays(X, Y, precomputed=True)
   1384         return X
   1385     elif metric in PAIRWISE_DISTANCE_FUNCTIONS:

~/anaconda/envs/µ_env/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
    118                              "(n_queries, n_indexed). Got (%d, %d) "
    119                              "for %d indexed." %
--> 120                              (X.shape[0], X.shape[1], Y.shape[0]))
    121     elif X.shape[1] != Y.shape[1]:
    122         raise ValueError("Incompatible dimension for X and Y matrices: "

ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (291, 5843) for 291 indexed.

jolespin avatar Jan 06 '19 07:01 jolespin

Ah, that's the multi-component spectral initialisation failing, because it doesn't support pre-computed metrics. I'm on vacation at the moment, but I can make a better error message when I get back.
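One way to see why n_neighbors=3 fails while n_neighbors=15 works is to count the connected components of the nearest-neighbour graph built from the distance matrix: the traceback ends with 291 component centroids, which suggests the graph split into many pieces at n_neighbors=3. The sketch below is only a diagnostic under assumptions. It uses scikit-learn's `NearestNeighbors` rather than UMAP's own graph construction, so the counts may differ somewhat from what UMAP sees, and the helper name `count_knn_components` and the demo data are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors


def count_knn_components(distance, n_neighbors):
    """Count connected components of the k-NN graph of a precomputed distance matrix.

    Hypothetical helper: an approximation of the graph UMAP builds, not UMAP's code.
    """
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="precomputed").fit(distance)
    # Each row links a point to its n_neighbors nearest points
    # (the point itself is included, at distance 0).
    graph = nn.kneighbors_graph(distance, mode="connectivity")
    # Treat the graph as undirected when looking for components.
    n_components, _ = connected_components(graph, directed=False)
    return n_components


# Demo data (an assumption for illustration): two well-separated clusters of 10 points.
rng = np.random.default_rng(0)
demo_points = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(10, 2)),
    rng.normal(loc=100.0, scale=1.0, size=(10, 2)),
])
demo_distance = pairwise_distances(demo_points)

few = count_knn_components(demo_distance, n_neighbors=3)    # clusters stay separate
many = count_knn_components(demo_distance, n_neighbors=15)  # neighbours must cross clusters
print(few, many)
```

If the count is greater than 1 for the chosen n_neighbors, the multi-component spectral initialisation path is the one being taken, which is the path that does not support precomputed metrics.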

lmcinnes avatar Jan 07 '19 06:01 lmcinnes

It has been a while but this seems to be the cause: https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering

This is SpectralClustering but the same goes for SpectralEmbedding which is used by UMAP. They both expect an affinity/similarity matrix and not a distance matrix.

This could probably be solved by using the solution provided in the link:

similarity = np.exp(-beta * distance / distance.std())

And then passing similarity to SpectralEmbedding within UMAP.
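A minimal sketch of that idea, done on the user side rather than inside UMAP: convert the distance matrix to a similarity matrix with the formula above, then run scikit-learn's `SpectralEmbedding` with `affinity="precomputed"` to obtain initial coordinates. Note that the traceback above goes through UMAP's own `umap/spectral.py`, so this is not a patch to that code. The function names, the default `beta=1.0`, and the demo data are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding


def distance_to_similarity(distance, beta=1.0):
    """Map distances to similarities in (0, 1]; beta is an assumed scale parameter."""
    distance = np.asarray(distance, dtype=float)
    return np.exp(-beta * distance / distance.std())


def spectral_init_from_distance(distance, n_components=2, beta=1.0, random_state=0):
    """Hypothetical helper: spectral coordinates computed from a distance matrix."""
    similarity = distance_to_similarity(distance, beta=beta)
    embedder = SpectralEmbedding(
        n_components=n_components,
        affinity="precomputed",  # the input is already an affinity matrix
        random_state=random_state,
    )
    return embedder.fit_transform(similarity)


# Demo data (an assumption for illustration): Euclidean distances of random points.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(30, 3))
demo_distance = np.linalg.norm(demo_points[:, None] - demo_points[None], axis=-1)

demo_similarity = distance_to_similarity(demo_distance)
demo_init = spectral_init_from_distance(demo_distance)
print(demo_init.shape)
```

UMAP's `init` parameter also accepts an array of initial positions, so the result could in principle be passed as `UMAP(n_neighbors=3, metric="precomputed", init=demo_init)`. Whether that avoids the error on version 0.3.7 has not been verified here.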

sleighsoft avatar Aug 02 '19 14:08 sleighsoft

I also came across this problem. I calculated 3 distance matrices with 3 different (custom) metrics, and only one of them failed. I am not sure whether this makes the other two results wrong, but judging by sleighsoft's solution they probably are? Yet the results do not look that wrong, which is kind of dangerous. As a temporary workaround I now use init='random', which seems to work.

Pfeil avatar Aug 23 '19 12:08 Pfeil

Hi, sorry, is this issue being looked into? Otherwise, maybe you could suggest methods to recreate the original data points if you only have a distance matrix? N(dim) is unknown in my case, but I assume it is possible to find a perfect embedding when selecting N(dim)=N(samples).
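For reference, classical multidimensional scaling (MDS) is one standard way to recover coordinates from a distance matrix. It is a different technique from UMAP and is not something the maintainers suggested in this thread. The sketch below assumes the distances are Euclidean; in that case the points can be reproduced exactly (up to rotation and translation) in at most N(samples) - 1 dimensions. For non-Euclidean distances the negative eigenvalues are dropped and the result is only an approximation. The function name `classical_mds` and the tolerance are hypothetical choices.

```python
import numpy as np


def classical_mds(distance, n_components=None, tol=1e-10):
    """Recover point coordinates from a (Euclidean) distance matrix via classical MDS."""
    d = np.asarray(distance, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Eigendecomposition of the symmetric Gram matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep only clearly positive eigenvalues; the rest are numerical noise
    # or a sign that the distances are not Euclidean.
    keep = eigvals > tol * np.abs(eigvals).max()
    if n_components is not None:
        keep[n_components:] = False
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


# Demo data (an assumption for illustration): 20 random points in 3 dimensions.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(20, 3))
demo_distance = np.linalg.norm(demo_points[:, None] - demo_points[None], axis=-1)

recovered = classical_mds(demo_distance)
recovered_distance = np.linalg.norm(recovered[:, None] - recovered[None], axis=-1)
print(recovered.shape)
```

The recovered coordinates could then be passed to UMAP with an ordinary metric instead of metric="precomputed", although for a 5843 x 5843 matrix the dense eigendecomposition may be slow.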

Vykintasj avatar Aug 27 '19 15:08 Vykintasj

Hello,

Is this issue being looked into at all? With the new HDBSCAN algorithm being added to scikit-learn, along with its upcoming medoid/centroid features, I hope somebody can help solve this issue.

charliemmm avatar Aug 07 '23 15:08 charliemmm