Metric = "precomputed" is not implemented

Open rach226a opened this issue 5 years ago • 6 comments

I would like to run uwot::umap() with metric = 'pearson'. However, 'pearson' is not an option within this package, and I got the following error:

Error in match.arg(metric, c("euclidean", "cosine", "manhattan", "hamming", : 'arg' should be one of “euclidean”, “cosine”, “manhattan”, “hamming”, “precomputed”

This error suggests that I can use a "precomputed" distance matrix. So I tried to run uwot::umap() with metric = 'precomputed' and got the following error:

Error in create_ann(metric, nc) : BUG: unknown Annoy metric 'precomputed'

This error suggests precomputed is not implemented within this package.

PS. The original umap package allows for metric = 'pearson'. It would be nice to see this added to this package!

rach226a avatar Feb 25 '19 16:02 rach226a

metric = "precomputed" is for use with nearest neighbor data, so it requires a list of two matrices, the nearest neighbor indices and the distances. There are some details at https://github.com/jlmelville/uwot#nearest-neighbor-data-format.

It may be the case that uwot can already do what you want. If you have created a full distance matrix yourself, then if you convert it to a dist object, you can pass it directly to the X parameter of uwot without specifying metric = "precomputed", e.g.:

iris_dist <- dist(iris[, -5])
iris_umap <- umap(iris_dist)
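
For a metric that uwot does not offer directly, the same route should work with any dist object. The following is a minimal sketch, assuming Pearson correlation distance is defined as 1 - r between rows; the variable names are illustrative, and the uwot call is guarded so the base R part runs on its own:

```r
# Rows are observations, so correlate the rows by transposing first.
X <- as.matrix(iris[, -5])

# Pearson correlation distance: 1 - r. pmax() clamps tiny negative values
# that floating point error can produce for perfectly correlated rows.
pearson_dm <- pmax(1 - cor(t(X)), 0)

# Convert the square symmetric matrix to a dist object.
iris_pearson <- as.dist(pearson_dm)

# Pass the dist object as X, without setting metric (assumes uwot is installed).
if (requireNamespace("uwot", quietly = TRUE)) {
  iris_pearson_umap <- uwot::umap(iris_pearson)
}
```

Note that a full distance matrix grows as N^2, so this approach is only practical for small to moderate data sets.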

I do see that metric = "precomputed" causes an error in the above case, so I will fix that. If you can provide an example of the input you were trying to use, I will try to improve the error reporting for this code path.

Thank you for the suggestion about other metrics and the vote for Pearson. I would also like to see more, but uwot relies on the metrics that Annoy supports. It's possible that I will get more of the neighbor search part of PyNNDescent implemented in R and then more metrics will be available.

jlmelville avatar Feb 25 '19 16:02 jlmelville

Thank you for the suggestions. I have successfully run uwot::umap() with Pearson correlation via nn_method = list(idx = index_matrix, dist = dist_matrix) and via uwot::umap(dist(dist_matrix), metric = "precomputed"). My dist_matrix and index_matrix were created with Pearson correlation. Unfortunately, I wanted to do metric learning, which isn't possible through this implementation.

rach226a avatar Feb 25 '19 20:02 rach226a

Although I suspect that this is way too late for @rach226a's purposes, I am temporarily re-opening to note that:

  1. #64 now allows for transforming new data with precomputed nearest neighbor data, and metric learning works as part of that:

    devtools::install_github("jlmelville/vizier")
    devtools::install_github("jlmelville/snedata")
    fashion <- snedata::download_fashion_mnist()
    fashion_train <- head(fashion, 60000)
    fashion_test <- tail(fashion, 10000)
    
    # calculate the nearest neighbors outside of uwot (pretend the function isn't the implementation in uwot)
    fashion_train.nn <- uwot:::annoy_nn(X = as.matrix(fashion_train[, 1:784]), k = 15, metric = "cosine", ret_index = TRUE)
    # return umap map with annoy_nn input
    set.seed(1337)
    fashion_umap <- uwot::umap(X = NULL, nn_method = fashion_train.nn, ret_model = TRUE, y = fashion_train$Label)
    
    # compute the query-reference annoy_nn
    query_ref.nn <- uwot:::annoy_search(X = as.matrix(fashion_test[, 1:784]), k = 15, ann = fashion_train.nn$index)
    
    # use the query-reference annoy_nn to transform query to reference
    fashion_umap_test <- uwot::umap_transform(X = NULL, model = fashion_umap, nn_method = query_ref.nn)
    
    vizier::embed_plot(fashion_umap$embedding, fashion_train, cex = 0.5, title = "Fashion UMAP", alpha_scale = 0.075)
    vizier::embed_plot(fashion_umap_test, fashion_test, cex = 0.5, title = "Fashion Test UMAP", alpha_scale = 0.075)
    
  2. Pearson correlation distance is the same as using cosine distance with each row normalized to zero mean, so it's already available in uwot at the cost of a little work up front:

    devtools::install_github("jlmelville/vizier")
    devtools::install_github("jlmelville/snedata")
    fashion <- snedata::download_fashion_mnist()
    fashion_train <- head(fashion, 60000)
    fashion_test <- tail(fashion, 10000)
    
    # subtract mean from each row
    fashion_trainm <- as.matrix(fashion_train[, 1:784])
    fashion_trainm <- fashion_trainm - apply(fashion_trainm, 1, mean)
    fashion_testm <- as.matrix(fashion_test[, 1:784])
    fashion_testm <- fashion_testm - apply(fashion_testm, 1, mean)
    
    fashion_umap <- uwot::umap(fashion_trainm, metric = "cosine", ret_model = TRUE, y = fashion_train$Label, verbose = TRUE)
    fashion_umap_test <- uwot::umap_transform(fashion_testm, model = fashion_umap)
    
    vizier::embed_plot(fashion_umap$embedding, fashion_train, cex = 0.5, title = "Fashion UMAP (Correlation)", alpha_scale = 0.075)
    vizier::embed_plot(fashion_umap_test, fashion_test, cex = 0.5, title = "Fashion Test UMAP (Correlation)", alpha_scale = 0.075)
    

But it would be better for uwot to do this work internally, and add a metric = "correlation" option.
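
The equivalence in point 2 can be checked numerically in base R. This is a small sketch on random data (the matrix X and the variable names are made up for illustration), comparing 1 - r against the cosine distance of row-centered data:

```r
set.seed(42)
X <- matrix(rnorm(5 * 8), nrow = 5)  # 5 observations, 8 features

# Pearson correlation distance between rows
pearson_dist <- 1 - cor(t(X))

# Cosine distance between rows after subtracting each row's mean
Xc <- X - rowMeans(X)
norms <- sqrt(rowSums(Xc^2))
cosine_dist <- 1 - tcrossprod(Xc) / (norms %o% norms)

# The two matrices agree up to floating point tolerance
all.equal(pearson_dist, cosine_dist)
```

Centering each row makes the dot product of two rows proportional to their covariance, and dividing by the row norms turns that into the correlation coefficient.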

jlmelville avatar Jul 19 '20 21:07 jlmelville

I have a distance matrix calculated with a non-supported metric (earth mover distance). How can I get it into the required format? I tried str(fashion_train.nn) from your first example in an attempt to reverse-engineer the format, but it is complex enough so that it's not obvious what is required to move from a square symmetric matrix to that format. Thanks in advance for any help.

dkatztibco avatar May 20 '21 18:05 dkatztibco

To carry out UMAP successfully, your NN data should be in the form of a list consisting of two N x k matrices, where N is the number of points in the data set and k is the number of nearest neighbors. Matrix idx contains the indices of the neighbors of point i in row i. Matrix dist contains the equivalent distances.

If you have a full dense N x N distance matrix, then there is an internal function you can use, uwot:::dist_nn, that will carry out the conversion for you, e.g.:

iris10 <- as.matrix(iris[1:10, -5])
iris10_dm <- as.matrix(dist(iris10))
# get 4 nearest neighbors
iris10_nn <- uwot:::dist_nn(iris10_dm, k = 4)
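
Because uwot:::dist_nn is internal and could change, here is a rough base R sketch of the same kind of conversion. The helper dist_matrix_to_nn is hypothetical (not part of uwot), and it assumes that each point counts as its own first neighbor with distance 0, which is how I understand uwot's nearest neighbor format:

```r
# Hypothetical helper: build the idx/dist list from a full distance matrix.
dist_matrix_to_nn <- function(dm, k) {
  dm <- as.matrix(dm)
  n <- nrow(dm)
  stopifnot(ncol(dm) == n, k <= n)
  idx <- matrix(0L, nrow = n, ncol = k)
  dst <- matrix(0, nrow = n, ncol = k)
  for (i in seq_len(n)) {
    # Sort the other points by distance, then put point i itself first
    # (assumption: the point is its own nearest neighbor).
    others <- setdiff(order(dm[i, ]), i)
    nbrs <- c(i, others)[seq_len(k)]
    idx[i, ] <- nbrs
    dst[i, ] <- dm[i, nbrs]
  }
  list(idx = idx, dist = dst)
}

iris10_dm <- as.matrix(dist(iris[1:10, -5]))
iris10_nn2 <- dist_matrix_to_nn(iris10_dm, k = 4)
str(iris10_nn2)
```

The resulting list can then be passed as nn_method, as in uwot::umap(X = NULL, nn_method = iris10_nn2). The loop is O(N^2 log N), which is fine for a matrix that already fits in memory.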

jlmelville avatar May 21 '21 14:05 jlmelville

Thanks!

David Katz, TIBCO Data Science

dkatztibco avatar May 21 '21 18:05 dkatztibco

Using precomputed nearest neighbors is covered at https://jlmelville.github.io/uwot/articles/hnsw-umap.html and https://jlmelville.github.io/uwot/articles/rnndescent-umap.html. Pearson correlation is now supported with metric = "correlation".

jlmelville avatar Mar 18 '24 02:03 jlmelville