UMAP.jl Arbitrary data support

Arbitrary data support

Open ericphanson opened this issue 4 years ago • 10 comments

It would be nice if UMAP.jl didn't require the input to be formatted as X (a column-major matrix of shape (n_features, n_samples)). For example, I have samples that I store as structs with some metadata and a data field, which is stored as a matrix itself. Then I define my distance (a subtype of Distances.SemiMetric) between these structs by grabbing their data fields and calculating a distance measure. It would be nice to be able to just pass this vector of samples and my distance to UMAP. In fact, NearestNeighborDescent already supports this (as long as you also define Distances.result_type(::MyMetric, a, b) = Float32 or such), so I can get it working with UMAP just by removing some type annotations. It seems like UMAP only needs to know the number of original features to check that the output dimension is smaller, and they don't actually need to be formatted as a matrix.

(Of course, I could vectorize all the data matrices and hcat them into a big matrix, and then define my distance measure to reshape them, etc, but it seems unnecessary and complicated).

Does this seem like a reasonable feature for UMAP.jl?

May 15 '20 17:05 ericphanson

Yes, it's a very reasonable feature to have. You're right that UMAP.jl is unnecessarily type constrained, and I have plans in progress for refactoring it to be much more generic - one goal is to support datasets with arbitrary objects.

It should be ready at least to test out in a few weeks, I can ping you once it is if you like.

May 15 '20 17:05 dillondaudert

Sounds great! Thanks for the great package, by the way; I am new to dimensionality reduction but it was pretty easy to get some plots quickly with UMAP.jl and the NearestNeighborDescent.jl :).

May 15 '20 18:05 ericphanson

Just wanted to leave an update that I am still working on this, though it's taking me longer than I initially thought. Turns out the refactoring I had in mind is fairly substantial...

If you're interested, you can take a look at the Wiki pages where there's a high level draft of where things are going.

Jun 09 '20 17:06 dillondaudert

Cool, thanks for the update! Using Tables.jl is an interesting idea. Just to check that I understand: in the case in which I have say a collection of vectors + labels, I would have a 2-column table, right? Not a n+1-column table (where n is the length of the vector).

Unrelatedly, it would be nice if one could get reuse the computations to do KNN searches against the data. Locally, I've hacked around this by doing stuff like

const KNN_MATICES_TYPE = Tuple{Array{Int64,2},Array{Float32,2}}
const KNN_MATRICES = Ref{Union{KNN_MATICES_TYPE, Nothing}}(nothing)

# Avoid recomputing KNN by splicing in our precomputed knn matrices
function UMAP.knn_search(X,
        k,
        metric::MyMetric)
     return KNN_MATRICES[]
end

graph = nndescent(data, n_neighbors, metric) 
KNN_MATRICES[] = NearestNeighborDescent.knn_matrices(graph)

so then I get graph to do KNN searches against, and can also compute a UMAP from there.

It might be nice if the KNN graph could get stored in UMAP for reuse later. I can see how this is tricky from a design point of view because for precomputed distance matrices (or less than 4k points), nndescent is never called. But still I thought I'd mention it while your thinking about designs.

Jun 10 '20 14:06 ericphanson

Yes, the easiest way to handle your example situation is to have a 2-column table, one for the feature vectors and another for the labels in this case. That's the most straightforward way to associate vectors (/points) to metrics, both on the implementation side and in the API. Another way to do it would be to have the user provide a map like metric -> [columns...], then you could support both input representations. I decided to put off supporting that for now since it's quite a bit more complicated.

As for your other question, could you clarify a bit what functionality you're after? Is it that you want to be able to pass a KNNGraph to UMAP (instead of an entire precomputed distance matrix, which is currently supported)? As for storing the knns, they aren't stored as a KNNGraph struct since as you mentioned nndescent isn't always used, but the knn_matrices are stored as of v0.1.6 as UMAP_.knns and UMAP_.dists and this will also be supported moving forward.

Jun 10 '20 14:06 dillondaudert

As for your other question, could you clarify a bit what functionality you're after? Is it that you want to be able to pass a KNNGraph to UMAP (instead of an entire precomputed distance matrix, which is currently supported)? As for storing the knns, they aren't stored as a KNNGraph struct since as you mentioned nndescent isn't always used, but the knn_matrices are stored as of v0.1.6 as UMAP_.knns and UMAP_.dists and this will also be supported moving forward.

I think I was a bit confused since knn_matrices has all the info I would need anyway, so there's no need for an API change to support what I was asking. I think passing a KNNGraph to UMAP instead of a precomputed distance metric would make sense though too.

It seems like there are 3 steps:

From a table of data with n columns, compute n KNN graphs
Compute the fuzzy simplicial complexes and intersect them
Project down and optimize

And it might make sense to have an API to allow starting from step 2 instead of step 1, since you might have those from elsewhere.

Am I right in thinking step 1 is the only place we need a metric?

Jun 10 '20 16:06 ericphanson

Yes, the metrics on the input data only apply to step 1, but they apply at both the fit and transform steps. If you pass in a view of your dataset as precomputed distances when fitting, you'll also have to pass precomputed distances for that same view of any data you want to transform.

Jun 10 '20 19:06 dillondaudert

As an example of the direction things are heading, a verbose API might look like the following...

# we can have any number of these (view_data, view_knn_params) pairs ...
# one view of the data consists points in R^d with the Euclidean metric
view_1_data =[rand(d) for _ in 1:N]
view_1_knn_params = DescentNeighbors(n_neighbors=k, metric=Euclidean(), kwargs=...)

# a second view consists of a precomputed distance matrix
view_2_knn_params = PrecomputedNeighbors(n_neighbors=k, dists=my_dists)

# call fit, passing in named tuples (which are Tables)
umap_result = fit(data=(view_1=view_1_data, view_2=nothing), knn_params=(view_1=view_1_knn_params, view_2=view_2_knn_params), other_params...)

A more user-friendly API can be built on top of this.

Jun 10 '20 19:06 dillondaudert

Cool, looks good to me!

Jun 10 '20 20:06 ericphanson

I saw http://www.cs.cornell.edu/~arb/papers/GLANCE-WWW-2020.pdf today (from https://twitter.com/nassarhuda/status/1319675786085978114?s=21) and it seemed like an interesting way to visualize graphs. As I understand it, UMAP has three steps: calculate knns, construct a weighted graph, project it down. As you know, they have a specific reason for using the projection they do (trying to find a low dimensional structure with a similar topological structure) but it seems like it could still be interesting to try other methods too.

So just as I think it’s worthwhile to allow separate construction of the knns, I think it could be worthwhile to allow separate construction of the projection.

UMAP.jl UMAP.jl copied to clipboard

Arbitrary data support

UMAP.jl
UMAP.jl copied to clipboard