High dimensionality of data vs. parallelism
I have about 2 million data points with 768 dimensions. So far I've been testing on a subset of the data (up to 100,000 points), but I can see that the running time is growing fast.
While dimensionality reduction is one option, I was wondering whether parallelism can help here?
As I understand it, parallelism is only used for calculating core distances, which will take longer if we have more dimensions. But what happens next, after the algorithm switches to one core/thread and does its "stuff"? Does the dimensionality still impact the speed, or is the algorithm working from the core distances at that point, so it doesn't matter?
The next step is constructing a minimal spanning tree under the mutual reachability distance. This can be done in a couple of ways: the first uses spatial indexing structures (kd-trees or ball-trees) to provide fast nearest neighbor lookups for a Boruvka-based MST algorithm, which can avoid a significant number of distance computations; the second uses a Prim's-based MST algorithm on effectively the full all-pairs distance matrix (although it is calculated on an as-needed basis, so it is never all in memory at once). The first option is by far the best, but kd-trees and ball-trees do not scale well beyond 50 dimensions or so, so for 768-dimensional data it is necessary to fall back to the Prim's method -- this is vastly more computationally expensive (O(N^2) vs O(N log N)).

In terms of parallelism: there are ways to build parallel versions of Prim's, but they are decidedly non-trivial. Prim's proceeds by steadily growing the tree from a single starting point, adding the best edge to the current tree at each step (fundamentally a serial process); parallel Prim's uses multiple starts in parallel, but then needs tricky operations to merge the trees together as they meet. You could try coding this up based on the Prim's implementation in hdbscan, but it would require a non-trivial amount of work (which is the reason it was not implemented in the first place).
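To make the serial structure concrete, here is a rough NumPy sketch of a dense Prim's-style MST under mutual reachability distance. This is purely illustrative (the function and variable names are invented for the example, and it is not the actual implementation inside hdbscan), but it shows where the O(N^2) cost and the serial dependency come from.

```python
import numpy as np

def prims_mst_mutual_reachability(X, core_distances):
    """Illustrative sketch: dense Prim's MST under mutual reachability distance.

    X              : (N, D) data array
    core_distances : (N,) precomputed core distances
    Returns a list of (parent, child, weight) edges.
    """
    n = X.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    best_dist = np.full(n, np.inf)          # best known edge weight into the tree
    best_from = np.full(n, -1, dtype=np.int64)
    edges = []

    current = 0
    in_tree[current] = True
    for _ in range(n - 1):
        # Distances from the newly added point to all points: O(N * D) work
        d = np.linalg.norm(X - X[current], axis=1)
        # Mutual reachability distance: max(core(a), core(b), d(a, b))
        mreach = np.maximum(np.maximum(core_distances[current], core_distances), d)
        # Relax candidate edges for points not yet in the tree
        improved = (~in_tree) & (mreach < best_dist)
        best_dist[improved] = mreach[improved]
        best_from[improved] = current
        # Pick the globally closest point outside the tree -- this global argmin,
        # which depends on everything added so far, is what makes Prim's serial
        nxt = int(np.argmin(np.where(in_tree, np.inf, best_dist)))
        edges.append((int(best_from[nxt]), nxt, float(best_dist[nxt])))
        in_tree[nxt] = True
        current = nxt
    return edges
```

Note that each of the N-1 iterations recomputes distances from the newly added point to every other point in the original space, so the dimensionality continues to matter even after the parallel core-distance step, and the global argmin that selects the next point is the part that is hard to parallelize.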
In terms of performance, you are far better off getting the data dimensionality down to something that allows the Boruvka/spatial-indexing version to run. UMAP may be a good choice here. It is worth keeping in mind that density is a tricky concept in very high dimensional spaces, and it is not at all what your intuitions would first expect. In general, density-based clustering in very high dimensional spaces is not advisable without a lot of data (i.e. exponentially more data as the number of dimensions grows -- I don't believe it will be particularly tractable for 768 dimensions). The better approach is to hope there is an underlying structure of lower intrinsic dimension, and to try to extract that first. So realistically, from a purely theoretical perspective on the utility of density-based methods in high dimensions, you likely want to use dimension reduction anyway to get meaningful clusters out.
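For concreteness, a minimal sketch of that kind of pipeline might look like the following: UMAP brings the data down to a dimensionality where the Boruvka/kd-tree path applies, and HDBSCAN then clusters the embedding. The parameter values are illustrative placeholders rather than recommendations, and `X` stands for your (2,000,000 x 768) array.

```python
import umap
import hdbscan

# Reduce to a modest number of dimensions so spatial indexing stays effective.
embedding = umap.UMAP(
    n_components=10,   # well under the ~50-dimension limit for kd-trees/ball-trees
    n_neighbors=30,    # larger values emphasise more global structure
    min_dist=0.0,      # tighter packing tends to suit density-based clustering
).fit_transform(X)     # X: your (n_samples, 768) data array

# Cluster the low-dimensional embedding; core distances are computed in parallel.
labels = hdbscan.HDBSCAN(
    min_cluster_size=100,   # illustrative; tune to your data
    core_dist_n_jobs=-1,    # use all cores for the core-distance computation
).fit_predict(embedding)
```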