UMAP instead of PCA
Hi!
I was playing a bit with spikecomponents, and I was wondering if someone was already working on it, or if it could be worth it that I add a clustering class (https://github.com/SpikeInterface/spikeinterface/blob/main/src/spikeinterface/sortingcomponents/clustering/method_list.py), with UMAP as a dimensionality reduction method. Indeed, this thesis shows that UMAP is promising https://dalspace.library.dal.ca/handle/10222/83717 if we manage to compute it efficiently.
Best,
Florent
@samuelgarcia @yger any opinions on this?
UMAP could easily be added in the sortingcomponents framework, as an additionnal step to project the waveforms. The problem with UMAP is that this is rather slow for large number of spikes/features, and this might also be highly dependent on the parameters. But I'll read the ref, and it has been shown to be usefu, then clearly this must be integrated!
Yes you are right it can be rather slow, but I was recently checking some reimplementation on GPU which is much faster for large number of samples (https://github.com/berenslab/contrastive-ne or https://github.com/rapidsai/cuml ...)
Regarding the sensitivity to the parameters, I was not aware that UMAP is very dependent on parameters... Could you elaborate please? Thanks!
Just chiming in since I've done related research but yes UMAP tends to be highly variable with respect to several parameters like min_dist or n_neighbors. UMAP is also stochastic by default unless a random seed is set. However, a distinction needs to be made in UMAP as a graph construction algorithm (which uses a locally-varying distance metric) and
UMAP as a projection algorithm (a force-directed graph layout procedure). When people speak of UMAP being "variable", they typically mean the latter. However, I've shown in previous work that UMAP tends to be much more consistent in terms of the former (its graph construction and clustering upon this) than, say, a clustering algorithm applied to PCA capturing most of the variance. For the spike sorting problem in particular, this work shows UMAP's utility. In short, clustering on the high-dimensional graph is more consistent than clustering on the embedding in many cases because the former doesn't suffer from the additional stochasticity/variability induced by the embedding procedure.
However, I would like to point out that (afaik) it has not been robustly demonstrated that UMAP's graph constructions are empirically superior to other methods such as simple nearest neighbor graphs (in which the local distance metric does not vary) for the spike sorting problem in particular. It does deserve mention that Kilosort4, which seems to be a large improvement over previous versions, does indeed use a nearest neighbor graph construction with hierarchical graph clustering (called "community detection").
This all to say that something like a Leiden clustering on a UMAP graph construction will likely work well.