spikeinterface icon indicating copy to clipboard operation
spikeinterface copied to clipboard

UMAP instead of PCA

Open florian6973 opened this issue 1 year ago • 4 comments

Hi!

I was playing a bit with spikecomponents, and I was wondering if someone was already working on it, or if it could be worth it that I add a clustering class (https://github.com/SpikeInterface/spikeinterface/blob/main/src/spikeinterface/sortingcomponents/clustering/method_list.py), with UMAP as a dimensionality reduction method. Indeed, this thesis shows that UMAP is promising https://dalspace.library.dal.ca/handle/10222/83717 if we manage to compute it efficiently.

Best,

Florent

florian6973 avatar Jul 01 '24 20:07 florian6973

@samuelgarcia @yger any opinions on this?

zm711 avatar Jul 02 '24 12:07 zm711

UMAP could easily be added in the sortingcomponents framework, as an additionnal step to project the waveforms. The problem with UMAP is that this is rather slow for large number of spikes/features, and this might also be highly dependent on the parameters. But I'll read the ref, and it has been shown to be usefu, then clearly this must be integrated!

yger avatar Jul 02 '24 12:07 yger

Yes you are right it can be rather slow, but I was recently checking some reimplementation on GPU which is much faster for large number of samples (https://github.com/berenslab/contrastive-ne or https://github.com/rapidsai/cuml ...)

Regarding the sensitivity to the parameters, I was not aware that UMAP is very dependent on parameters... Could you elaborate please? Thanks!

florian6973 avatar Jul 02 '24 20:07 florian6973

Just chiming in since I've done related research but yes UMAP tends to be highly variable with respect to several parameters like min_dist or n_neighbors. UMAP is also stochastic by default unless a random seed is set. However, a distinction needs to be made in UMAP as a graph construction algorithm (which uses a locally-varying distance metric) and UMAP as a projection algorithm (a force-directed graph layout procedure). When people speak of UMAP being "variable", they typically mean the latter. However, I've shown in previous work that UMAP tends to be much more consistent in terms of the former (its graph construction and clustering upon this) than, say, a clustering algorithm applied to PCA capturing most of the variance. For the spike sorting problem in particular, this work shows UMAP's utility. In short, clustering on the high-dimensional graph is more consistent than clustering on the embedding in many cases because the former doesn't suffer from the additional stochasticity/variability induced by the embedding procedure.

However, I would like to point out that (afaik) it has not been robustly demonstrated that UMAP's graph constructions are empirically superior to other methods such as simple nearest neighbor graphs (in which the local distance metric does not vary) for the spike sorting problem in particular. It does deserve mention that Kilosort4, which seems to be a large improvement over previous versions, does indeed use a nearest neighbor graph construction with hierarchical graph clustering (called "community detection").

This all to say that something like a Leiden clustering on a UMAP graph construction will likely work well.

EricKenjiLee avatar Oct 08 '24 21:10 EricKenjiLee