unsupervised_analysis
A general-purpose Snakemake workflow and MrBiomics module for unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.
Test it on, e.g., pca.py. **Dask**: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across...
Define "too large": e.g., >10,000 samples/cells? Ideas: - for large data (define "too large"?), do not plot heatmaps of features and data; instead, determine distance matrices and show those...
- idea: a barplot ordered by number of clusters within each clustering - [ ] research alternatives that are common in the field
Determine metrics at every iteration and plot their time course at the end, at least for the stopping criterion (max. edge weight), but maybe also for F1 score and accuracy,....
New mini release highlighting bug fixes and adaptation to large (120k x 28k) & complex (342 groups of interest/labels) data - [ ] #36 - [ ] #37 - [...
Significance analysis for clustering with single-cell RNA-sequencing data https://www.nature.com/articles/s41592-023-01933-9
- Current implementation (clusterCrit) is fast on its own but does not reuse distance matrices that could be determined only once. - Only the Euclidean metric is supported, extension to support...
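The idea of computing the distance matrix once and reusing it across clusterings can be sketched as follows. This is a minimal illustration, not the workflow's implementation: it swaps in scikit-learn's `silhouette_score` with `metric="precomputed"` in place of clusterCrit, and the toy data and the names `X`, `D`, `clusterings`, and `scores` are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Toy data (assumption): two well-separated Gaussian blobs, 50 samples each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(6, 1, (50, 5))])

# Compute the pairwise distance matrix a single time.
# Any metric supported by pairwise_distances could be used here.
D = pairwise_distances(X, metric="euclidean")
np.fill_diagonal(D, 0.0)  # guard against tiny non-zero self-distances

# Hypothetical clustering results to be compared.
clusterings = {
    "two_blobs": np.repeat([0, 1], 50),
    "random": rng.integers(0, 2, size=100),
}

# Every index reuses D instead of recomputing distances from X.
scores = {
    name: silhouette_score(D, labels, metric="precomputed")
    for name, labels in clusterings.items()
}
```

Because the matrix is passed as precomputed, switching to a non-Euclidean metric only changes the `pairwise_distances` call. For very large data the full n x n matrix may not fit in memory, so this approach presumably needs subsampling or chunking.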
- consider Variation of Information (VI) and Split/Join: https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information
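For reference, Variation of Information between two clusterings A and B is VI(A, B) = H(A|B) + H(B|A) = H(A) + H(B) - 2 I(A; B). A minimal pure-Python sketch is given below; the function name `variation_of_information` is hypothetical, and the result is in nats.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats for two label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    count_a = Counter(labels_a)                  # cluster sizes in A
    count_b = Counter(labels_b)                  # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))  # contingency table

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Contribution of this cell to H(B|A) + H(A|B).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

VI is 0 for identical partitions (regardless of label names), symmetric, and bounded above by log(n), which makes it a true metric on the space of clusterings.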
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.trustworthiness.html#sklearn.manifold.trustworthiness Determine (if computationally feasible) trustworthiness for every embedding and provide it in the results.
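A minimal sketch of how trustworthiness could be computed per embedding with `sklearn.manifold.trustworthiness`. The helper `embedding_trustworthiness`, its `max_samples` subsampling (a guess at how to keep the pairwise computation feasible on large data), and the toy data are assumptions for illustration, not part of the workflow.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness


def embedding_trustworthiness(X, embedding, n_neighbors=15, max_samples=5000, seed=0):
    """Trustworthiness of `embedding` w.r.t. `X`, on a random subsample if needed."""
    n = X.shape[0]
    if n > max_samples:
        # Subsample rows so the pairwise distance computation stays tractable.
        idx = np.random.default_rng(seed).choice(n, size=max_samples, replace=False)
        X, embedding = X[idx], embedding[idx]
    return trustworthiness(X, embedding, n_neighbors=n_neighbors)


# Toy data (assumption): 200 samples, 20 features, embedded with PCA.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
embedding = PCA(n_components=2, random_state=0).fit_transform(X)

score = embedding_trustworthiness(X, embedding)
identity_score = embedding_trustworthiness(X, X)  # perfect neighborhood preservation
```

The score lies in [0, 1], with 1 meaning that local neighborhoods are fully preserved. Note that scikit-learn requires `n_neighbors` to be smaller than half the number of samples.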