t-SNE optimization using scikit-learn-intelex
This pull request accelerates t-SNE using the scikit-learn-intelex library, yielding roughly a 10x runtime improvement for the package's t-SNE implementation on the example below. The experiment was run on an AWS r7i.24xlarge instance.
```python
import time
import numpy as np
import pandas as pd
import scanpy as sc
from sklearn.cluster import KMeans
import os
import wget
import warnings

warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')

input_file = "./1M_brain_cells_10X.sparse.h5ad"
if not os.path.exists(input_file):
    print('Downloading input file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad', input_file)

# marker genes
MITO_GENE_PREFIX = "mt-"  # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"]  # Marker genes for visualization
# filtering cells
min_genes_per_cell = 200   # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000  # Filter out cells with more genes than this expressed
# filtering genes
min_cells_per_gene = 1  # Filter out genes expressed in fewer cells than this
n_top_genes = 4000      # Number of highly variable genes to retain
# PCA
n_components = 50  # Number of principal components to compute
# t-SNE
tsne_n_pcs = 20  # Number of principal components to use for t-SNE
# k-means
k = 35  # Number of clusters for k-means
# Gene ranking
ranking_n_top_genes = 50  # Number of differential genes to compute for each cluster
# Number of parallel jobs
sc._settings.ScanpyConfig.n_jobs = os.cpu_count()

start = time.time()
tr = time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape
print("Total read time : %s" % (time.time() - tr))

tr = time.time()
# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape
sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter and normalize time : %s" % (time.time() - tr))

tr = time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time() - tr))

# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")
# Retain marker gene expression
for marker in markers:
    adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()
# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

ts = time.time()
# Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
n_counts = np.array(adata.X.sum(axis=1))
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time() - ts))

# scale
ts = time.time()
sc.pp.scale(adata, max_value=10)
print("Total scale time : %s" % (time.time() - ts))

t0 = time.time()
sc.tl.pca(adata, n_comps=n_components)
t0 = time.time()
sc._settings.ScanpyConfig.n_jobs = os.cpu_count()
sc.tl.tsne(adata, n_pcs=tsne_n_pcs, use_fast_tsne=True)
'''
from sklearn.manifold import TSNE
from scanpy.tools._utils import _choose_representation
X = _choose_representation(adata, n_pcs=tsne_n_pcs)
X_tsne = TSNE().fit_transform(X.astype(np.float32))
adata.obsm['X_tsne'] = X_tsne
'''
print("Tsne time:", time.time() - t0)
```
Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 75.19%. Comparing base (3ba3f46) to head (d8d4763).

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3061      +/-   ##
==========================================
- Coverage   75.87%   75.19%   -0.68%
==========================================
  Files         110      110
  Lines       12533    12536       +3
==========================================
- Hits         9509     9427      -82
- Misses       3024     3109      +85
```

| Files | Coverage Δ | |
|---|---|---|
| scanpy/tools/_tsne.py | 68.08% <66.66%> (-25.10%) | :arrow_down: |
For the above code, the time spent in t-SNE went down from 2252 seconds to 210 seconds due to this PR.
What’s the comparison to MulticoreTSNE?
Defaults
It would probably make sense to use a `flavor: Literal['auto', 'sklearn', 'intelex', 'multicore'] = 'auto'` parameter here, where `'auto'` would try to import the speedup packages one by one and use the preferred one.
`use_fast_tsne` could be deprecated and made to default to `None`, with this logic (too bad we can't use `match` yet):
```python
if use_fast_tsne is not None:
    warnings.warn("...", FutureWarning)
match (use_fast_tsne, flavor):
    case (None, 'auto'): ...   # try importing 'intelex', fall back to 'sklearn'
    case (None, _): ...        # use specified flavor
    case (True, 'auto'): ...   # use 'multicore'
    case (True, 'sklearn'): ...  # throw error
    case (True, _): ...        # use specified flavor
    case (False, 'auto' | 'sklearn'): ...  # use 'sklearn'
    case (False, _): ...       # throw error
    case _:
        raise AssertionError()
```
In the future, we can change `'auto'` to try both intelex and MulticoreTSNE.
@ilan-gold what do you think?
This seems reasonable to me @flying-sheep. Does using the patched version change results over the unpatched, @ashish615, i.e., for a given random seed, are unpatched and patched the same? If the two are the same for a given seed/state, then I think what @flying-sheep is proposing could be done separately (even if we make the dependency optional IMO). However, if the new version does change results, we will need the handling that @flying-sheep describes.
> What’s the comparison to MulticoreTSNE?

This comparison is against the run using the `use_fast_tsne` flag, which uses MulticoreTSNE.
> This seems reasonable to me @flying-sheep. Does using the patched version change results over the unpatched @ashish615 i.e., for a given random seed, unpatched and patched are the same?
@ilan-gold , I didn't check that. I will check that and let you guys know.
These t-SNE optimizations are mentioned in the following paper. Adding it here for reference: N. Chaudhary, A. Pivovar, P. Yakovlev, A. Gorshkov et al., "Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs", arXiv preprint arXiv:2212.11506, 2022. https://arxiv.org/abs/2212.11506
@narendrachaudhary51 You can add the reference here https://github.com/scverse/scanpy/blob/main/docs/references.bib (thank you @flying-sheep for improving this!)
Hi @flying-sheep @ilan-gold , Based on our previous discussion, we observed that applying and then removing a patch while fixing the seed causes the t-SNE output to change. In our experiment, we used 1.3 million data points to run t-SNE and compared the results of the patched and unpatched versions by examining the KL Divergence from both runs. The results are summarized in the table below.
In the above code, set `USE_FIRST_N_CELLS` to choose the number of records, and use `sc.tl.tsne(adata, n_pcs=tsne_n_pcs, use_fast_tsne=False)` to run the optimized version with the latest commit. You can get the KL divergence numbers by logging `kl_divergence_`.
As the KL divergence values in the above table show, while the output of the Intel-optimized t-SNE is different, it is equivalent in quality.
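One way to log `kl_divergence_` is to run scikit-learn's `TSNE` directly, mirroring the commented-out snippet in the benchmark script; this is a sketch, since as far as I can tell `sc.tl.tsne` does not expose the fitted estimator, and the random data here is only a stand-in for the real PCA representation:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data; in the real comparison X would be the PCA representation
# of the dataset (e.g. adata.obsm['X_pca'][:, :tsne_n_pcs]).
X = np.random.RandomState(0).rand(300, 20).astype(np.float32)

tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)
# Final KL divergence of the embedding; compare this value between the
# patched and unpatched runs with the same random_state.
print("KL divergence:", tsne.kl_divergence_)
```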
@sanchit-misra @ashish615 one thing we wondered: why does this need to be in scanpy? Can't users turn this on/off outside scanpy?
Notwithstanding the above question, since your method produces different results than the default, we will need to adopt the conventions outlined: https://github.com/scverse/scanpy/pull/3061#issuecomment-2114783668
This will ensure continued reproducibility with defaults.
Hi @ilan-gold,
Regarding your thought in this comment, we can enable or disable the Intel optimization from outside scanpy. However, users might not be aware of how to do so. If we instead add it to scanpy directly, all scanpy users will know the option is available.
If we agree on the option discussed in this comment, I can proceed with updating the t-SNE file.