scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

Speedup (~20x) of scanpy.pp.regress_out function using Linear Least Square method.

Open ashish615 opened this issue 1 year ago • 11 comments

  • [ ] Closes #
  • [ ] Tests included or not required because:
  • [ ] Release notes not necessary because:

Hi, We are submitting PR for speed up of the regress_out function. Here we finding coefficient using Linear regression (Linear Least Squares) rather then GLM for non categorical data.

Time(sec)
Original 297
Updated 14.91
Speedup 19.91

experiment setup : AWS r7i.24xlarge

import time
import numpy as np

import pandas as pd

import scanpy as sc
from sklearn.cluster import KMeans

import os
import wget

import warnings



warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading import file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking

ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs
sc._settings.ScanpyConfig.n_jobs = os.cpu_count()

start=time.time()
tr=time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape
print("Total read time : %s" % (time.time()-tr))



tr=time.time()
# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape

sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter and normalize time : %s" % (time.time()-tr))


tr=time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time()-tr))


# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor = "cell_ranger")

# Retain marker gene expression
for marker in markers:
        adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()

# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

#Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
n_counts = np.array(adata.X.sum(axis=1))
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts

ts=time.time()
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time()-ts))

ashish615 avatar Jun 18 '24 06:06 ashish615

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 76.39%. Comparing base (ad657ed) to head (5e4df8c). Report is 129 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3110      +/-   ##
==========================================
+ Coverage   76.31%   76.39%   +0.08%     
==========================================
  Files         109      109              
  Lines       12513    12526      +13     
==========================================
+ Hits         9549     9569      +20     
+ Misses       2964     2957       -7     
Files with missing lines Coverage Δ
src/scanpy/preprocessing/_simple.py 88.10% <100.00%> (+0.15%) :arrow_up:

... and 1 file with indirect coverage changes

codecov[bot] avatar Jun 18 '24 06:06 codecov[bot]

Thanks for keeping them coming in! It would be helpful if you documented a bit why your newly introduced branch is faster with comments.

flying-sheep avatar Jun 18 '24 11:06 flying-sheep

Benchmark changes

Change Before [ad657edf] After [3e3ca9dc] Ratio Benchmark (Parameter)
- 404M 339M 0.84 preprocessing_log.peakmem_pca('pbmc68k_reduced')
- 1.61±0.01s 11.6±0.4ms 0.01 preprocessing_log.time_regress_out('pbmc68k_reduced')

Comparison: https://github.com/scverse/scanpy/compare/ad657edfb52e9957b9a93b3a16fc8a87852f3f09..3e3ca9dcb2e77e72b75095ff895cb55aeb7f98bc Last changed:

More details: https://github.com/scverse/scanpy/pull/3110/checks?check_run_id=26405245612

scverse-benchmark[bot] avatar Jun 18 '24 11:06 scverse-benchmark[bot]

ooh, this time the benchmark shows really nicely how much faster it is!

Ratio: 0.01 Benchmark: preprocessing_log.time_regress_out('pbmc68k_reduced')

flying-sheep avatar Jun 18 '24 12:06 flying-sheep

ooh, this time the benchmark shows really nicely how much faster it is!

Looks like preprocessing_log.time_regress_out('pbmc68k_reduced') , regress out those variables that is not inside it. It should regress_out ['n_counts', 'percent_mito'] instead of ["total_counts", "pct_counts_mt"]. For the both commit it fails so report the same time.

ashish615 avatar Jun 19 '24 06:06 ashish615

It looks really good. I would love to have a closer look next week

Intron7 avatar Jun 20 '24 14:06 Intron7

ooh, this time the benchmark shows really nicely how much faster it is!

Looks like preprocessing_log.time_regress_out('pbmc68k_reduced') , regress out those variables that is not inside it. It should regress_out ['n_counts', 'percent_mito'] instead of ["total_counts", "pct_counts_mt"]. For the both commit it fails so report the same time.

@flying-sheep , can you look at the this benchmark test?

ashish615 avatar Jun 20 '24 16:06 ashish615

Hi, everything is OK with the benchmarks, regress_out would fail if called with variables that doesn’t exist.

The reason these are named differently is here: https://github.com/scverse/scanpy/blob/ad657edfb52e9957b9a93b3a16fc8a87852f3f09/benchmarks/benchmarks/_utils.py#L27-L31

I did that to be able to run benchmarks benchmarks on multiple data sets with the same code.

flying-sheep avatar Jun 24 '24 08:06 flying-sheep

The to_dense function only works for csr matrices. I think we need another kernel that handles csc or just .T the resulting array if csc. I also get performance warnings about the matmul in the numba_kernel. I'll investigate this further.

Intron7 avatar Jul 03 '24 11:07 Intron7

This will work for csc matrix

@numba.njit(cache=True, parallel=True)
def to_dense_csc(
    shape: tuple[int, int],
    indptr: NDArray[np.integer],
    indices: NDArray[np.integer],
    data: NDArray[DT],
) -> NDArray[DT]:
    """\
    Numba kernel for np.toarray() function
    """
    X = np.empty(shape, dtype=data.dtype)

    for c in numba.prange(shape[1]):
        X[:,c] = 0
        for i in range(indptr[c], indptr[c + 1]):
            X[indices[i],c] = data[i]
    return X

ashish615 avatar Jul 03 '24 11:07 ashish615

That kernel technically works but is way slower than the default scipy operation due to inefficient memory layout.

Intron7 avatar Jul 03 '24 12:07 Intron7