Fix solo bug with large batches
Hello everyone, this is my first time contributing here, so let me know if I got something wrong with the way I approached this.
I have a very large batch in an AnnData object: 803,283 cells (the AnnData contains 10,811 genes).
After training an scVI model, I ran the following:
```python
solo = scvi.external.SOLO.from_scvi_model(scvi_model, restrict_to_batch="<batch>")
```

and got this error:
```pytb
ValueError                                Traceback (most recent call last)
Cell In[109], line 1
----> 1 doublets = x[parent_inds[:, 0]] + x[parent_inds[:, 1]]

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_csr.py:24, in _csr_base.__getitem__(self, key)
     22 def __getitem__(self, key):
     23     if self.ndim == 2:
---> 24         return super().__getitem__(key)
     26     if isinstance(key, tuple) and len(key) == 1:
     27         key = key[0]

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_index.py:83, in IndexMixin.__getitem__(self, key)
     81     return self._get_arrayXint(row, col)
     82 elif isinstance(col, slice):
---> 83     return self._get_arrayXslice(row, col)
     84 else:  # row.ndim == 2
     85     if isinstance(col, INT_TYPES):

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_csr.py:277, in _csr_base._get_arrayXslice(self, row, col)
    275     col = np.arange(*col.indices(self.shape[1]))
    276     return self._get_arrayXarray(row, col)
--> 277 return self._major_index_fancy(row)._get_submatrix(minor=col)

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_compressed.py:765, in _cs_matrix._major_index_fancy(self, idx)
    762 np.cumsum(row_nnz, out=res_indptr[1:])
    764 nnz = res_indptr[-1]
--> 765 res_indices = np.empty(nnz, dtype=idx_dtype)
    766 res_data = np.empty(nnz, dtype=self.dtype)
    767 csr_row_index(
    768     M,
    769     indices,
(...)
    774     res_data
    775 )

ValueError: negative dimensions are not allowed
```
So I started digging into this and found the following:
- When this line is executed and either `x[parent_inds[:, 0]]` or `x[parent_inds[:, 1]]` creates a sparse matrix with more than int32 max stored values, the above error occurs.
- When creating slices directly from `AnnData.X`, this restriction somehow appears to vanish, which I have not found a technical reason for. So I concluded that slicing once (`x = x[batch_indices]`) somehow introduces this int32 max size limit, which comes into effect when performing `x[parent_inds[:, 0]]`.
- My solution changes the code so that only a single slicing operation is performed. I think the code changes themselves do not need too much explanation.
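The `ValueError: negative dimensions are not allowed` at the bottom of the traceback is the classic symptom of an int32 nnz overflow: SciPy keeps `indices`/`indptr` as int32 while the matrix is small enough, and an nnz that exceeds 2**31 - 1 in int32 arithmetic wraps to a negative value before it reaches `np.empty`. A minimal sketch of the failure mode (the wrapped value is written out directly so this runs without allocating a 2-billion-element matrix):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small sparse matrices get compact (int32) index arrays by default.
x = sparse_random(1_000, 100, density=0.01, format="csr", random_state=0)
print(x.indices.dtype)

# 2**31 stored values, accumulated in int32, wraps around to this:
wrapped_nnz = np.int32(-(2**31))

# Allocating the result arrays with that nnz fails exactly as in the traceback.
try:
    np.empty(wrapped_nnz, dtype=np.int32)
except ValueError as e:
    print(e)  # negative dimensions are not allowed
```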
If you have any idea why this happens, or if you have a better solution, please enlighten me :D
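For reference, the shape of the change can be sketched as follows (hypothetical stand-in objects and sizes; the real change lives in `src/scvi/external/solo/_model.py`): instead of materializing the batch-restricted matrix first and then fancy-indexing that intermediate, compose the batch indices with the parent indices and index the original matrix directly:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Stand-ins for the real objects, at a toy scale.
x_full = sparse_random(1_000, 50, density=0.05, format="csr", random_state=0)
batch_indices = np.flatnonzero(np.arange(1_000) % 2 == 0)  # cells of one batch
rng = np.random.default_rng(0)
parent_inds = rng.integers(0, batch_indices.size, size=(200, 2))

# Before: two slicing passes -- the intermediate `x` is where the limit appeared.
# x = x_full[batch_indices]
# doublets = x[parent_inds[:, 0]] + x[parent_inds[:, 1]]

# After: compose the indices first, so each slice runs on the original matrix.
doublets = (
    x_full[batch_indices[parent_inds[:, 0]]]
    + x_full[batch_indices[parent_inds[:, 1]]]
)
print(doublets.shape)  # (200, 50)
```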
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 85.31%. Comparing base (89d772a) to head (1ca6043). Report is 57 commits behind head on main.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2947      +/-   ##
==========================================
- Coverage   85.31%   85.31%   -0.01%
==========================================
  Files         167      167
  Lines       14385    14384       -1
==========================================
- Hits        12272    12271       -1
  Misses       2113     2113
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/scvi/external/solo/_model.py | 96.02% <100.00%> (-0.03%) | :arrow_down: |
While I have experienced similar issues in other parts of the code, I don't get why the proposed change fixes the issue. Both the old and new code generate int64 indices in my hands. Is this also the case for you?
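One way to probe this concretely (a small sketch; at this toy scale both dtypes will typically be int32, but the same check applies to the real data):

```python
import numpy as np
from scipy.sparse import random as sparse_random

x = sparse_random(10_000, 200, density=0.01, format="csr", random_state=0)
print("full matrix:", x.indices.dtype, x.indptr.dtype)

# One slicing pass, as in the old code path.
sub = x[np.arange(0, 10_000, 2)]
print("after slice:", sub.indices.dtype, sub.indptr.dtype)
```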
Closing here as it's not reproducible in my hands. Test below.
```python
import numpy as np
import anndata as ad
import scvi
from scipy.sparse import csr_matrix
from scvi.external import SOLO

m = int(5e7)        # number of stored values
n_rows = int(1e6)   # number of rows in the matrix
n_cols = int(1e4)   # number of columns in the matrix

# Generate random indices for rows and columns
row_indices = np.random.randint(low=0, high=n_rows, size=m, dtype=np.int64)
col_indices = np.random.randint(low=0, high=n_cols, size=m, dtype=np.int64)

# Fill with ones
data = np.ones(m, dtype=np.float32)

sorted_indices = np.argsort(row_indices)
sorted_row_indices = row_indices[sorted_indices]
sorted_col_indices = col_indices[sorted_indices]

# Initialize indptr with zeros; it needs to have n_rows + 1 elements
indptr = np.zeros(n_rows + 1, dtype=np.int64)
# Populate indptr by counting the occurrences of each row index
np.cumsum(np.bincount(sorted_row_indices, minlength=n_rows), out=indptr[1:])

large_csr_matrix = csr_matrix((data, sorted_col_indices, indptr), shape=(n_rows, n_cols))
print("Matrix shape:", large_csr_matrix.shape)
print("Number of stored elements:", large_csr_matrix.nnz)

adata = ad.AnnData(X=large_csr_matrix)
adata = ad.concat([adata, adata], label="batch")

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
scvi_model = scvi.model.SCVI(adata)
scvi_model.train(max_epochs=10)

solo = SOLO.from_scvi_model(scvi_model, restrict_to_batch="0")
solo.train(max_epochs=10)
```