Fix solo bug with large batches
Hello everyone, this is my first time contributing here, so let me know if I got something wrong with the way I approached this.
I have a very large batch in an AnnData object: 803,283 cells (the AnnData contains 10,811 genes).
After training an scVI model, I ran the following:
```python
solo = scvi.external.SOLO.from_scvi_model(scvi_model, restrict_to_batch="<batch>")
```

and got this error:
```pytb
ValueError                                Traceback (most recent call last)
Cell In[109], line 1
----> 1 doublets = x[parent_inds[:, 0]] + x[parent_inds[:, 1]]

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_csr.py:24, in _csr_base.__getitem__(self, key)
     22 def __getitem__(self, key):
     23     if self.ndim == 2:
---> 24         return super().__getitem__(key)
     26     if isinstance(key, tuple) and len(key) == 1:
     27         key = key[0]

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_index.py:83, in IndexMixin.__getitem__(self, key)
     81     return self._get_arrayXint(row, col)
     82 elif isinstance(col, slice):
---> 83     return self._get_arrayXslice(row, col)
     84 else:  # row.ndim == 2
     85     if isinstance(col, INT_TYPES):

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_csr.py:277, in _csr_base._get_arrayXslice(self, row, col)
    275     col = np.arange(*col.indices(self.shape[1]))
    276     return self._get_arrayXarray(row, col)
--> 277 return self._major_index_fancy(row)._get_submatrix(minor=col)

File ~/miniconda3/envs/scvi/lib/python3.12/site-packages/scipy/sparse/_compressed.py:765, in _cs_matrix._major_index_fancy(self, idx)
    762 np.cumsum(row_nnz, out=res_indptr[1:])
    764 nnz = res_indptr[-1]
--> 765 res_indices = np.empty(nnz, dtype=idx_dtype)
    766 res_data = np.empty(nnz, dtype=self.dtype)
    767 csr_row_index(
    768     M,
    769     indices,
(...)
    774     res_data
    775 )

ValueError: negative dimensions are not allowed
```
So I started digging into this and found the following:
- When this line is executed and either `x[parent_inds[:, 0]]` or `x[parent_inds[:, 1]]` creates a sparse matrix with more than int32 max stored values, the above error occurs.
- When creating slices directly from `AnnData.X`, this restriction somehow appears to vanish, which I have not found a technical reason for. So I concluded that slicing once (`x = x[batch_indices]`) somehow introduces this int32 max size limit, which comes into effect when performing `x[parent_inds[:, 0]]`.
- My solution changes the code so that only a single slicing operation is performed. I think the code changes themselves do not need too much explanation.
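The `ValueError: negative dimensions are not allowed` at the bottom of the traceback is the classic symptom of an int32 nnz overflow: SciPy keeps `indices`/`indptr` as int32 while the matrix is small enough, and an nnz that exceeds 2**31 - 1 in int32 arithmetic wraps to a negative value before it reaches `np.empty`. A minimal sketch of the failure mode (the wrapped value is written out directly so this runs without allocating a 2-billion-element matrix):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small sparse matrices get compact (int32) index arrays by default.
x = sparse_random(1_000, 100, density=0.01, format="csr", random_state=0)
print(x.indices.dtype)

# 2**31 stored values, accumulated in int32, wraps around to this:
wrapped_nnz = np.int32(-(2**31))

# Allocating the result arrays with that nnz fails exactly as in the traceback.
try:
    np.empty(wrapped_nnz, dtype=np.int32)
except ValueError as e:
    print(e)  # negative dimensions are not allowed
```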
If you have any idea why this happens, or if you have a better solution, please enlighten me :D
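For reference, the shape of the change can be sketched as follows (hypothetical stand-in objects and sizes; the real change lives in `src/scvi/external/solo/_model.py`): instead of materializing the batch-restricted matrix first and then fancy-indexing that intermediate, compose the batch indices with the parent indices and index the original matrix directly:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Stand-ins for the real objects, at a toy scale.
x_full = sparse_random(1_000, 50, density=0.05, format="csr", random_state=0)
batch_indices = np.flatnonzero(np.arange(1_000) % 2 == 0)  # cells of one batch
rng = np.random.default_rng(0)
parent_inds = rng.integers(0, batch_indices.size, size=(200, 2))

# Before: two slicing passes -- the intermediate `x` is where the limit appeared.
# x = x_full[batch_indices]
# doublets = x[parent_inds[:, 0]] + x[parent_inds[:, 1]]

# After: compose the indices first, so each slice runs on the original matrix.
doublets = (
    x_full[batch_indices[parent_inds[:, 0]]]
    + x_full[batch_indices[parent_inds[:, 1]]]
)
print(doublets.shape)  # (200, 50)
```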
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 85.31%. Comparing base (89d772a) to head (1ca6043). Report is 57 commits behind head on main.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2947      +/-   ##
==========================================
- Coverage   85.31%   85.31%   -0.01%
==========================================
  Files         167      167
  Lines       14385    14384       -1
==========================================
- Hits        12272    12271       -1
  Misses       2113     2113
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/scvi/external/solo/_model.py | 96.02% <100.00%> (-0.03%) | :arrow_down: |
While I have experienced similar issues in other parts of the code, I don't get why the proposed change fixes the issue. Both the old and new code generate int64 indices in my hands. Is this also the case for you?
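One way to probe this concretely (a small sketch; at this toy scale both dtypes will typically be int32, but the same check applies to the real data):

```python
import numpy as np
from scipy.sparse import random as sparse_random

x = sparse_random(10_000, 200, density=0.01, format="csr", random_state=0)
print("full matrix:", x.indices.dtype, x.indptr.dtype)

# One slicing pass, as in the old code path.
sub = x[np.arange(0, 10_000, 2)]
print("after slice:", sub.indices.dtype, sub.indptr.dtype)
```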
Closing here as it's not reproducible in my hands. Test below.
```python
import numpy as np
import anndata as ad
import scvi
from scipy.sparse import csr_matrix
from scvi.external import SOLO

m = int(5e7)        # number of stored values
n_rows = int(1e6)   # number of rows in the matrix
n_cols = int(1e4)   # number of columns in the matrix

# Generate random indices for rows and columns
row_indices = np.random.randint(low=0, high=n_rows, size=m, dtype=np.int64)
col_indices = np.random.randint(low=0, high=n_cols, size=m, dtype=np.int64)

# Fill with ones
data = np.ones(m, dtype=np.float32)

sorted_indices = np.argsort(row_indices)
sorted_row_indices = row_indices[sorted_indices]
sorted_col_indices = col_indices[sorted_indices]

# Initialize indptr with zeros; it needs to have n_rows + 1 elements
indptr = np.zeros(n_rows + 1, dtype=np.int64)
# Populate indptr by counting the occurrences of each row index
np.cumsum(np.bincount(sorted_row_indices, minlength=n_rows), out=indptr[1:])

large_csr_matrix = csr_matrix((data, sorted_col_indices, indptr), shape=(n_rows, n_cols))
print("Matrix shape:", large_csr_matrix.shape)
print("Number of stored elements:", large_csr_matrix.nnz)

adata = ad.AnnData(X=large_csr_matrix)
adata = ad.concat([adata, adata], label="batch")

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
scvi_model = scvi.model.SCVI(adata)
scvi_model.train(max_epochs=10)

solo = SOLO.from_scvi_model(scvi_model, restrict_to_batch="0")
solo.train(max_epochs=10)
```