rapids-single-cell-examples OverflowError: value too large to convert to int

Could I ask if you might have any tips on how to overcome this error?

I'm running your 1M cell code, but I tried it on my own set of 2.8M cells.

Here's my matrix:

sparse_gpu_array.shape
# (2886934, 33567)

sparse_gpu_array.nnz
# 4128695018

Let's try to run this:

sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, genes, min_cells=1000)

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<timed exec> in <module>

~/work/github.com/slowkow/rapids-single-cell-examples/notebooks/rapids_scanpy_funcs.py in filter_genes(sparse_gpu_array, genes_idx, min_cells)
    269         Genes containing a number of cells below this value will be filtered
    270     """
--> 271     thr = np.asarray(sparse_gpu_array.sum(axis=0) >= min_cells).ravel()
    272     filtered_genes = cp.sparse.csr_matrix(sparse_gpu_array[:, thr])
    273     genes_idx = genes_idx[np.where(thr)[0]]

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupyx/scipy/sparse/base.py in sum(self, axis, dtype, out)
    388 
    389         if axis == 0:
--> 390             ret = self.T.dot(cupy.ones(m, dtype=self.dtype)).reshape(1, n)
    391         else:  # axis == 1
    392             ret = self.dot(cupy.ones(n, dtype=self.dtype)).reshape(m, 1)

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupyx/scipy/sparse/base.py in dot(self, other)
    307     def dot(self, other):
    308         """Ordinary dot product"""
--> 309         return self * other
    310 
    311     def getH(self):

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupyx/scipy/sparse/csc.py in __mul__(self, other)
    111                 return self._with_data(self.data * other)
    112             elif other.ndim == 1:
--> 113                 self.sum_duplicates()
    114                 if cusparse.check_availability('csrmv'):
    115                     csrmv = cusparse.csrmv

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupyx/scipy/sparse/compressed.py in sum_duplicates(self)
    333             self._has_canonical_format = True
    334             return
--> 335         coo = self.tocoo()
    336         coo.sum_duplicates()
    337         self.__init__(coo.asformat(self.format))

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupyx/scipy/sparse/csc.py in tocoo(self, copy)
    214 
    215         """
--> 216         return self.T.tocoo(copy).T
    217 
    218     def tocsc(self, copy=None):

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupyx/scipy/sparse/csr.py in tocoo(self, copy)
    268             indices = self.indices
    269 
--> 270         return cusparse.csr2coo(self, data, indices)
    271 
    272     def tocsc(self, copy=False):

~/.conda/envs/rapidgenomics/lib/python3.7/site-packages/cupy/cusparse.py in csr2coo(x, data, indices)
    900     cusparse.xcsr2coo(
    901         handle, x.indptr.data.ptr, nnz, m, row.data.ptr,
--> 902         cusparse.CUSPARSE_INDEX_BASE_ZERO)
    903     # data and indices did not need to be copied already
    904     return cupyx.scipy.sparse.coo_matrix(

cupy/cuda/cusparse.pyx in cupy.cuda.cusparse.xcsr2coo()

OverflowError: value too large to convert to int
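
A quick check that may explain the error. The traceback ends in cusparse.xcsr2coo, which receives nnz as an argument. A likely reading (an assumption, not confirmed in this thread) is that the call expects a 32-bit signed integer and the matrix above has more nonzeros than that type can hold. The snippet below only compares the reported nnz with the int32 limit:

```python
import numpy as np

# nnz as reported by sparse_gpu_array.nnz above
nnz = 4128695018

# Largest value a 32-bit signed integer can hold
int32_max = np.iinfo(np.int32).max

# True means nnz cannot be passed to an API that takes a 32-bit int
exceeds_int32 = nnz > int32_max
print(int32_max, exceeds_int32)
```

If that reading is right, any operation that converts the whole matrix in one call would hit the same limit, regardless of the min_cells value.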

Nov 02 '20 21:11 slowkow

Hi @slowkow,

It looks like this issue may have been addressed already in cupy/cupy#4223. We are running into similar problems as we work through upcoming changes to use CuPy 8.0 and move more of the filtering logic onto the GPU.

An option for us to get around the size limitation in the gene filtering step might be to allocate an empty 1-d output array of size n_cells and then perform the sum over a few batches. Take the following as an example to populate the summed array with the sums across the genes for the first 100 cells:

summed_gpu_array = cp.empty(sparse_gpu_array.shape[0], dtype=cp.float32)
summed_gpu_array[0:100] = sparse_gpu_array[0:100].sum(axis=0)
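
For the gene filtering step specifically, the per-gene totals have length n_genes, so one way to apply the batching idea is to accumulate column sums over row slices. The following is a minimal sketch under stated assumptions: batched_column_sums and batch_size are hypothetical names, and SciPy on the CPU stands in for the GPU matrix so the example runs anywhere. Whether the same row slicing works on a cupyx.scipy.sparse matrix with this many nonzeros in a given CuPy version is not verified here.

```python
import numpy as np
import scipy.sparse as sp


def batched_column_sums(matrix, batch_size=100_000):
    """Sum each column (gene) of a CSR matrix over batches of rows (cells).

    Hypothetical helper: not part of rapids_scanpy_funcs.
    """
    n_rows, n_cols = matrix.shape
    totals = np.zeros(n_cols, dtype=np.float64)
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Each row slice holds far fewer nonzeros than the full matrix
        totals += np.asarray(matrix[start:stop].sum(axis=0)).ravel()
    return totals


# Small CPU stand-in for sparse_gpu_array (assumption: SciPy instead of CuPy)
demo = sp.random(1000, 50, density=0.1, format="csr",
                 random_state=0, dtype=np.float32)
per_gene = batched_column_sums(demo, batch_size=128)

# Threshold the totals the same way filter_genes does with min_cells
keep = per_gene >= 5.0  # 5.0 is an illustrative threshold
```

The accumulator stays one-dimensional and small (one entry per gene), so only the row slices need to fit under whatever size limit applies.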

Nov 04 '20 19:11 cjnolet

Corey, thanks for the reply! If I eventually get back to this error, I might try to modify your function filter_genes() to perform a sum over multiple batches and see if the code runs from that point onward.

Could I please ask if you have successfully run the RAPIDS analysis on a real dataset that is larger than the 1M cell dataset?

Nov 04 '20 20:11 slowkow
