
Slicing columns of sparse matrices stored in on-disk formats (e.g. h5py)

Open hanbin973 opened this issue 3 years ago • 1 comments

I've been trying to understand how slicing works on on-disk storage. I currently have a CSC sparse matrix where rows are genes (~30,000) and columns are cells (~2 million).

Suppose that I select a random subset of roughly 10% of the columns using a boolean mask.

idx = np.random.choice((True, False), p=[0.1, 0.9], size=mtx.shape[1])

Loading approximately the same amount of data then takes noticeably different amounts of time, depending on how the columns are selected.

%timeit adata.X[:,idx]
## 2.88 s ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit adata.X[:,:idx.sum()]
## 1.64 s ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I suspect the reason for this is due to sequential reading of data from the disk. The former has to read columns that are scattered across the file, while the latter reads one contiguous block. I think some smart chunking could handle this, but I currently have no idea how chunking can work on sparse formats.

So, how does anndata ondisk sparse format work and why does this difference occur?

  • The data is saved in HDF5 format (accessed through h5py).

hanbin973, Nov 25 '21 23:11

There are a couple of reasons this could be slower, and more than one of them may be in play here.

I suspect the reason for this is due to sequential reading of data from the disk.

Yes. Two reasons for this:

  • The contiguous slice reads fewer unique chunks in total
  • Access is always ordered with slices (though this should also be the case with the boolean mask)
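To make the first point concrete, here is a minimal sketch that counts how many fixed-size chunks of the data array each selection overlaps. It only simulates the layout with NumPy and does not touch a real file. The chunk length, the per-column non-zero counts, and the helper name chunks_touched are assumptions made up for illustration, not values taken from anndata or h5py.

```python
import numpy as np


def chunks_touched(indptr, cols, chunk_size):
    """Count the distinct fixed-size chunks overlapped by the data ranges of `cols`."""
    touched = set()
    for j in cols:
        start, stop = int(indptr[j]), int(indptr[j + 1])
        if stop > start:  # skip empty columns
            first = start // chunk_size
            last = (stop - 1) // chunk_size
            touched.update(range(first, last + 1))
    return len(touched)


rng = np.random.default_rng(0)
n_cols = 10_000
# Hypothetical number of stored values per column.
nnz_per_col = rng.integers(50, 150, size=n_cols)
indptr = np.concatenate(([0], np.cumsum(nnz_per_col)))
chunk_size = 1_000  # hypothetical chunk length, in elements

mask = rng.random(n_cols) < 0.1          # roughly 10% of columns, scattered
scattered = np.flatnonzero(mask)
contiguous = np.arange(mask.sum())       # the same number of columns, adjacent

n_scattered = chunks_touched(indptr, scattered, chunk_size)
n_contiguous = chunks_touched(indptr, contiguous, chunk_size)
print("chunks touched by the scattered mask:  ", n_scattered)
print("chunks touched by the contiguous slice:", n_contiguous)
```

Under these assumptions each chunk holds about ten columns, so a scattered 10% selection still overlaps most chunks, while the contiguous slice overlaps only the first tenth of them.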

So, how does anndata ondisk sparse format work

It's essentially the same as the in-memory sparse formats, and currently uses much of the same code. You'll have separate indptr, indices, and data arrays. The indptr array is read fully into memory and is then used to read contiguous slices out of the indices and data arrays.
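As a rough illustration of that layout, the sketch below selects columns from a CSC matrix using only its indptr, indices, and data arrays. It uses plain NumPy arrays as stand-ins for the on-disk datasets; the helper names (dense_to_csc, csc_select_columns, csc_to_dense) are hypothetical and this is not anndata's actual implementation.

```python
import numpy as np


def dense_to_csc(dense):
    """Build CSC arrays (indptr, indices, data) from a small dense matrix."""
    # nonzero() on the transpose yields entries ordered by column, then row.
    cols, rows = np.nonzero(dense.T)
    counts = np.bincount(cols, minlength=dense.shape[1])
    indptr = np.concatenate(([0], np.cumsum(counts)))
    return indptr, rows, dense[rows, cols]


def csc_select_columns(indptr, indices, data, cols):
    """Gather the given columns with one contiguous slice per column."""
    cols = np.asarray(cols)
    # Column j occupies the contiguous range indptr[j]:indptr[j + 1].
    starts, stops = indptr[cols], indptr[cols + 1]
    new_indptr = np.concatenate(([0], np.cumsum(stops - starts)))
    # With an on-disk store, each slice below would be a separate read.
    new_indices = np.concatenate([indices[a:b] for a, b in zip(starts, stops)])
    new_data = np.concatenate([data[a:b] for a, b in zip(starts, stops)])
    return new_indptr, new_indices, new_data


def csc_to_dense(indptr, indices, data, n_rows):
    """Expand CSC arrays back into a dense matrix, for checking results."""
    n_cols = len(indptr) - 1
    out = np.zeros((n_rows, n_cols), dtype=data.dtype)
    for j in range(n_cols):
        lo, hi = indptr[j], indptr[j + 1]
        out[indices[lo:hi], j] = data[lo:hi]
    return out


dense = np.array([[1, 0, 0, 2],
                  [0, 0, 3, 0],
                  [4, 0, 0, 5]])
indptr, indices, data = dense_to_csc(dense)

mask = np.array([True, False, False, True])
sub = csc_select_columns(indptr, indices, data, np.flatnonzero(mask))
print(csc_to_dense(*sub, n_rows=dense.shape[0]))
```

Adjacent columns could be merged into a single slice, since their ranges in indices and data are back to back; a scattered mask offers few such opportunities, which is one plausible source of the timing gap.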

What kind of chunking strategy were you thinking of?

ivirshup, Nov 29 '21 15:11

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

github-actions[bot], Jun 23 '23 02:06

I’m going to close this for now. Feel free to respond if you want to narrow this down, @hanbin973

flying-sheep, Jun 23 '23 09:06