anndata
Slicing columns of sparse matrices with on-disk formats (e.g. h5py)
I've been trying to understand how slicing works with on-disk storage. I currently have a CSC sparse matrix where rows are genes (~30,000) and columns are cells (~2 million).
Suppose that I randomly select a subset of columns from the data.
idx = np.random.choice((True, False), p=[0.1, 0.9], size=mtx.shape[1])
Loading approximately the same amount of data then takes noticeably different times:
%timeit adata.X[:,idx]
## 2.88 s ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit adata.X[:,:idx.sum()]
## 1.64 s ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
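The two access patterns can be reproduced in-memory with scipy (a small sketch; the matrix here is a stand-in for the real backed AnnData, and the sizes are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# Small stand-in for the real matrix: rows are genes, columns are cells.
mtx = sp.random(300, 2000, density=0.01, format="csc", random_state=0)

# Boolean mask selecting ~10% of columns at random positions.
idx = rng.choice([True, False], p=[0.1, 0.9], size=mtx.shape[1])

masked = mtx[:, idx]              # scattered columns
contiguous = mtx[:, : idx.sum()]  # one contiguous block of columns

# Both select the same number of columns, so roughly the same data volume.
print(masked.shape, contiguous.shape)
```

In memory both selections are fast; the timing gap only appears once each selected region has to come off the disk separately.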
I suspect this is due to sequential reading of data from the disk. The former has to read widely separated columns, while the latter reads contiguously ordered data. I think some smart chunking could handle this, but I currently have no idea how chunking works for sparse formats.
So, how does anndata's on-disk sparse format work, and why does this difference occur?
- The data is saved in HDF5 format (via h5py).
There are a couple of reasons this could be slower, and more than one may be at play here.
> I suspect this is due to sequential reading of data from the disk.
Yes. Two reasons for this:
- Reading in fewer unique chunks total
- Access is always ordered with slices (though this should also be the case with the boolean mask)
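The first point can be made concrete by counting how many separate contiguous regions a random boolean mask breaks into, since each region is its own read (a sketch with an illustrative mask; the exact run count depends on the mask):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random mask over ~2 million columns, ~10% selected.
idx = rng.choice([True, False], p=[0.1, 0.9], size=2_000_000)

# Count maximal runs of selected columns: each run is one separate
# contiguous region that must be fetched from disk.
selected = np.flatnonzero(idx)
runs = 1 + int(np.count_nonzero(np.diff(selected) > 1))
print(f"{selected.size} columns selected in {runs} contiguous runs")
```

An equally sized slice like `[:, :idx.sum()]` is always a single run, so it touches far fewer unique chunks.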
> So, how does anndata's on-disk sparse format work
It's essentially the same as the in-memory sparse formats, and currently uses much of the same code. You'll have separate `indptr`, `indices`, and `data` arrays. The `indptr` array will be fully read into memory, which will then be used to read contiguous slices out of the `indices` and `data` arrays.
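As a sketch of that layout (standard CSC semantics, not anndata's actual reader code): column `j` lives in the contiguous slice `indptr[j]:indptr[j+1]` of `indices` and `data`, which is why a run of adjacent columns can be fetched in a single read.

```python
import numpy as np
import scipy.sparse as sp

mtx = sp.random(50, 40, density=0.2, format="csc", random_state=0)

# The three arrays that would live on disk.
indptr, indices, data = mtx.indptr, mtx.indices, mtx.data

def get_column(j):
    """Rebuild dense column j from the raw CSC arrays."""
    start, stop = indptr[j], indptr[j + 1]  # one contiguous slice
    col = np.zeros(mtx.shape[0])
    col[indices[start:stop]] = data[start:stop]
    return col

# Matches scipy's own column slicing.
assert np.allclose(get_column(3), mtx[:, [3]].toarray().ravel())
```

Scattered columns mean many small `start:stop` reads; a contiguous block of columns collapses into one `indptr[j0]:indptr[j1]` read.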
What kind of chunking strategy were you thinking of?
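For what it's worth, one knob to experiment with is the HDF5 chunk size of the `indices` and `data` datasets. A hedged sketch using plain h5py (the group/dataset names mirror the layout described above, and the chunk size of 1024 is an illustrative guess, not a recommendation):

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
import h5py

mtx = sp.random(300, 2000, density=0.01, format="csc", random_state=0)
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("X")
    # indptr is small and read fully into memory, so its chunking
    # matters little.
    grp.create_dataset("indptr", data=mtx.indptr)
    # Larger chunks for indices/data reduce per-read overhead, at the
    # cost of reading extra unneeded elements when columns are scattered.
    grp.create_dataset("indices", data=mtx.indices, chunks=(1024,))
    grp.create_dataset("data", data=mtx.data, chunks=(1024,))

with h5py.File(path, "r") as f:
    print(f["X/data"].chunks)
```

Since a boolean mask turns into many small reads, the trade-off is between chunk-decompression overhead and over-reading, so the best chunk size likely depends on how scattered your selections are.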
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!
I’m going to close this for now. Feel free to respond if you want to narrow this down, @hanbin973