Downcast `indices` for sparse matrices if possible on-disk
This is a long-standing issue that @felix0097 raised a while ago, but we should begin a spec "change" process to allow writing out the indices of a CSR matrix (or CSC, although it's less valuable there) with the max value checked and the dtype set automatically to the smallest that fits. Dtypes are nowhere specified in https://anndata.readthedocs.io/en/latest/fileformat-prose.html#sparse-array-specification-v0-1-0, so while this isn't a breaking change, it could potentially complicate things downstream.
In other words, the values in the `indices` of a CSR matrix are regularly less than max(uint16) (because we often don't have more than 30,000 or so genes) but are often written as {u}int32/64, so allowing users to write data optimized for this fact without breaking downstream pipelines is in our interest. The process for this would be:
- Add a setting to allow this behavior via `anndata.settings.downcast_indices_in_sparse = True` or similar - I would guess the behavior would be "take the max of the incoming `indices` optionally and then write out as the minimum needed dtype" (see the sketch after this list)
- Write tests within `anndata` that ensure reading this data back in doesn't break cupy/scipy sparse
- Release with this setting as `False`
- Potentially in the future set to `True`
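
For concreteness, a minimal sketch of the proposed write-side behavior; `smallest_index_dtype` is a hypothetical helper for illustration, not an actual anndata API:

```python
import numpy as np
from scipy import sparse

def smallest_index_dtype(indices: np.ndarray) -> np.dtype:
    """Return the smallest unsigned dtype that can hold every index value."""
    max_val = int(indices.max()) if indices.size else 0
    for dt in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_val <= np.iinfo(dt).max:
            return np.dtype(dt)
    raise ValueError(f"index {max_val} exceeds uint64 range")

# A CSR matrix over ~30k genes: scipy stores its indices as int32 ...
X = sparse.random(1_000, 30_000, density=0.01, format="csr", dtype=np.float32)
assert X.indices.dtype == np.int32

# ... but every value fits in uint16, which is what would go on disk.
assert smallest_index_dtype(X.indices) == np.uint16
```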
The traditional downside/complication has been scipy sparse's handling of differing `indptr` + `indices` dtypes, but I think that is a manageable problem if we limit ourselves to just io here. Given the lessened io, I imagine the performance would still be better even if the data has to be re-upcast into int32 in memory for this compatibility issue.
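
A sketch of what that read-side shim could look like (illustrative only, not the actual anndata code path):

```python
import numpy as np
from scipy import sparse

def load_csr(data, indices, indptr, shape):
    # Upcast downcasted on-disk indices back to a dtype scipy accepts,
    # paying a one-off reallocation in exchange for the smaller read.
    idx_dtype = np.int64 if shape[1] > np.iinfo(np.int32).max else np.int32
    return sparse.csr_matrix(
        (data, indices.astype(idx_dtype), indptr.astype(idx_dtype)),
        shape=shape,
    )
```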
cc @lazappi @ivirshup @keller-mark, happy to produce some dummy data for y'all to test. Let me know if you are aware of any downsides here or if you have comments!
For {anndataR}, I think reading would be ok and we would need to decide whether to implement something similar for writing. Having some test data to confirm would be good though.
What is the main motivation for this, just saving space on disk? I'm wondering if there would be enough of an improvement to justify the extra complexity.
> What is the main motivation for this, just saving space on disk? I'm wondering if there would be enough of an improvement to justify the extra complexity.
Yup, sparse matrices would be cut down in size by ~~a factor of 2~~ ~40% on disk or so, so io would also be faster at the cost of the potential reallocation of memory. I think the tradeoff is worth it, but there are also other sparse libraries out there that aren't scipy and could in theory handle the differing data types, in which case there is no tradeoff, just saved io time.
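
Back of the envelope, assuming uncompressed storage and ignoring the comparatively tiny `indptr` array, the per-nonzero cost is sizeof(data dtype) + sizeof(indices dtype):

```python
# Bytes per nonzero before/after downcasting indices to uint16 (float32 data).
f32, i64, i32, u16 = 4, 8, 4, 2

print(1 - (f32 + u16) / (f32 + i64))  # 0.5  -> ~50% smaller from int64
print(1 - (f32 + u16) / (f32 + i32))  # 0.25 -> ~25% smaller from int32
```

so the ~40% figure lands somewhere between those, depending on the starting index dtype and on-disk compression.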
I have no problem with this from my end, as the reduced array size should be beneficial when loading data via the network and for reduced memory usage in the browser.
Also cc @kaizhang
Hi all,
Here are some test anndata files; have a look at the file names for the case each file covers. Basically, I generated files for the smallest, middle, and largest values for which uint{8,16} would be used under this optimization, and then created tests to ensure the dtype is the "smallest". I also have a smallest-value file for uint32, but anything larger (or the largest possible value) would blow up memory when writing an anndata object, although I'm happy to generate just the matrix:
So let me know if you want, for example, a CSR matrix with a billion columns whose indices would then be uint32 instead of {u}int64.
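
For anyone who wants to regenerate this kind of data, here is a hypothetical generator in that spirit (file names and layout are illustrative, not the actual attachments):

```python
import numpy as np
import anndata as ad
from scipy import sparse

# One file per dtype boundary, placing nonzeros at the smallest, middle,
# and largest column index representable in that dtype.
for dt in (np.uint8, np.uint16):
    top = np.iinfo(dt).max                  # 255 or 65535
    indices = np.array([0, top // 2, top])  # smallest / middle / largest
    X = sparse.csr_matrix(
        (np.ones(3, dtype=np.float32), indices, np.array([0, 3])),
        shape=(1, top + 1),
    )
    ad.AnnData(X=X).write_h5ad(f"max_{np.dtype(dt).name}.h5ad")
```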
ping everyone!
I tried reading a few of the examples with {anndataR} and it seemed to work without any issues, so it should be fine from that side. We would need to implement something similar for writing, but that could come later.
> it seemed to work without any issues
Awesome, thanks! No rush on writing, I just wanted to make sure reading worked. I'm being extra cautious about file changes.
From the call yesterday, I learned that R only has one kind of integer, so they are fine. I think the Rust side is my bigger concern, @kaizhang. I imagine you have strict types for what kind of data is expected to be on disk, but maybe not?
anndata-rs seems to handle only int32 and int64: https://github.com/kaizhang/anndata-rs/blob/df51c561e1efc453696fb2ae6c1a567e812f73a2/pyanndata/src/data/array.rs#L90
but that should be quick to remedy: https://github.com/kaizhang/anndata-rs/pull/18
@flying-sheep Thanks, and yes, this should be easy to do in anndata-rs. I'll add this to main after the feature is merged into Python anndata.