
Downcast `indices` for sparse matrices if possible on-disk

Open ilan-gold opened this issue 2 months ago • 11 comments

Please describe your wishes and possible alternatives to achieve the desired result.

This is a long-standing issue @felix0097 raised a while ago, but we should begin a spec "change" process to allow writing out the `indices` of a CSR matrix (or CSC, although less valuable there) with the max value checked and the dtype automatically set optimally. Dtypes are specified nowhere in https://anndata.readthedocs.io/en/latest/fileformat-prose.html#sparse-array-specification-v0-1-0, so while this isn't a breaking change, it could potentially complicate things downstream.

In other words, the values in the `indices` of a CSR matrix are regularly less than max(uint16) (because we often don't have more than 30,000 or so genes) but are often written as {u}int32/64, so allowing users to write data optimized for this fact without breaking downstream pipelines is in our interest. The process for this would be:

  1. Add a setting to allow this behavior via `anndata.settings.downcast_indices_in_sparse = True` or similar - I would guess the behavior would be "optionally take the max of the incoming indices and then write out using the minimum needed dtype" (see the sketch after this list)
  2. Write tests within anndata that ensure reading this data back in doesn't break cupy/scipy sparse
  3. Release with this setting defaulting to False
  4. Potentially default it to True in the future
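
A minimal sketch of what the write-time downcast might look like (the helper names here are illustrative, not the actual anndata API; only the setting name is the one proposed above):

```python
import numpy as np
from scipy import sparse

def minimal_index_dtype(max_index: int) -> np.dtype:
    # Pick the smallest unsigned dtype that can represent every index value.
    for candidate in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_index <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError(f"index value {max_index} does not fit any integer dtype")

def downcast_indices(mat: sparse.csr_matrix) -> np.ndarray:
    # Guard against an all-zero matrix, whose indices array is empty.
    max_index = int(mat.indices.max()) if mat.indices.size else 0
    return mat.indices.astype(minimal_index_dtype(max_index))
```

For a typical ~30,000-gene matrix this yields uint16, halving (or quartering, coming from int64) the bytes spent on indices.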

The traditional downside/complication has been scipy sparse's handling of differing `indptr` + `indices` dtypes, but I think that is a manageable problem if we limit ourselves to just IO here. I imagine performance would still be better given the reduced IO, even if the data has to be re-upcast into int32 for this compatibility issue.
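
A sketch of that read-side compatibility upcast, assuming scipy's signed int32/int64 index convention remains the constraint (the function name is hypothetical):

```python
import numpy as np
from scipy import sparse

def read_csr_compat(data, indices, indptr, shape):
    # scipy sparse uses signed int32/int64 index arrays, so indices stored
    # on disk as e.g. uint16 are upcast here before constructing the matrix.
    # Both the column indices (< shape[1]) and indptr (<= nnz) must fit.
    needs_64 = max(shape[1] - 1, len(data)) > np.iinfo(np.int32).max
    idx_dtype = np.int64 if needs_64 else np.int32
    return sparse.csr_matrix(
        (data, indices.astype(idx_dtype), indptr.astype(idx_dtype)),
        shape=shape,
    )
```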

cc @lazappi @ivirshup @keller-mark, happy to produce some dummy data for this for y'all to test. Let me know if you are aware of any downsides here or if you have comments!

ilan-gold avatar Oct 15 '25 22:10 ilan-gold

For {anndataR}, I think reading would be ok and we would need to decide whether to implement something similar for writing. Having some test data to confirm would be good though.

What is the main motivation for this, just saving space on disk? I'm wondering if there would be enough of an improvement to justify the extra complexity.

lazappi avatar Oct 16 '25 05:10 lazappi

> What is the main motivation for this, just saving space on disk? I'm wondering if there would be enough of an improvement to justify the extra complexity.

Yup, sparse matrices would be cut down in size by ~~a factor of 2~~ ~40% on disk or so, so IO would also be faster at the cost of a potential reallocation of memory. I think the tradeoff is worth it, but there are also other sparse libraries out there that aren't scipy and could in theory handle the differing dtypes, in which case there is no tradeoff, just saved IO time.
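
As a rough, uncompressed back-of-the-envelope check (the exact on-disk saving depends on the data dtype and any compression applied):

```python
import numpy as np
from scipy import sparse

# A common single-cell shape: cells x ~30k genes, so indices fit in uint16.
mat = sparse.random(1_000, 30_000, density=0.05, format="csr",
                    dtype=np.float32, random_state=0)

before = mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes
small_indices = mat.indices.astype(np.uint16)
after = mat.data.nbytes + small_indices.nbytes + mat.indptr.nbytes
print(f"{mat.indices.dtype} -> {small_indices.dtype}: "
      f"{before:,} -> {after:,} bytes ({1 - after / before:.0%} smaller)")
```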

ilan-gold avatar Oct 16 '25 06:10 ilan-gold

I have no problem with this from my end, as the reduced array size should be beneficial when loading data via the network and for reduced memory usage in the browser.

keller-mark avatar Oct 16 '25 07:10 keller-mark

Also cc @kaizhang

ilan-gold avatar Oct 20 '25 11:10 ilan-gold

Hi all,

Here are some test anndata files; have a look at the file names for the case each file covers. Basically, I generated files for the smallest, middle, and largest values for which uint{8,16} would be used under this optimization, and then created tests to ensure the dtype is the "smallest". I also have a smallest-value case for uint32, but anything larger (or the largest possible value) would blow up memory when writing an anndata object, although I'm happy to generate just the matrix:

matrices.zip

So let me know if you want, for example, a CSR matrix with a billion columns whose indices would then be uint32 instead of {u}int64.
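
For reference, a sketch of how such a boundary case can be constructed (the shapes and names here are illustrative, not the exact contents of the zip):

```python
import numpy as np
from scipy import sparse

def boundary_csr(n_cols: int) -> sparse.csr_matrix:
    # One row with a single nonzero in the last column, so the maximum
    # index value is exactly n_cols - 1.
    data = np.ones(1, dtype=np.float32)
    indices = np.array([n_cols - 1])
    indptr = np.array([0, 1])
    return sparse.csr_matrix((data, indices, indptr), shape=(1, n_cols))

# Largest shapes still served by uint8 / uint16 indices under the proposal.
uint8_case = boundary_csr(np.iinfo(np.uint8).max + 1)    # max index 255
uint16_case = boundary_csr(np.iinfo(np.uint16).max + 1)  # max index 65535
```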

ilan-gold avatar Oct 20 '25 14:10 ilan-gold

ping everyone!

ilan-gold avatar Oct 28 '25 15:10 ilan-gold

I tried reading a few of the examples with {anndataR} and it seemed to work without any issues, so it should be fine from that side. We would need to implement something similar for writing, but that could come later.

lazappi avatar Oct 29 '25 07:10 lazappi

> it seemed to work without any issues

Awesome, thanks! No rush on writing, just wanted to make sure reading worked. I'm being extra cautious about file changes.

ilan-gold avatar Oct 29 '25 08:10 ilan-gold

From the call yesterday, I learned that R only has one kind of integer, so they are fine. I think @kaizhang's Rust package is my bigger concern. I imagine you have strict types on what kind of data is expected to be on-disk, but maybe not?

ilan-gold avatar Oct 31 '25 13:10 ilan-gold

anndata-rs seems to handle only int32 and int64: https://github.com/kaizhang/anndata-rs/blob/df51c561e1efc453696fb2ae6c1a567e812f73a2/pyanndata/src/data/array.rs#L90

but that should be quick to remedy: https://github.com/kaizhang/anndata-rs/pull/18

flying-sheep avatar Nov 11 '25 14:11 flying-sheep

@flying-sheep Thanks, and yes, this should be easy to do in anndata-rs. I'll add this to main after the feature is merged into Python anndata.

kaizhang avatar Nov 12 '25 01:11 kaizhang