
Keep providing a non-nullable string type?

Open · flying-sheep opened this issue 4 weeks ago · 6 comments

With pandas 3, string arrays will be inferred as one of a dedicated pair of types (pd.arrays.{Arrow,}StringArray) and will no longer behave like numpy object arrays; specifically, they now have well-defined missing-value (“nullable”) behavior, see PDEP 14.
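For illustration, a minimal sketch of the inference change (assuming a default pandas ≥ 3 installation):

```python
import pandas as pd  # assuming pandas >= 3

# Plain string data is no longer inferred as a numpy object array
s = pd.Series(["a", "b", None])
print(s.dtype)         # str (a dedicated string dtype), not object
print(s.isna().sum())  # 1 -- missing values have well-defined semantics
```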

We added experimental support for writing {Arrow,}StringArray as “nullable string arrays” behind a flag in the past, and with anndata 0.13, this will become active by default.

However, this means that a “non-nullable” array read from disk becomes a “nullable” one in memory, and subsequent writes will serialize it as nullable.

This raises two questions:

  • is there an in-memory overhead?

  • can we avoid the on-disk overhead?

    we could either try a different serialization (oof, late!) or give users a way to indicate non-nullability (maybe using some string dtype variant)

Since the first question has a single factual answer (there’s nothing to decide) that matters for the second, I tried to find it below:

Results, TL;DR

No in-memory overhead, if we can trust the self-reported memory usage.

NumpyExtensionArray with np.dtypes.StringDType() is small and non-nullable. We should test that it gets written using the old non-nullable string array serialization, and maybe advise users to convert string columns to it if that boolean mask on disk is too big. Opinions?
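For reference, a sketch of what that conversion could look like (df and "col" are hypothetical; the constructor call mirrors the profiling script below, and np.dtypes.StringDType assumes numpy ≥ 2):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]})  # hypothetical string column

# Convert to numpy's variable-width native string dtype and wrap it in
# NumpyExtensionArray, pandas' thin wrapper around a plain numpy array
raw = np.asarray(df["col"], dtype=np.dtypes.StringDType())
df["col"] = pd.arrays.NumpyExtensionArray(raw)
```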

Results, Details

I tried to figure out how much memory the different options use (see below), but didn’t get a convincing result, I think because the memory just doesn’t get freed immediately.

pd.arrays.NumpyExtensionArray is just a slim wrapper around a numpy array and doesn’t take noticeably more memory. The same currently applies to pd.arrays.StringArray (i.e. pd.StringDtype(storage="python")).

So here are the self-reported stats (using pandas’ memory_usage, which also tracks object overhead):

| Array class | dtype | Self-reported size |
| --- | --- | --- |
| NumpyExtensionArray | "U10" | 40 MB |
| NumpyExtensionArray | object | 59 MB |
| NumpyExtensionArray | np.dtypes.StringDType(na_object=None) | 16 MB |
| NumpyExtensionArray | np.dtypes.StringDType() | 16 MB |
| StringArray | pd.StringDtype(storage="python") | 59 MB |
| ArrowStringArray | pd.StringDtype(storage="pyarrow") | 18 MB |

These are almost exactly what I would expect, but I’d have loved to actually verify that.
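For reference, a minimal sketch of the memory_usage helper that produced the numbers above (a reconstruction based on the script output, not the exact code):

```python
import pandas as pd

def memory_usage(arr) -> int:
    # deep=True makes pandas count per-Python-object overhead,
    # not just the size of the underlying buffer
    return pd.Series(arr).memory_usage(index=False, deep=True)
```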

My confirmed assumptions (if these are to be trusted):

  • Python objects have overhead over dedicated storage (safe assumption, this is why numpy and pandas exist)
  • NumpyExtensionArray with dtype=object is almost exactly as big as StringArray, and they’re the biggest (see above)
  • np.dtypes.StringDType, which can optionally support na_object, isn’t bigger when that option is enabled (i.e. nullability is free; see the sketch after this list)
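A minimal sketch of the two dtype variants (assuming numpy ≥ 2):

```python
import numpy as np

# Variant without missing-value support
plain = np.array(["x", "y"], dtype=np.dtypes.StringDType())

# Variant with None configured as the missing-value sentinel; per the
# table above, both variants self-report the same footprint (16 MB)
nullable = np.array(["x", None], dtype=np.dtypes.StringDType(na_object=None))
print(nullable[1] is None)  # True: the sentinel round-trips
```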

My contradicted assumptions (if these are to be trusted):

  • ArrowStringArray is smaller or has at most constant overhead over np.dtypes.StringDType (seems to have O(N) overhead)

How I tried to measure it

I made this script to profile the memory usage.
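The script itself isn’t reproduced here; based on the output below, each profiled function looked roughly like this (mk_raw, N, and the decorator arguments are inferred, not verbatim):

```python
import gc

import numpy as np
import pandas as pd
from memory_profiler import profile

N = 1_000_000  # inferred: 1e6 ten-char "U10" strings ~ 40 MB, matching the table

def mk_raw(dtype="U10") -> np.ndarray:
    # hypothetical generator for N ten-character strings
    return np.array([f"str-{i:06}" for i in range(N)], dtype=dtype)

def memory_usage(arr) -> int:  # as sketched above
    return pd.Series(arr).memory_usage(index=False, deep=True)

@profile
def profile_numpy_unicode_mem() -> int:
    arr = pd.arrays.NumpyExtensionArray(mk_raw()); gc.collect()
    bytes_ = memory_usage(arr); gc.collect()
    arr.tolist(); gc.collect()
    del arr; gc.collect()
    return bytes_
```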

Script output:

profile_numpy_unicode_mem (Self-reported: 40.00 MB):
Profiled:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    52    297.3 MiB    297.3 MiB           1   @profile(backend=backend, stream=file)
    53                                         def profile_numpy_unicode_mem() -> int:
    54    338.2 MiB     40.9 MiB           1       arr = pd.arrays.NumpyExtensionArray(mk_raw()); gc.collect()
    55    338.2 MiB      0.0 MiB           1       bytes_ = memory_usage(arr); gc.collect()
    56    345.3 MiB      7.1 MiB           1       arr.tolist(); gc.collect()
    57    307.1 MiB    -38.1 MiB           1       del arr; gc.collect()

profile_numpy_obj_mem (Self-reported: 59.00 MB):
Profiled:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    61    297.1 MiB    297.1 MiB           1   @profile(backend=backend, stream=file)
    62                                         def profile_numpy_obj_mem() -> int:
    63    366.8 MiB     69.8 MiB           1       arr = pd.arrays.NumpyExtensionArray(mk_raw(object)); gc.collect()
    64    366.9 MiB      0.1 MiB           1       bytes_ = memory_usage(arr); gc.collect()
    65    374.5 MiB      7.6 MiB           1       arr.tolist(); gc.collect()
    66    315.5 MiB    -59.1 MiB           1       del arr; gc.collect()

profile_numpy_str_mem_na (Self-reported: 16.00 MB):
Profiled:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    70    297.1 MiB    297.1 MiB           1   @profile(backend=backend, stream=file)
    71                                         def profile_numpy_str_mem_na() -> int:
    72    313.3 MiB     16.1 MiB           1       arr = pd.arrays.NumpyExtensionArray(mk_raw(np.dtypes.StringDType(na_object=None))); gc.collect()
    73    313.3 MiB      0.0 MiB           1       bytes_ = memory_usage(arr); gc.collect()
    74    322.2 MiB      8.9 MiB           1       arr.tolist(); gc.collect()
    75    322.2 MiB      0.0 MiB           1       del arr; gc.collect()

profile_numpy_str_mem (Self-reported: 16.00 MB):
Profiled:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    79    297.0 MiB    297.0 MiB           1   @profile(backend=backend, stream=file)
    80                                         def profile_numpy_str_mem() -> int:
    81    313.1 MiB     16.1 MiB           1       arr = pd.arrays.NumpyExtensionArray(mk_raw(np.dtypes.StringDType())); gc.collect()
    82    313.1 MiB      0.0 MiB           1       bytes_ = memory_usage(arr); gc.collect()
    83    322.1 MiB      9.0 MiB           1       arr.tolist(); gc.collect()
    84    322.1 MiB      0.0 MiB           1       del arr; gc.collect()

profile_python_mem (Self-reported: 59.00 MB):
Profiled:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    88    297.1 MiB    297.1 MiB           1   @profile(backend=backend, stream=file)
    89                                         def profile_python_mem() -> int:
    90    367.0 MiB     69.9 MiB           1       arr = pd.array(mk_raw(), dtype=pd.StringDtype("python")); gc.collect()
    91    367.0 MiB      0.0 MiB           1       bytes_ = memory_usage(arr); gc.collect()
    92    374.6 MiB      7.7 MiB           1       arr.tolist(); gc.collect()
    93    315.6 MiB    -59.1 MiB           1       del arr; gc.collect()

profile_arrow_mem (Self-reported: 18.00 MB):
Profiled:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    97    297.2 MiB    297.2 MiB           1   @profile(backend=backend, stream=file)
    98                                         def profile_arrow_mem() -> int:
    99    349.2 MiB     52.0 MiB           1       arr = pd.array(mk_raw(), dtype=pd.StringDtype("pyarrow")); gc.collect()
   100    349.6 MiB      0.4 MiB           1       bytes_ = memory_usage(arr); gc.collect()
   101    356.8 MiB      7.2 MiB           1       arr.tolist(); gc.collect()
   102    356.9 MiB      0.1 MiB           1       del arr; gc.collect()

flying-sheep · Dec 16 '25 12:12

> NumpyExtensionArray with np.dtypes.StringDType() is small and non-nullable. We should test that it gets written using the old non-nullable string array serialization, and maybe advise users to convert string columns to it if that boolean mask on disk is too big. Opinions?

https://github.com/scverse/anndata/blob/532386c38dd342170be6c97556b7a7ab0df49689/src/anndata/_io/specs/methods.py#L1177-L1193

So, thinking about this a bit: what if None (i.e. the default for allow_write_nullable_strings) meant that we check for null values? I doubt the runtime penalty is worse than writing those values to disk (which is the current option). And in most cases I would assume that people are not using nullability. So we’d lose nothing, and we could serialize to the old non-nullable format.

Then you’d have to set it to True to get the “always write nullable as nullable, no check” behavior.
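A rough sketch of the proposed decision logic (names are illustrative, not the actual anndata code):

```python
import pandas as pd

def should_write_nullable(arr, allow_write_nullable_strings) -> bool:
    """Decide whether to use the nullable on-disk string format."""
    if allow_write_nullable_strings is None:
        # proposed default: pay for the nullable format (extra mask
        # on disk) only when missing values are actually present
        return bool(pd.isna(arr).any())
    # True: always write nullable without checking; False: never
    return allow_write_nullable_strings
```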

WDYT?

ilan-gold · Dec 16 '25 13:12

If the benchmark says yes, then I’m for it!

The only issue I see is that "string" columns without NAs would then come back as "str" columns without NaNs.

flying-sheep · Dec 16 '25 13:12

I also noticed this while exploring pandas 3 support in spatialdata as it's the cause of 3 out of the 4 remaining failing tests here: https://github.com/scverse/spatialdata/actions/runs/20267039940/job/58192768530?pr=1034.

On disk, the first screenshot shows how a categorical obs was saved with pandas < 3, and the second with the latest pandas. The categories column is now a nullable string array.

pandas < 3

[screenshot: on-disk categorical layout]

pandas 3

[screenshot: on-disk categorical layout, categories written as a nullable string array]

LucaMarconato · Dec 16 '25 13:12

Hi @LucaMarconato, yes, Phil and I discussed this offline. I am making a follow-up PR to disable this now, although our spec doesn’t explicitly disallow it. I would probably be prepared to handle reading both formats (also because of the possibility of non-string categories etc.).

ilan-gold · Dec 16 '25 13:12

Thanks for the info, great! Btw it’s not blocking spatialdata; I avoided the problem by changing (and improving) the failing tests.

LucaMarconato · Dec 16 '25 13:12

Yeah, to be clear: pandas disallows a Categorical’s categories from containing missing values; you’ll see this if you try:

ValueError: Categorical categories cannot be null

Therefore Ilan will make it so that the categories specifically are always written as a non-nullable string array. This doesn’t prevent the categories from being a "string" dtype at runtime.
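For example (values may be missing, categories may not):

```python
import pandas as pd

# Missing values among the *values* are fine (they get code -1) ...
ok = pd.Categorical(["a", None, "b"])

# ... but a missing value among the *categories* is rejected:
pd.Categorical(["a"], categories=["a", None])
# ValueError: Categorical categories cannot be null
```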

flying-sheep · Dec 16 '25 17:12

Right: if people have for some reason already written the data as a nullable string on disk (which would require quite a few knobs to turn), we won’t break reading that data. This is mostly for simplification/clarity.

ilan-gold · Dec 17 '25 10:12