Keep providing a non-nullable string type?
With pandas 3, string arrays will be inferred as one of a dedicated pair of types (pd.arrays.{Arrow,}StringArray) and will no longer behave like numpy object arrays; in particular, they now have clear missing-value / “nullable” behavior, see PDEP 14.
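To illustrate (a minimal example of the new default inference; the exact dtype and array repr depend on the pandas build and whether pyarrow is installed):

```python
import pandas as pd

s = pd.Series(["a", "b", None])
print(s.dtype)  # a pd.StringDtype ("str"), no longer object
print(s.array)  # an {Arrow,}StringArray with an explicit missing-value marker
```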
In the past we added experimental support, behind a flag, for writing {Arrow,}StringArray as “nullable string arrays”; with anndata 0.13, this will become active by default.
However, this means that a “non-nullable” array read from disk becomes a “nullable” one in memory, and subsequent writes will store it in the nullable format.
This raises two questions:
- Is there an in-memory overhead?
- Can we avoid the on-disk overhead? We could either try a different serialization (oof, late!) or have a way for users to indicate that a column doesn’t need nullability (maybe using some string dtype variant).
Since the answer to the first question doesn’t depend on any choice we make, and it informs the second, I tried to find it out below:
Results, TL;DR
- No in-memory overhead, if we can trust the self-reported memory usage.
- `NumpyExtensionArray` with `np.dtypes.StringDType()` is small and non-nullable. We should test that it gets written using the old non-nullable string array serialization, and maybe advise users to convert string columns to it if that boolean mask on disk is too big. Opinions?
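For reference, such a conversion could look roughly like this at the array level (an untested sketch, assuming numpy >= 2.0 for `np.dtypes.StringDType` and a column without missing values; names and data are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["B cell", "T cell", "NK cell"], dtype="string")  # nullable string column

# re-back the data with numpy's native (non-nullable) variable-width string dtype
converted = pd.arrays.NumpyExtensionArray(
    np.asarray(s.to_numpy(dtype=object), dtype=np.dtypes.StringDType())
)
print(converted.dtype)  # pandas wraps the numpy StringDType in a NumpyEADtype
```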
Results, Details
I tried to figure out how much memory the different options use (see below), but didn’t get a convincing result from the profiler alone, I think because the memory just doesn’t get freed immediately.
pd.arrays.NumpyExtensionArray is just a slim wrapper around a numpy array and doesn’t take noticeably more memory. The same currently applies to pd.arrays.StringArray (i.e. pd.StringDtype(storage="python")).
So here are the self-reported stats (using pandas’ memory_usage which also tracks object overhead):
| Array class | dtype | Self-reported memory |
|---|---|---|
| `NumpyExtensionArray` | `"U10"` | 40 MB |
| `NumpyExtensionArray` | `object` | 59 MB |
| `NumpyExtensionArray` | `np.dtypes.StringDType(na_object=None)` | 16 MB |
| `NumpyExtensionArray` | `np.dtypes.StringDType()` | 16 MB |
| `StringArray` | `pd.StringDtype(storage="python")` | 59 MB |
| `ArrowStringArray` | `pd.StringDtype(storage="pyarrow")` | 18 MB |
These are almost exactly what I would expect, but I’d have loved to actually verify that.
My confirmed assumptions (if these are to be trusted):
- Python objects have overhead over dedicated storage (safe assumption, this is why numpy and pandas exist)
- `NumpyExtensionArray` with `dtype=object` is almost exactly as big as `StringArray`, and they’re the biggest (see above)
- `np.dtypes.StringDType`, which has the option to support `na_object`, isn’t more effective when not using that (i.e. nullability is free)
My contradicted assumptions (if these are to be trusted):
- `ArrowStringArray` is smaller than, or has at most constant overhead over, `np.dtypes.StringDType` (it seems to have O(N) overhead instead)
How I tried to measure it
I made a script to profile the memory usage; its output is below.
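A minimal sketch of that script, reconstructed from the output below; `mk_raw`, the `memory_usage` helper, the array size, and the string contents are assumptions, and only one of the profiled variants is shown:

```python
import gc
import sys

import numpy as np
import pandas as pd
from memory_profiler import profile


def mk_raw(dtype="U10", n=1_000_000) -> np.ndarray:
    # n short strings (10 characters each) in a plain numpy array of the given dtype
    return np.array([f"string{i % 1000:04d}" for i in range(n)], dtype=dtype)


def memory_usage(arr) -> int:
    # stand-in for the pandas-based deep accounting used in the original script:
    # backing-buffer size plus, for object-backed storage, the str objects themselves
    values = np.asarray(arr)
    nbytes = values.nbytes
    if values.dtype == object:
        nbytes += sum(sys.getsizeof(x) for x in values)
    return nbytes


@profile
def profile_numpy_str_mem() -> int:
    # wrap a numpy StringDType array in pandas' thin NumpyExtensionArray wrapper
    arr = pd.arrays.NumpyExtensionArray(mk_raw(np.dtypes.StringDType())); gc.collect()
    bytes_ = memory_usage(arr); gc.collect()
    arr.tolist(); gc.collect()  # materialize Python str objects to see their extra cost
    del arr; gc.collect()
    return bytes_


if __name__ == "__main__":
    profile_numpy_str_mem()
```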
script output
profile_numpy_unicode_mem (Self-reported: 40.00 MB):
Profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
52 297.3 MiB 297.3 MiB 1 @profile(backend=backend, stream=file)
53 def profile_numpy_unicode_mem() -> int:
54 338.2 MiB 40.9 MiB 1 arr = pd.arrays.NumpyExtensionArray(mk_raw()); gc.collect()
55 338.2 MiB 0.0 MiB 1 bytes_ = memory_usage(arr); gc.collect()
56 345.3 MiB 7.1 MiB 1 arr.tolist(); gc.collect()
57 307.1 MiB -38.1 MiB 1 del arr; gc.collect()
profile_numpy_obj_mem (Self-reported: 59.00 MB):
Profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
61 297.1 MiB 297.1 MiB 1 @profile(backend=backend, stream=file)
62 def profile_numpy_obj_mem() -> int:
63 366.8 MiB 69.8 MiB 1 arr = pd.arrays.NumpyExtensionArray(mk_raw(object)); gc.collect()
64 366.9 MiB 0.1 MiB 1 bytes_ = memory_usage(arr); gc.collect()
65 374.5 MiB 7.6 MiB 1 arr.tolist(); gc.collect()
66 315.5 MiB -59.1 MiB 1 del arr; gc.collect()
profile_numpy_str_mem_na (Self-reported: 16.00 MB):
Profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
70 297.1 MiB 297.1 MiB 1 @profile(backend=backend, stream=file)
71 def profile_numpy_str_mem_na() -> int:
72 313.3 MiB 16.1 MiB 1 arr = pd.arrays.NumpyExtensionArray(mk_raw(np.dtypes.StringDType(na_object=None))); gc.collect()
73 313.3 MiB 0.0 MiB 1 bytes_ = memory_usage(arr); gc.collect()
74 322.2 MiB 8.9 MiB 1 arr.tolist(); gc.collect()
75 322.2 MiB 0.0 MiB 1 del arr; gc.collect()
profile_numpy_str_mem (Self-reported: 16.00 MB):
Profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
79 297.0 MiB 297.0 MiB 1 @profile(backend=backend, stream=file)
80 def profile_numpy_str_mem() -> int:
81 313.1 MiB 16.1 MiB 1 arr = pd.arrays.NumpyExtensionArray(mk_raw(np.dtypes.StringDType())); gc.collect()
82 313.1 MiB 0.0 MiB 1 bytes_ = memory_usage(arr); gc.collect()
83 322.1 MiB 9.0 MiB 1 arr.tolist(); gc.collect()
84 322.1 MiB 0.0 MiB 1 del arr; gc.collect()
profile_python_mem (Self-reported: 59.00 MB):
Profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
88 297.1 MiB 297.1 MiB 1 @profile(backend=backend, stream=file)
89 def profile_python_mem() -> int:
90 367.0 MiB 69.9 MiB 1 arr = pd.array(mk_raw(), dtype=pd.StringDtype("python")); gc.collect()
91 367.0 MiB 0.0 MiB 1 bytes_ = memory_usage(arr); gc.collect()
92 374.6 MiB 7.7 MiB 1 arr.tolist(); gc.collect()
93 315.6 MiB -59.1 MiB 1 del arr; gc.collect()
profile_arrow_mem (Self-reported: 18.00 MB):
Profiled:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
97 297.2 MiB 297.2 MiB 1 @profile(backend=backend, stream=file)
98 def profile_arrow_mem() -> int:
99 349.2 MiB 52.0 MiB 1 arr = pd.array(mk_raw(), dtype=pd.StringDtype("pyarrow")); gc.collect()
100 349.6 MiB 0.4 MiB 1 bytes_ = memory_usage(arr); gc.collect()
101 356.8 MiB 7.2 MiB 1 arr.tolist(); gc.collect()
102 356.9 MiB 0.1 MiB 1 del arr; gc.collect()
https://github.com/scverse/anndata/blob/532386c38dd342170be6c97556b7a7ab0df49689/src/anndata/_io/specs/methods.py#L1177-L1193 So, thinking about this a bit: what if None (i.e., the default for allow_write_nullable_strings) means we check for null values? I doubt the runtime penalty of that check is worse than writing those extra values to disk (which is the current behavior). And in most cases I would assume that people are not using nullability, so we'd lose nothing, and we could then serialize to the old non-nullable format.
You'd then have to set True to get the "always write nullable as nullable, no check" behavior.
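Roughly what I mean, as a sketch (the writer functions are placeholders, not the actual ones in methods.py, and the False-with-NAs branch is an assumption about keeping today's behavior):

```python
import numpy as np
import pandas as pd


def write_nullable_strings(arr) -> None:
    """Placeholder for the current nullable-string writer (values + mask)."""


def write_plain_strings(values: np.ndarray) -> None:
    """Placeholder for the old non-nullable string-array writer."""


def write_string_column(arr, allow_write_nullable_strings: bool | None = None) -> None:
    has_na = bool(pd.isna(arr).any())
    if allow_write_nullable_strings is True:
        # explicit opt-in: always write nullable as nullable, no check
        write_nullable_strings(arr)
    elif allow_write_nullable_strings is None and has_na:
        # default: only pay for the mask on disk if there actually are missing values
        write_nullable_strings(arr)
    elif allow_write_nullable_strings is False and has_na:
        # presumably keep raising here, as the current False setting does
        raise ValueError("array contains missing values but nullable writing is disabled")
    else:
        # no missing values: fall back to the old non-nullable format
        write_plain_strings(np.asarray(arr, dtype=object))
```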
WDYT?
If the benchmark says yes, then I’m for it!
The only issue I see is that "string" columns without NAs would then come back as "str" columns without NaNs.
I also noticed this while exploring pandas 3 support in spatialdata as it's the cause of 3 out of the 4 remaining failing tests here: https://github.com/scverse/spatialdata/actions/runs/20267039940/job/58192768530?pr=1034.
On disk, the first screenshot shows how a categorical obs was saved with pandas < 3, and the second with the latest pandas. The categories column is now a nullable string array.
[Screenshot: obs saved with pandas < 3]
[Screenshot: obs saved with pandas 3]
Hi @LucaMarconato, yes, Phil and I discussed this offline. I am making a follow-up PR to disable this now, although our spec doesn't explicitly disallow it. I would probably be prepared to handle reading both (also because of the possibility of non-string categories etc.)
Thanks for the info, great! Btw it's not blocking spatialdata, I avoided the problem by changing (and improving) the failing tests.
Yeah, to be clear: pandas doesn’t allow a Categorical’s categories to contain missing values; you’ll see this if you try, e.g.:
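For example, one way to hit that check (constructing the dtype directly):

```python
import pandas as pd

pd.CategoricalDtype(categories=["a", None])  # raises the error below
```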
ValueError: Categorical categories cannot be null
Therefore Ilan will make it so that the categories specifically are always written as a non-nullable string array. This doesn’t prevent the categories from having a "string" dtype at runtime.
Right - if people have for some reason already written the categories as a nullable string array on disk (which would require turning quite a few knobs), we won't break reading that data. This is mostly for simplification/clarity.