anndata Make `strings_to_categoricals` upon `.write()` optional

To illustrate that it'd be a simple change, this PR adds the 3 lines for what @ivirshup suggested here 3 years ago:

I think I'd be fine with a keyword argument for this.

In the two issues below, @grst, @adamgayoso, @lazappi & @ljjh20 expressed that "forced" sanitization is problematic.

https://github.com/scverse/anndata/issues/534
https://github.com/scverse/scanpy/issues/1747

I found these issues only now but also agree that a write function should not mutate the object that's about to be written. @chaichontat made me aware.

Things turn out particularly bad for gene symbols, which should remain strings even for performance reasons and compatibility with JS. Having 30k categories in a 32k dimensional genes vector is detrimental, not beneficial.

Apr 22 '24 11:04 falexwolf

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.18%. Comparing base (6e918a4) to head (7390740). Report is 73 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1474      +/-   ##
==========================================
- Coverage   86.69%   84.18%   -2.51%     
==========================================
  Files          37       37              
  Lines        5937     5939       +2     
==========================================
- Hits         5147     5000     -147     
- Misses        790      939     +149

Files with missing lines	Coverage Δ
src/anndata/_io/h5ad.py	`93.00% <100.00%> (+0.03%)`	:arrow_up:
src/anndata/_io/zarr.py	`83.54% <100.00%> (+0.21%)`	:arrow_up:

... and 8 files with indirect coverage changes

Apr 22 '24 11:04 codecov[bot]

Just chipping in to say that while I think this would be great, it doesn't really solve https://github.com/scverse/scanpy/issues/1747 as scanpy functions will still cause this conversion (unless something has changed). So this helps if you are just reading/writing AnnData but anything that goes through scanpy is still likely to be converted to categoricals.

You probably knew this already but wanted to make it clear for anyone reading this later.

Jul 03 '24 06:07 lazappi

Just chipping in to say that while I think this would be great

Great that you're also onboard, @lazappi!

as scanpy functions will still cause this conversion (unless something has changed)

Yes, you're right about the Scanpy plotting functionality. But I think the issue is more pressing for anndata, given its widespread use outside of Scanpy.

Jul 03 '24 06:07 falexwolf

@flying-sheep I am disabling auto-merge. Isaac brought up a good point about whether this should be in settings on zulip. What do you think? I think it's fine to leave this here because you might have different behavior for different datasets. If you agree we can merge (two against one haha).

Jul 04 '24 08:07 ilan-gold

We agreed on the following scope for settings (I’ll add checks for things that apply here)

[x] special and uncommon use-case (performance etc), only to be done when you know what you’re doing, especially when turning off checks
[ ] things that transitively apply (e.g. when concatenating/slicing) and therefore either need deep API changes or a global setting

I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.

I think IO is localized enough that it doesn’t need a global setting, but I have no strong opinion here.

Jul 04 '24 08:07 flying-sheep

@ivirshup your call then.

Jul 04 '24 09:07 ilan-gold

I am fine leaving this as-is @flying-sheep so will enable auto-merge

Aug 09 '24 13:08 ilan-gold

I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.

I am comfortable with "or" here

Aug 09 '24 13:08 ilan-gold

weird. We actually don’t document anndata.io.write_{zarr,h5ad} yet, and this PR didn’t add the API to anndata.AnnData.write_{zarr,h5ad}. I think we should fix all that.

Mar 17 '25 09:03 flying-sheep

anndata anndata copied to clipboard

Make `strings_to_categoricals` upon `.write()` optional

Codecov Report

anndata
anndata copied to clipboard