anndata icon indicating copy to clipboard operation
anndata copied to clipboard

Make `strings_to_categoricals` upon `.write()` optional

Open falexwolf opened this issue 1 year ago • 6 comments

To illustrate that it'd be a simple change, this PR adds the 3 lines for what @ivirshup suggested here 3 years ago:

I think I'd be fine with a keyword argument for this.

In the two issues below, @grst, @adamgayoso, @lazappi & @ljjh20 expressed that "forced" sanitization is problematic.

  • https://github.com/scverse/anndata/issues/534
  • https://github.com/scverse/scanpy/issues/1747

I found these issues only now but also agree that a write function should not mutate the object that's about to be written. @chaichontat made me aware.

Things turn out particularly bad for gene symbols, which should remain strings even for performance reasons and compatibility with JS. Having 30k categories in a 32k dimensional genes vector is detrimental, not beneficial.

falexwolf avatar Apr 22 '24 11:04 falexwolf

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.18%. Comparing base (6e918a4) to head (7390740). Report is 73 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1474      +/-   ##
==========================================
- Coverage   86.69%   84.18%   -2.51%     
==========================================
  Files          37       37              
  Lines        5937     5939       +2     
==========================================
- Hits         5147     5000     -147     
- Misses        790      939     +149     
Files with missing lines Coverage Δ
src/anndata/_io/h5ad.py 93.00% <100.00%> (+0.03%) :arrow_up:
src/anndata/_io/zarr.py 83.54% <100.00%> (+0.21%) :arrow_up:

... and 8 files with indirect coverage changes

codecov[bot] avatar Apr 22 '24 11:04 codecov[bot]

Just chipping in to say that while I think this would be great, it doesn't really solve https://github.com/scverse/scanpy/issues/1747 as scanpy functions will still cause this conversion (unless something has changed). So this helps if you are just reading/writing AnnData but anything that goes through scanpy is still likely to be converted to categoricals.

You probably knew this already but wanted to make it clear for anyone reading this later.

lazappi avatar Jul 03 '24 06:07 lazappi

Just chipping in to say that while I think this would be great

Great that you're also onboard, @lazappi!

as scanpy functions will still cause this conversion (unless something has changed)

Yes, you're right about the Scanpy plotting functionality. But I think the issue is more pressing for anndata, given its widespread use outside of Scanpy.

falexwolf avatar Jul 03 '24 06:07 falexwolf

@flying-sheep I am disabling auto-merge. Isaac brought up a good point about whether this should be in settings on zulip. What do you think? I think it's fine to leave this here because you might have different behavior for different datasets. If you agree we can merge (two against one haha).

ilan-gold avatar Jul 04 '24 08:07 ilan-gold

We agreed on the following scope for settings (I’ll add checks for things that apply here)

  • [x] special and uncommon use-case (performance etc), only to be done when you know what you’re doing, especially when turning off checks
  • [ ] things that transitively apply (e.g. when concatenating/slicing) and therefore either need deep API changes or a global setting

I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.

I think IO is localized enough that it doesn’t need a global setting, but I have no strong opinion here.

flying-sheep avatar Jul 04 '24 08:07 flying-sheep

@ivirshup your call then.

ilan-gold avatar Jul 04 '24 09:07 ilan-gold

I am fine leaving this as-is @flying-sheep so will enable auto-merge

ilan-gold avatar Aug 09 '24 13:08 ilan-gold

I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.

I am comfortable with "or" here

ilan-gold avatar Aug 09 '24 13:08 ilan-gold

weird. We actually don’t document anndata.io.write_{zarr,h5ad} yet, and this PR didn’t add the API to anndata.AnnData.write_{zarr,h5ad}. I think we should fix all that.

flying-sheep avatar Mar 17 '25 09:03 flying-sheep