anndata
anndata copied to clipboard
Make `strings_to_categoricals` upon `.write()` optional
To illustrate that it'd be a simple change, this PR adds the 3 lines for what @ivirshup suggested here 3 years ago:
I think I'd be fine with a keyword argument for this.
In the two issues below, @grst, @adamgayoso, @lazappi & @ljjh20 expressed that "forced" sanitization is problematic.
- https://github.com/scverse/anndata/issues/534
- https://github.com/scverse/scanpy/issues/1747
I found these issues only now but also agree that a write function should not mutate the object that's about to be written. @chaichontat made me aware.
Things turn out particularly bad for gene symbols, which should remain strings even for performance reasons and compatibility with JS. Having 30k categories in a 32k dimensional genes vector is detrimental, not beneficial.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 84.18%. Comparing base (
6e918a4) to head (7390740). Report is 73 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1474 +/- ##
==========================================
- Coverage 86.69% 84.18% -2.51%
==========================================
Files 37 37
Lines 5937 5939 +2
==========================================
- Hits 5147 5000 -147
- Misses 790 939 +149
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/anndata/_io/h5ad.py | 93.00% <100.00%> (+0.03%) |
:arrow_up: |
| src/anndata/_io/zarr.py | 83.54% <100.00%> (+0.21%) |
:arrow_up: |
Just chipping in to say that while I think this would be great, it doesn't really solve https://github.com/scverse/scanpy/issues/1747 as scanpy functions will still cause this conversion (unless something has changed). So this helps if you are just reading/writing AnnData but anything that goes through scanpy is still likely to be converted to categoricals.
You probably knew this already but wanted to make it clear for anyone reading this later.
Just chipping in to say that while I think this would be great
Great that you're also onboard, @lazappi!
as scanpy functions will still cause this conversion (unless something has changed)
Yes, you're right about the Scanpy plotting functionality. But I think the issue is more pressing for anndata, given its widespread use outside of Scanpy.
@flying-sheep I am disabling auto-merge. Isaac brought up a good point about whether this should be in settings on zulip. What do you think? I think it's fine to leave this here because you might have different behavior for different datasets. If you agree we can merge (two against one haha).
We agreed on the following scope for settings (I’ll add checks for things that apply here)
- [x] special and uncommon use-case (performance etc), only to be done when you know what you’re doing, especially when turning off checks
- [ ] things that transitively apply (e.g. when concatenating/slicing) and therefore either need deep API changes or a global setting
I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.
I think IO is localized enough that it doesn’t need a global setting, but I have no strong opinion here.
@ivirshup your call then.
I am fine leaving this as-is @flying-sheep so will enable auto-merge
I think we didn’t fully agree on if this should be an “and” or an “or” for things to go into settings.
I am comfortable with "or" here
weird. We actually don’t document anndata.io.write_{zarr,h5ad} yet, and this PR didn’t add the API to anndata.AnnData.write_{zarr,h5ad}. I think we should fix all that.