Large number of dataframe columns causes hdf5 write error: Unable to create attribute (object header message is too large)
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of scanpy.
- [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.
Minimal code sample (that we can copy&paste without having any data)
Write any AnnData object with Pearson residuals stored in `.uns`:

```python
ad_all.write(filename='output/10x_h5/ad_all_2cello.h5ad')
```
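For a copy-pasteable reproducer that needs no real data, a sufficiently wide dataframe in `.uns` should hit the same error; the `uns` key name, shapes, and values below are synthetic stand-ins for the real object, not the exact code I ran:

```python
import numpy as np
import pandas as pd
import anndata as ad

# Synthetic stand-in: a dataframe with thousands of columns stored in .uns
# (key name and shapes are illustrative, values are random).
n_obs, n_vars = 100, 5000
adata = ad.AnnData(X=np.random.rand(n_obs, n_vars).astype(np.float32))
adata.uns["pearson_residuals_normalization"] = {
    "theta": 100,
    "clip": None,
    "computed_on": "adata.X",
    "pearson_residuals_df": pd.DataFrame(
        np.random.rand(n_obs, n_vars),
        index=[f"barcode_{i}" for i in range(n_obs)],
        columns=[f"gene_{j}" for j in range(n_vars)],
    ),
}
# Expected to fail with:
# "Unable to create attribute (object header message is too large)"
adata.write("ad_all_2cello.h5ad")
```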
The `pearson_residuals_df` looks like this, with 38291 rows (obs) and 5000 columns (features):
```
{'theta': 100,
 'clip': None,
 'computed_on': 'adata.X',
 'pearson_residuals_df': gene_name             A2M  AADACL2-AS1      AAK1     ABCA1  \
barcode
GAACGTTCACACCGAC-1-placenta_81   -1.125285    -1.159130 -3.921314 -2.533474
TATACCTGTTAGCTAC-1-placenta_81   -1.091364     3.267127 -1.806667 -2.109586
CTCAAGAGTGACTGTT-1-placenta_81   -1.074943    12.272920 -1.948798 -2.735791
TTCATTGTCACGAACT-1-placenta_81   -1.098699    -1.131765  3.481171  4.472371
TATCAGGCAGCTCATA-1-placenta_81   -1.107734    -1.141064 -0.571775 -2.813671
...                                    ...          ...       ...       ...
CACAACATCGGCGATC-1-placenta_314  -0.115585    -0.119107 -0.434686 -0.303945
AGCCAGCGTGCCCAGT-1-placenta_314  -0.097424    -0.100394 -0.366482 -0.256219
CCGGTGAGTGTTCGAT-1-placenta_314  -0.110334    -0.113696 -0.414971 -0.290148
AGGTCATAGCCTGACC-1-placenta_314  -0.115585    -0.119107 -0.434686 -0.303945
TTTATGCCAAAGGGTC-1-placenta_314  -0.112876    -0.116316 -0.424515 -0.296827
```
```
Unable to create attribute (object header message is too large)

Above error raised while writing key 'pearson_residuals_df' of <class 'h5py._hl.group.Group'> to /
```
Versions

```
scanpy==1.9.1 anndata==0.8.0 umap==0.5.2 numpy==1.21.5 scipy==1.8.0 pandas==1.4.1 scikit-learn==1.0.2 statsmodels==0.13.2 python-igraph==0.9.9 pynndescent==0.5.6
```
A similar issue was brought up on the discourse.
An easy way to work around this is to store your data using the zarr format instead of hdf5 (e.g. `anndata.read_zarr`, `anndata.write_zarr`).
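For reference, a minimal sketch of that workaround (the path is just an example from the report above):

```python
import anndata as ad

# zarr does not share hdf5's object-header size limit, so the same object
# should round-trip without the error.
ad_all.write_zarr("output/10x_h5/ad_all_2cello.zarr")
ad_all = ad.read_zarr("output/10x_h5/ad_all_2cello.zarr")
```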
A better solution will take some effort. Here's some prior discussion from h5py: https://github.com/h5py/h5py/issues/1053. The maximum size for metadata on an hdf5 object can be increased using the `H5Pset_attr_phase_change` function in the C API. h5py has wrapped this at the Cython level, but has not exposed it from the main API (https://github.com/h5py/h5py/pull/1638).
I believe we would need to:
- Figure out how to call this / expose this from the h5py Python API (a rough sketch follows this list)
- Figure out how to determine when we need to allow larger objects in the metadata
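To make the first point concrete, here is a rough sketch of what this might look like through the low-level h5py API. The property-list method name `set_attr_phase_change` and its `(max_compact, min_dense)` signature are assumptions based on the C API and the PR linked above; they may not be available in released h5py versions, so treat this as pseudocode rather than a working recipe:

```python
import h5py
from h5py import h5g, h5p

# Assumption: h5py exposes H5Pset_attr_phase_change on the group-creation
# property list (per the linked PR); name/signature mirror the C API.
with h5py.File("example.h5", "w") as f:
    gcpl = h5p.create(h5p.GROUP_CREATE)
    # max_compact=0 forces dense attribute storage, so large attributes
    # (e.g. thousands of column names) no longer have to fit inside the
    # 64 KiB object header.
    gcpl.set_attr_phase_change(0, 0)
    gid = h5g.create(f.id, b"pearson_residuals_df", gcpl=gcpl)
    group = h5py.Group(gid)
    group.attrs["column-order"] = [f"gene_{j}" for j in range(5000)]
```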
I have the same issue. The zarr workaround works.
Being able to store sparse matrices with a certain chunk size would be great, though. I think that's not possible at the moment.
Due to this line
Maybe this comment could help.
@selmanozleyen, this is the h5py issue I was talking about. Do you think you could take a look at this?
For Pearson residuals, would it be feasible to store the `pearson_residuals_df` as a layer and the other parameter values in `uns`?
@brainfo, oh, for sure. If it's a cells x genes dataframe, I think you could just put it into `layers` as a numpy array, then call `adata.to_df(layer="pearson_residuals")` whenever you need the dataframe. I believe this should be zero-copy.
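Roughly, that suggestion could look like the sketch below. The `uns` key name follows the example above and may differ in your object, and it assumes the dataframe's rows and columns are in the same order as `ad_all.obs_names` / `ad_all.var_names` (otherwise subset or reindex first):

```python
# Move the residuals out of .uns and into .layers, then rebuild the
# dataframe on demand with to_df().
norm = ad_all.uns["pearson_residuals_normalization"]
ad_all.layers["pearson_residuals"] = norm.pop("pearson_residuals_df").to_numpy()

# Only scalars are left in .uns now, so writing h5ad should work again.
ad_all.write("output/10x_h5/ad_all_2cello.h5ad")

# Recover the dataframe (index = obs_names, columns = var_names) when needed:
df = ad_all.to_df(layer="pearson_residuals")
```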