
Distributed writing for H5ad format due to h5py objects being unserializable

Open • selmanozleyen opened this issue 1 year ago • 1 comment

Please make sure these conditions are met

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of anndata.
  • [x] (optional) I have confirmed this bug exists on the master branch of anndata.

Report

The following code fails when run against a distributed cluster:

import anndata as ad
import dask.array as da
import dask.distributed as dd

with dd.LocalCluster(n_workers=1,threads_per_worker=1) as cluster:
    with dd.Client(cluster) as client:
        adata = ad.AnnData(da.random.random((100, 100), chunks=(10, 10)))
        adata.write_h5ad("test.h5ad")

The same code used to fail for both zarr and h5ad, but this PR will fix the zarr case: https://github.com/scverse/anndata/pull/1079. For h5ad, the problem that h5py objects cannot be serialized might be overcome by doing whatever xarray does, as mentioned in this issue: https://github.com/pydata/xarray/issues/4242
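To make the idea concrete, here is a minimal sketch of the general "pickle the recipe, not the handle" pattern, which as far as I understand is roughly what xarray's file managers do. Everything here is an assumption for illustration: `ReopeningHandle` and `roundtrip_demo` are hypothetical names, not part of anndata, h5py, or xarray, and the built-in `open` stands in for `h5py.File` so the sketch runs with the standard library only.

```python
import pickle
import tempfile
from pathlib import Path


class ReopeningHandle:
    """Picklable stand-in for an open file handle (hypothetical helper).

    Only the recipe for opening the file (opener + arguments) is pickled.
    The live handle is dropped on pickling and recreated lazily in
    whichever process unpickles the object.
    """

    def __init__(self, opener, *args, **kwargs):
        # In the h5ad case the opener would presumably be h5py.File (assumption).
        self._opener = opener
        self._args = args
        self._kwargs = kwargs
        self._handle = None

    def acquire(self):
        """Return the live handle, opening the file on first use."""
        if self._handle is None:
            self._handle = self._opener(*self._args, **self._kwargs)
        return self._handle

    def close(self):
        """Close the live handle if one is open."""
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        # Never try to pickle the live handle; keep only the recipe.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state


def roundtrip_demo():
    """Pickle a wrapper that holds a live handle and read through the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.txt"
        path.write_text("hello")

        handle = ReopeningHandle(open, path, "r")
        handle.acquire()  # a live, unpicklable file object now exists
        clone = pickle.loads(pickle.dumps(handle))  # only the recipe travels
        try:
            return clone.acquire().read()  # the copy reopens the file itself
        finally:
            handle.close()
            clone.close()
```

This only addresses the pickling error. Writing to the same HDF5 file from several workers would presumably also need some form of locking, which the sketch does not attempt.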

Traceback:

2023-08-25 11:17:10,491 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 1 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f6ae131c700>
 0. 140097021523072
>.
Traceback (most recent call last):
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/distributed/protocol/pickle.py", line 63, in dumps
    result = pickle.dumps(x, **dump_kwargs)
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/h5py/_hl/base.py", line 368, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/distributed/protocol/pickle.py", line 68, in dumps
    pickler.dump(x)
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/distributed/protocol/pickle.py", line 29, in reducer_override
    return deserialize, serialize(obj)
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/distributed/protocol/h5py.py", line 24, in serialize_h5py_dataset
    header, _ = serialize_h5py_file(x.file)
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/distributed/protocol/h5py.py", line 11, in serialize_h5py_file
    raise ValueError("Can only serialize read-only h5py files")
ValueError: Can only serialize read-only h5py files

During handling of the above exception, another exception occurred:
...
    return Pickler.dump(self, obj)
  File "/home/sel/mambaforge/envs/dask/lib/python3.9/site-packages/h5py/_hl/base.py", line 368, in __getnewargs__
    raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled

Versions

-----
anndata             0.10.0.dev198+ga61d5d4
dask                2023.7.1
distributed         2023.7.1
numpy               1.22.4
pandas              2.0.0
scipy               1.9.3
session_info        1.0.0
zarr                2.13.3
-----
PIL                 9.2.0
asciitree           NA
asttokens           NA
attr                23.1.0
awkward             2.1.0
awkward_cpp         NA
backcall            0.2.0
bokeh               2.4.3
cffi                1.15.1
click               8.1.3
cloudpickle         2.2.0
colorama            0.4.6
comm                0.1.1
cython_runtime      NA
cytoolz             0.12.0
...
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 15:55:03) [GCC 10.4.0]
Linux-6.1.44-1-MANJARO-x86_64-with-glibc2.38
-----

Session information updated at 2023-08-25 11:18

selmanozleyen, Aug 25 '23 09:08