Writing dict in uns with many keys is slow
Please make sure these conditions are met
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of anndata.
- [ ] (optional) I have confirmed this bug exists on the master branch of anndata.
Report
Code:
import anndata
import numpy as np
adata = anndata.AnnData()
adata.uns["x"] = {str(i): np.array(str(i), dtype="object") for i in range(20000)}
# %%time
adata.write_h5ad("/tmp/anndata.h5ad")
# %%time
anndata.read_h5ad("/tmp/anndata.h5ad")
On my machine, this takes 7s to write and 4s to load for a dictionary with only 20k elements. How hard would it be to make this (significantly) faster?
Additional context
In scirpy, I use dicts of arrays (each index referring to one of the $n$ cells) to store clonotype clusters. The dictionary is not (necessarily) aligned to one of the axes, so it lives in uns. Now that we have sped up the clonotype clustering steps, saving the object has become a major bottleneck, since this dict can have several hundred thousand keys.
We could probably change the dictionary to something more efficient, but that would mean breaking our data format, so I first wanted to check whether it can be made faster on the anndata side.
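For illustration, a hedged sketch of what such a format change could look like (the names, toy data, and layout here are hypothetical, not scirpy's actual schema): packing the dict of arrays into a few flat arrays means anndata writes a handful of datasets instead of one dataset per key.

import numpy as np

# Hypothetical stand-in for a dict of many small arrays stored in uns.
lookup = {str(i): np.array([f"cell_{i}"], dtype=object) for i in range(20_000)}

# Pack into three flat arrays: keys, concatenated values, and offsets.
# anndata then writes three datasets instead of ~20,000.
keys = np.array(list(lookup), dtype=object)
values = np.concatenate([np.atleast_1d(v) for v in lookup.values()])
offsets = np.cumsum([0] + [np.atleast_1d(v).size for v in lookup.values()])
packed = {"keys": keys, "values": values, "offsets": offsets}

# To recover the array for a key k:
#   i = int(np.where(keys == k)[0][0])
#   arr = values[offsets[i]:offsets[i + 1]]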
CC @felixpetschko
Versions
-----
anndata 0.9.2
numpy 1.24.4
session_info 1.0.0
-----
asciitree NA
asttokens NA
awkward 2.6.4
awkward_cpp NA
backcall 0.2.0
cloudpickle 2.2.1
comm 0.1.4
cython_runtime NA
dask 2023.8.1
dateutil 2.8.2
debugpy 1.6.8
decorator 5.1.1
entrypoints 0.4
executing 1.2.0
fasteners 0.18
fsspec 2023.6.0
h5py 3.9.0
importlib_metadata NA
ipykernel 6.25.0
jedi 0.19.0
...
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
-----
Session information updated at 2024-09-21 14:49
Hmmm @Gregor Sturm, I would suspect the issue is that we recursively write each key's value as its native data type, which means you end up creating thousands of zarr/HDF5 arrays. I'm not really sure we can do much about that at the moment. But with the coming zarr v3 we might, in theory, be able to do this in parallel, which would be a big boost. So I think we should wait for that: https://github.com/scverse/anndata/pull/1726 will be a first step toward just getting things working.
I'm not sure the async/parallel zarr stuff works with v2, but I think it does.
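To make the overhead concrete, here is a small inspection sketch, assuming the file written by the snippet in the report still exists at /tmp/anndata.h5ad (the exact on-disk layout may differ between anndata versions): the dict maps to an HDF5 group, and every key becomes its own tiny dataset, so the cost is dominated by per-dataset metadata rather than data volume.

import h5py

with h5py.File("/tmp/anndata.h5ad", "r") as f:
    grp = f["uns/x"]          # the dict is stored as an HDF5 group
    print(len(grp))           # ~20000 children, one dataset per dict key
    first_key = next(iter(grp))
    print(grp[first_key])     # each value is its own tiny scalar dataset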
Thanks for your response! I think we'll just adapt our data format to be more efficient in that case. Feel free to close.
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!
@grst I recently saw Python thread pools speed up zarr by about 2x, but not HDF5. I think HDF5 is already multithreaded under the hood, but if you want to experiment with that, it could be helpful. I may also take a whack at it. The idea would be to have one thread per element write.
File I/O is not subject to the GIL, so in theory this idea could help somewhat, especially if you're not compressing.
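A minimal sketch of the thread-per-element-write idea, assuming zarr-python 2.x with a local directory store and uncompressed numeric toy data (the store path, worker count, and data are illustrative; this is not anndata's actual writer):

import numpy as np
import zarr
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for the uns dict (smaller than the real use case).
data = {str(i): np.arange(3, dtype="float64") for i in range(5_000)}

root = zarr.open_group("/tmp/parallel_uns.zarr", mode="w")

def write_one(item):
    key, arr = item
    # Each element becomes its own zarr array; file I/O releases the GIL,
    # so independent, uncompressed writes can overlap across threads.
    root.create_dataset(key, data=arr, compressor=None)

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_one, data.items()))  # list() surfaces any exceptions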
I don't think speedups on the order of 2x would get us far. In any case, we have now adopted a workaround that gave us a >100x speedup.
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!