Guidance on compression options with `AnnData.write_zarr`
Question
Hello, thanks for the great support on this package! I was trying to figure out how compression operates for the h5ad format and also for zarr to compare them equivalently (where possible). On the docs for AnnData.write_h5ad I noticed this excerpt:
> Datasets written with hdf5plugin-provided compressors cannot be opened without first loading the hdf5plugin library using `import hdf5plugin`. When using alternative compression filters such as zstd, consider writing to zarr format instead of h5ad, as the zarr library provides a more transparent compression pipeline.
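For context, here is a minimal sketch of the h5ad side of the comparison, assuming the hdf5plugin kwarg pattern described in the `AnnData.write_h5ad` docstring (the exact option names are my reading of those docs, not something confirmed in this thread):

```python
import numpy as np
import anndata as ad
import hdf5plugin  # must be imported before writing or reading hdf5plugin-compressed files

# tiny throwaway object; any AnnData works here
adata = ad.AnnData(X=np.random.rand(100, 50).astype(np.float32))

# zstd compression via hdf5plugin (assumed kwarg pattern; verify against your anndata version)
adata.write_h5ad(
    "compressed.h5ad",
    compression=hdf5plugin.FILTERS["zstd"],
    compression_opts=hdf5plugin.Zstd(clevel=5).filter_options,
)
```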
When I navigated to the docs for `AnnData.write_zarr`, I didn't find any arguments or guidance for controlling compression options in the AnnData Zarr format.
- Would you have any guidance or recommendations for compression options and `AnnData.write_zarr`? Please don't hesitate to point me to a link to learn more if available.
- Could I ask for more clarification on "more transparent compression pipeline" from the `AnnData.write_h5ad` docs? I wasn't sure if this meant we could/should customize compression for Zarr exports outside of AnnData, or if it meant "better" somehow than h5ad export compression performance (or maybe both).
`anndata.write_zarr` accepts a `compression` kwarg that is not properly documented and is then applied identically to every subarray. It looks like `AnnData.write_zarr` does not accept kwargs, so that should be rectified.
> Would you have any guidance or recommendations for compression options and `AnnData.write_zarr`? Please don't hesitate to point me to a link to learn more if available.
In terms of what options you should use, I generally advise Blosc, as it has traditionally been the strongest option (see https://pmc.ncbi.nlm.nih.gov/articles/PMC9900847/ for example) and I have no reason to believe that has changed. Somewhat unfortunately, it is no longer the default in the zarr-python v3 package; see https://anndata.readthedocs.io/en/stable/tutorials/zarr-v3.html
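As a point of reference, here is a minimal sketch of constructing a Blosc codec explicitly, assuming numcodecs for zarr v2-format stores and zarr-python v3's `BloscCodec` for v3-format stores; the argument names come from those libraries' codec constructors and are worth double-checking against your installed versions:

```python
import numcodecs
import zarr.codecs

# zarr v2-format compressor (numcodecs): Blosc wrapping zstd with byte-shuffle
blosc_v2 = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)

# zarr v3-format codec: the rough equivalent from zarr-python v3
blosc_v3 = zarr.codecs.BloscCodec(cname="zstd", clevel=5, shuffle="shuffle")
```

Either object can then be handed to whatever writing path accepts a compressor/codec (e.g. per-element `dataset_kwargs`, sketched further below); which form you need depends on the zarr format version you are writing.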
> Could I ask for more clarification on "more transparent compression pipeline" from the `AnnData.write_h5ad` docs? I wasn't sure if this meant we could/should customize compression for Zarr exports outside of AnnData or if it meant "better" somehow than h5ad export compression performance (or maybe both).
I was not the one who wrote this, but presumably it refers to what I wrote at the outset of this comment: you can only pass one compressor for everything (or nothing). I am not sure what you mean by "customize compression for Zarr exports", but something I hinted at in your PR over in that other package is that if you want control over the full IO "pipeline", you might be better off going element by element, i.e., `read_elem`, `read_elem_lazy`, `write_elem`, `read_dispatched`, etc., where you could pass different options to different elements.
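A rough, partial sketch of that element-by-element idea (not a full AnnData on-disk layout), assuming `write_elem` from `anndata.experimental` (it lives in `anndata.io` in newer releases) and that `dataset_kwargs` forwards a zarr v2-style `compressor` key; for zarr v3-format stores the relevant key would differ (e.g. a `compressors` entry taking the codec shown above), so treat the kwarg names as assumptions to verify:

```python
import numpy as np
import numcodecs
import zarr
import anndata as ad
from anndata.experimental import write_elem  # anndata.io.write_elem in newer releases

adata = ad.AnnData(X=np.random.rand(100, 50).astype(np.float32))
adata.obsm["X_pca"] = np.random.rand(100, 10)

store = zarr.open_group("per_element.zarr", mode="w")

# Heavier compression for the big main matrix...
write_elem(
    store, "X", adata.X,
    dataset_kwargs={"compressor": numcodecs.Blosc(cname="zstd", clevel=9)},
)
# ...and lighter, faster settings for small auxiliary elements.
write_elem(
    store, "obsm", dict(adata.obsm),
    dataset_kwargs={"compressor": numcodecs.Blosc(cname="lz4", clevel=3)},
)
```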
If you're interested in zarr, I would also recommend using sharding for v3. I would like to turn it on for the next release, but need to think about the API. If you want to chat and/or collaborate on that, or on better IO pipelining, I would love to talk on our Zulip! https://scverse.zulipchat.com/
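For anyone curious what sharding looks like at the zarr-python v3 level (outside of anndata, since per the above it is not yet exposed there), a minimal sketch; the `zarr.create_array` signature is zarr-python v3's and worth verifying against its docs:

```python
import numpy as np
import zarr

# One shard (one object/file in the store) bundles many small chunks,
# keeping file counts low while preserving fine-grained reads.
arr = zarr.create_array(
    store="sharded.zarr",
    shape=(100_000, 2_000),
    chunks=(1_000, 1_000),   # read/decompression granularity
    shards=(10_000, 2_000),  # storage granularity: 20 chunks per shard
    dtype="float32",
)
arr[:1_000, :] = np.random.rand(1_000, 2_000).astype("float32")
```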
Thanks for these details @ilan-gold! I feel this addresses the direct questions I had.