[Feature Request] Compression for variable length string data

Open ivirshup opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

Variable-length strings take up a lot of space on disk, even when compression is enabled.

Previous discussion:

  • h5py/h5py#2162
  • scverse/anndata#830
  • scverse/anndata#822

vlen strings lead to larger files

import h5py
import numpy as np

# One million strings drawn from a nine-word vocabulary (highly compressible).
strings = np.array("the quick brown fox jumps over the lazy dog".split(), dtype=object)
data = np.random.choice(strings, 1_000_000)

# Variable-length UTF-8 strings, gzip-compressed.
with h5py.File("vlen_strings.h5", "w") as f:
    str_dtype = h5py.string_dtype("utf-8")
    f.create_dataset("string-array", data=data.astype(str_dtype), dtype=str_dtype, compression="gzip")

# Fixed-length UTF-8 strings sized to the longest entry, gzip-compressed.
with h5py.File("fixed_len_strings.h5", "w") as f:
    max_len = int(np.vectorize(len)(data).max())
    str_dtype = h5py.string_dtype("utf-8", max_len)
    f.create_dataset("string-array", data=data.astype(str_dtype), dtype=str_dtype, compression="gzip")

!du -hs *
712K	fixed_len_strings.h5
 25M	vlen_strings.h5
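
For anyone reproducing this outside IPython, where `!du` is not available, here is a minimal sketch of the same size check using only the standard library. The helper names `size_report` and `human` are made up for illustration. Note that `os.path.getsize` reports the apparent file size rather than allocated disk blocks, so the numbers may differ slightly from `du`.

```python
import os


def size_report(paths):
    """Return {path: size_in_bytes} for the paths that exist."""
    return {p: os.path.getsize(p) for p in paths if os.path.exists(p)}


def human(n_bytes):
    """Format a byte count with binary units, similar in spirit to `du -h`."""
    size = float(n_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


# Prints nothing if the files from the snippet above have not been written yet.
for path, n in size_report(["fixed_len_strings.h5", "vlen_strings.h5"]).items():
    print(f"{human(n):>8}  {path}")
```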

Describe the solution you'd like

I would like these strings to take up less space.

My understanding is that variable-length strings carry a large per-string overhead of 32 bytes, which does not seem to get compressed. This is roughly consistent with the sizes of the files generated above.
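
As a rough sanity check on that reasoning, the file sizes reported above can be turned into a per-string cost. This is only back-of-the-envelope arithmetic and assumes that the `K` and `M` printed by `du -hs` are binary units (KiB and MiB) and that the whole file size is attributable to the one dataset.

```python
# Sizes as printed by `du -hs` above; K and M are assumed to be binary units.
n_strings = 1_000_000
vlen_bytes = 25 * 1024**2   # vlen_strings.h5
fixed_bytes = 712 * 1024    # fixed_len_strings.h5

vlen_per_string = vlen_bytes / n_strings
fixed_per_string = fixed_bytes / n_strings

print(f"vlen:  {vlen_per_string:.1f} bytes per string")
print(f"fixed: {fixed_per_string:.2f} bytes per string")
print(f"ratio: {vlen_bytes / fixed_bytes:.0f}x")
```

The variable-length file works out to about 26 bytes per string, while the fixed-length file needs less than one byte per string, so the variable-length cost looks more like a fixed per-element overhead than like compressed text.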

My naive assumption would be that wherever this overhead is being stored could also be compressed.

Describe alternatives you've considered

Using fixed length instead:

  • This would be a change to our library's on-disk format and would need careful consideration. It could easily break out-of-core appending of new tables.
  • Some of our target clients, such as HDF5.jl, do not seem to support fixed-length Unicode types.

Different storage format

In the issue for h5py, it was suggested we use a different storage format. HDF5 is particularly useful to us because so many languages can read it. We do also use zarr, but it would be nice to limit the downsides of hdf5.

Some other change to hdf5's variable length encoding

Some other solution (mentioned more below) could also be good. I would imagine adding compression to the current format would be an easier lift though.

Additional context

The HDF5 docs do have an RFC about the inefficiencies of variable-length data types, but there doesn't seem to have been any progress towards implementation.

ivirshup, Oct 12 '22 18:10