
SingleHdf5ToZarr error/warning with scalar utf-8/ascii dataset

rly opened this issue on Nov 02 '23 · 4 comments

When I run SingleHdf5ToZarr on an HDF5 file containing a scalar dataset with a variable-length UTF-8 string dtype or a variable-length ASCII bytes dtype, I get the following warning that an error was caught and quashed:

/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/kerchunk/hdf.py:497: UserWarning: The following excepion was caught and quashed while traversing HDF5
'str' object has no attribute 'extend'
Traceback (most recent call last):
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/kerchunk/hdf.py", line 438, in _translator
    za = self._zroot.create_dataset(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/hierarchy.py", line 1094, in create_dataset
    return self._write_op(self._create_dataset_nosync, name, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/hierarchy.py", line 935, in _write_op
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/hierarchy.py", line 1110, in _create_dataset_nosync
    a = array(data, store=self._store, path=path, chunk_store=self._chunk_store, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/creation.py", line 439, in array
    z[...] = data
    ~^^^^^
AttributeError: 'str' object has no attribute 'extend'

  warnings.warn(msg)

When I pass error="raise" to SingleHdf5ToZarr, I see the source of the error in numcodecs/json.py:

...
File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/creation.py:439, in array(data, **kwargs)
    436 z = create(**kwargs)
    438 # fill with data
--> 439 z[...] = data
    441 # set read_only property afterwards
    442 z.read_only = read_only

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:1497, in Array.__setitem__(self, selection, value)
   1495     self.set_orthogonal_selection(pure_selection, value, fields=fields)
   1496 else:
-> 1497     self.set_basic_selection(pure_selection, value, fields=fields)

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:1591, in Array.set_basic_selection(self, selection, value, fields)
   1589 # handle zero-dimensional arrays
   1590 if self._shape == ():
-> 1591     return self._set_basic_selection_zd(selection, value, fields=fields)
   1592 else:
   1593     return self._set_basic_selection_nd(selection, value, fields=fields)

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:1974, in Array._set_basic_selection_zd(self, selection, value, fields)
   1971         pass
   1972 else:
   1973     # encode and store
-> 1974     cdata = self._encode_chunk(chunk)
   1975     self.chunk_store[ckey] = cdata

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:2436, in Array._encode_chunk(self, chunk)
   2434 if self._filters:
   2435     for f in self._filters:
-> 2436         chunk = f.encode(chunk)
   2438 # check object encoding
   2439 if ensure_ndarray_like(chunk).dtype == object:

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/numcodecs/json.py:59, in JSON.encode(self, buf)
     57 buf = np.asarray(buf)
     58 items = buf.tolist()
---> 59 items.extend((buf.dtype.str, buf.shape))
     60 return self._encoder.encode(items).encode(self._text_encoding)

AttributeError: 'str' object has no attribute 'extend'

Here, buf starts as a string, so np.asarray(buf) produces a zero-dimensional array, and tolist() on a zero-dimensional array returns the scalar itself: items ends up being a plain str, not a list. JSON.encode assumes buf is array-like and that items supports extend.
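A minimal reproduction of that broken assumption, outside kerchunk and zarr entirely:

import numpy as np

buf = np.asarray("test")  # zero-dimensional array, dtype '<U4', shape ()
items = buf.tolist()      # tolist() on a 0-d array returns the scalar: 'test'
print(type(items))        # <class 'str'>
items.extend((buf.dtype.str, buf.shape))  # AttributeError: 'str' object has no attribute 'extend'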

Example code to generate the HDF5 files:

import h5py

H5_STR = h5py.string_dtype("utf-8")
H5_BYTES = h5py.string_dtype("ascii")

with h5py.File("test_str.h5", "w") as f:
    f.create_dataset("data", data="test", shape=None, dtype=H5_STR)

with h5py.File("test_bytes.h5", "w") as f:
    f.create_dataset("data", data=b"test", shape=None, dtype=H5_BYTES)

Example code to generate the kerchunk reference JSON:

from kerchunk.hdf import SingleHdf5ToZarr
import fsspec
import ujson

fs_read = fsspec.filesystem('')  # local file system to read from
fs_write = fsspec.filesystem('')  # local file system to save final jsons to

def gen_json_from_local(local_file_path, final_remote_url, outf):
    with fs_read.open(local_file_path, 'rb') as infile:
        h5chunks = SingleHdf5ToZarr(infile, final_remote_url, inline_threshold=300, error="raise")

        with fs_write.open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())

local_file_path = "/Users/rly/Documents/NWB/kerchunk-playground/test_str.h5"
final_remote_url = "s3://..."
outf = "test_str.json"  # file name to save json to
gen_json_from_local(local_file_path, final_remote_url, outf)

local_file_path = "/Users/rly/Documents/NWB/kerchunk-playground/test_bytes.h5"
final_remote_url = "s3://..."
outf = "test_bytes.json"  # file name to save json to
gen_json_from_local(local_file_path, final_remote_url, outf)

rly · Nov 02 '23

I think this may be fixed in numcodecs, so please check exactly which version you have. Note that string/varchar encoding in HDF5 has a few (not great) options in kerchunk: vlen_encode=["embed", "null", "leave", "encode"]; see the docstring of SingleHdf5ToZarr.
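For illustration, a sketch of passing one of those options (assuming the local test_str.h5 file generated by the example code above):

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

with fsspec.open("test_str.h5", "rb") as f:
    # "embed" serializes the variable-length string values into the references directly
    h5chunks = SingleHdf5ToZarr(f, "test_str.h5", vlen_encode="embed")
    refs = h5chunks.translate()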

martindurant · Nov 02 '23

Thanks for the quick response! I tried this with both numcodecs==0.11.0 (the maximum pinned by kerchunk) and the latest numcodecs==0.12.1; sorry, I forgot to mention that in my post. Also, I'm on a Mac M1 with Python 3.11, in case that's relevant.

Exploring further, I think the issue is with Zarr itself, when the JSON object codec is used for scalar variable-length strings:

import numcodecs
import zarr

z = zarr.array(data="test", dtype=str, object_codec=numcodecs.JSON())
# AttributeError: 'str' object has no attribute 'extend'

I am a novice at Zarr, so I don't know if the following makes sense:

My understanding is that zarr.array(data="test", dtype=str) is a shortcut for dtype=object, object_codec=numcodecs.VLenUTF8(). Would it make sense to use the VLenUTF8 (or the VLenBytes) codec instead of the JSON codec for variable length strings? Or at least include that as an option?
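That is (a sketch of the equivalence, assuming zarr v2):

import numcodecs
import zarr

# dtype=str is shorthand for an object array filtered through VLenUTF8
z1 = zarr.array(data=["test"], dtype=str)
z2 = zarr.array(data=["test"], dtype=object, object_codec=numcodecs.VLenUTF8())
print(z1.filters, z2.filters)  # both show [VLenUTF8()]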

I'm not sure whether this should apply only to scalars or whether it could work for any dataset shape. I tried modifying hdf.py to use the VLenUTF8 codec and ran some tests below.

For scalar datasets with the VLenUTF8 codec, the output is:

{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\":2}",
    "data/.zarray": "{\"chunks\":[],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"id\":\"vlen-utf8\"}],\"order\":\"C\",\"shape\":[],\"zarr_format\":2}",
    "data/0": "\u0001\u0000\u0000\u0000\u0004\u0000\u0000\u0000test",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\":[]}"
  }
}

This decodes correctly with Zarr:

z = zarr.open("reference://", storage_options={"fo": "test_str.json"})
z["data"][()]
# 'test'

For non-scalar datasets, I used the VLenUTF8 codec and compared it with the output of the JSON codec.

For a non-scalar dataset ["test", "more test"] with the JSON codec, the output is:

{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\":2}",
    "data/.zarray": "{\"chunks\":[2],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"allow_nan\":true,\"check_circular\":true,\"encoding\":\"utf-8\",\"ensure_ascii\":true,\"id\":\"json2\",\"indent\":null,\"separators\":[\",\",\":\"],\"skipkeys\":false,\"sort_keys\":true,\"strict\":true}],\"order\":\"C\",\"shape\":[2],\"zarr_format\":2}",
    "data/0": "[\"test\",\"more test\",\"|O\",[2]]",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"]}"
  }
}

For the same ["test", "more test"] dataset with the VLenUTF8 codec, the output is:

{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\":2}",
    "data/.zarray": "{\"chunks\":[2],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"id\":\"vlen-utf8\"}],\"order\":\"C\",\"shape\":[2],\"zarr_format\":2}",
    "data/0": "\u0002\u0000\u0000\u0000\u0004\u0000\u0000\u0000test\t\u0000\u0000\u0000more test",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"]}"
  }
}

Both decode correctly with Zarr (shown here for the JSON-codec file; the VLenUTF8 file reads back the same way):

z = zarr.open("reference://", storage_options={"fo": "test_multi_str_json_codec.json"})
z["data"][:]
# array(['test', 'more test'], dtype=object)

rly · Nov 02 '23

Ah, I see my PR in numcodecs is still languishing: https://github.com/zarr-developers/numcodecs/pull/366. You could ping it...

The VLenUTF8 codec could be OK, but it ends up more verbose, especially since JSON is supposed to be ASCII, so the codec's binary output would have to be base64-encoded in most cases. That said, this is transparent to the user and maybe not important for short arrays or single values.
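To illustrate that overhead (a sketch; the "base64:" prefix is the convention fsspec's reference filesystem uses for inline binary chunks):

import base64

# the vlen-utf8 chunk for scalar "test": 4-byte item count, then (length, bytes) pairs
raw = b"\x01\x00\x00\x00\x04\x00\x00\x00test"
inlined = "base64:" + base64.b64encode(raw).decode()
print(inlined)  # base64:AQAAAAQAAAB0ZXN0 -- 12 raw bytes become 16 base64 characters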

martindurant · Nov 02 '23

Thanks! I had not found that PR. That would do it. I pinged it.

Yeah, the VLenUTF8 codec is more verbose for non-scalars. For scalars, interestingly enough, it looks like you actually save some bytes using VLenUTF8, because the JSON codec's filter description in .zarray is itself fairly verbose.
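For reference, a rough comparison of the filter metadata alone, using the two .zarray entries shown above:

import json

vlen_filter = [{"id": "vlen-utf8"}]
json_filter = [{"allow_nan": True, "check_circular": True, "encoding": "utf-8",
                "ensure_ascii": True, "id": "json2", "indent": None,
                "separators": [",", ":"], "skipkeys": False,
                "sort_keys": True, "strict": True}]

# the JSON codec's filter description is roughly ten times longer
print(len(json.dumps(vlen_filter)), len(json.dumps(json_filter)))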

rly · Nov 02 '23