SingleHdf5ToZarr error/warning with scalar utf-8/ascii dataset
When I try to run SingleHdf5ToZarr on an HDF5 file containing a scalar HDF5 dataset with a variable-length UTF-8 string dtype or a variable-length ASCII bytes dtype, I get the following warning that an exception was caught and quashed:
/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/kerchunk/hdf.py:497: UserWarning: The following excepion was caught and quashed while traversing HDF5
'str' object has no attribute 'extend'
Traceback (most recent call last):
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/kerchunk/hdf.py", line 438, in _translator
    za = self._zroot.create_dataset(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/hierarchy.py", line 1094, in create_dataset
    return self._write_op(self._create_dataset_nosync, name, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/hierarchy.py", line 935, in _write_op
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/hierarchy.py", line 1110, in _create_dataset_nosync
    a = array(data, store=self._store, path=path, chunk_store=self._chunk_store, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/creation.py", line 439, in array
    z[...] = data
    ~^^^^^
AttributeError: 'str' object has no attribute 'extend'
  warnings.warn(msg)
When I pass error="raise" to SingleHdf5ToZarr, I see the source of the error in numcodecs/json.py:
...
File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/creation.py:439, in array(data, **kwargs)
    436 z = create(**kwargs)
    438 # fill with data
--> 439 z[...] = data
    441 # set read_only property afterwards
    442 z.read_only = read_only

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:1497, in Array.__setitem__(self, selection, value)
   1495     self.set_orthogonal_selection(pure_selection, value, fields=fields)
   1496 else:
-> 1497     self.set_basic_selection(pure_selection, value, fields=fields)

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:1591, in Array.set_basic_selection(self, selection, value, fields)
   1589 # handle zero-dimensional arrays
   1590 if self._shape == ():
-> 1591     return self._set_basic_selection_zd(selection, value, fields=fields)
   1592 else:
   1593     return self._set_basic_selection_nd(selection, value, fields=fields)

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:1974, in Array._set_basic_selection_zd(self, selection, value, fields)
   1971     pass
   1972 else:
   1973     # encode and store
-> 1974     cdata = self._encode_chunk(chunk)
   1975     self.chunk_store[ckey] = cdata

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/zarr/core.py:2436, in Array._encode_chunk(self, chunk)
   2434 if self._filters:
   2435     for f in self._filters:
-> 2436         chunk = f.encode(chunk)
   2438 # check object encoding
   2439 if ensure_ndarray_like(chunk).dtype == object:

File ~/mambaforge/envs/kerchunk/lib/python3.11/site-packages/numcodecs/json.py:59, in JSON.encode(self, buf)
     57 buf = np.asarray(buf)
     58 items = buf.tolist()
---> 59 items.extend((buf.dtype.str, buf.shape))
     60 return self._encoder.encode(items).encode(self._text_encoding)

AttributeError: 'str' object has no attribute 'extend'
Here, buf starts as a string, so np.asarray(buf) produces a zero-dimensional array, and buf.tolist() on a 0-d array returns the scalar string itself rather than a list; the subsequent items.extend(...) call then fails. It seems like JSON.encode assumes buf is array-like, i.e. that tolist() returns a list.
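The zero-dimensional case can be reproduced with NumPy alone (a minimal sketch; the behavior of tolist() on 0-d arrays is standard NumPy):

import numpy as np

buf = np.asarray("test")  # zero-dimensional array, dtype '<U4'
items = buf.tolist()      # for a 0-d array, tolist() returns the Python scalar itself
print(type(items))        # <class 'str'> -- not a list, so items.extend(...) raises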
Example code to generate the HDF5 files:
import h5py

H5_STR = h5py.string_dtype("utf-8")
H5_BYTES = h5py.string_dtype("ascii")

with h5py.File("test_str.h5", "w") as f:
    f.create_dataset("data", data="test", shape=None, dtype=H5_STR)

with h5py.File("test_bytes.h5", "w") as f:
    f.create_dataset("data", data=b"test", shape=None, dtype=H5_BYTES)
Example code to generate the kerchunk reference JSON:
from kerchunk.hdf import SingleHdf5ToZarr
import fsspec
import ujson

fs_read = fsspec.filesystem('')   # local file system to read from
fs_write = fsspec.filesystem('')  # local file system to save final jsons to

def gen_json_from_local(local_file_path, final_remote_url, outf):
    with fs_read.open(local_file_path, 'rb') as infile:
        h5chunks = SingleHdf5ToZarr(infile, final_remote_url, inline_threshold=300, error="raise")
        with fs_write.open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())

local_file_path = "/Users/rly/Documents/NWB/kerchunk-playground/test_str.h5"
final_remote_url = "s3://..."
outf = "test_str.json"  # file name to save json to
gen_json_from_local(local_file_path, final_remote_url, outf)

local_file_path = "/Users/rly/Documents/NWB/kerchunk-playground/test_bytes.h5"
final_remote_url = "s3://..."
outf = "test_bytes.json"  # file name to save json to
gen_json_from_local(local_file_path, final_remote_url, outf)
I think this may be fixed in numcodecs, so please check exactly which version you have. Note that string/varchar encoding in HDF5 has a couple of (not great) options in kerchunk, vlen_encode=["embed", "null", "leave", "encode"]; see the docstring of SingleHdf5ToZarr.
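For reference, vlen_encode is a keyword argument of the converter; a minimal sketch reusing the test file from above (the target URL is a placeholder):

from kerchunk.hdf import SingleHdf5ToZarr

with open("test_str.h5", "rb") as infile:
    # vlen_encode is one of "embed", "null", "leave", "encode";
    # the docstring of SingleHdf5ToZarr describes each option
    h5chunks = SingleHdf5ToZarr(infile, "s3://...", vlen_encode="embed")
    refs = h5chunks.translate()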
Thanks for the quick response! I tried this with both numcodecs==0.11.0 (the maximum version pinned by kerchunk) and the latest numcodecs==0.12.1; sorry, I forgot to mention that in my post. Also, I'm on a Mac M1 with Python 3.11, in case that's relevant.
Exploring further, I think the issue is with Zarr and using the JSON object codec for scalar variable-length string arrays:

import zarr
import numcodecs

z = zarr.array(data="test", dtype=str, object_codec=numcodecs.JSON())
# AttributeError: 'str' object has no attribute 'extend'
I am a novice at Zarr, so I don't know if the following makes sense:
My understanding is that zarr.array(data="test", dtype=str) is a shortcut for dtype=object, object_codec=numcodecs.VLenUTF8(). Would it make sense to use the VLenUTF8 (or the VLenBytes) codec instead of the JSON codec for variable length strings? Or at least include that as an option?
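For what it's worth, swapping the codec at the Zarr level does handle the scalar case that fails with JSON above (a quick check, not a full fix in kerchunk):

import zarr
import numcodecs

# dtype=str is zarr's shortcut for dtype=object plus VLenUTF8; spelled out explicitly:
z = zarr.array(data="test", dtype=object, object_codec=numcodecs.VLenUTF8())
print(z[()])  # 'test' -- no AttributeError, unlike the JSON codec example above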
I'm not sure whether this should apply only to scalars or whether it could work for any dataset shape. I tried modifying hdf.py to use the VLenUTF8 codec and ran some tests below.
For scalar datasets with the VLenUTF8 codec, the output is:
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\":2}",
    "data/.zarray": "{\"chunks\":[],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"id\":\"vlen-utf8\"}],\"order\":\"C\",\"shape\":[],\"zarr_format\":2}",
    "data/0": "\u0001\u0000\u0000\u0000\u0004\u0000\u0000\u0000test",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\":[]}"
  }
}
This decodes correctly with Zarr:
z = zarr.open("reference://", storage_options={"fo": "test_str.json"})
z["data"][()]
# 'test'
For non-scalar datasets, I compared the output of the VLenUTF8 codec with that of the JSON codec.
For a non-scalar dataset ["test", "more test"] with the JSON codec, the output is:
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\":2}",
    "data/.zarray": "{\"chunks\":[2],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"allow_nan\":true,\"check_circular\":true,\"encoding\":\"utf-8\",\"ensure_ascii\":true,\"id\":\"json2\",\"indent\":null,\"separators\":[\",\",\":\"],\"skipkeys\":false,\"sort_keys\":true,\"strict\":true}],\"order\":\"C\",\"shape\":[2],\"zarr_format\":2}",
    "data/0": "[\"test\",\"more test\",\"|O\",[2]]",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"]}"
  }
}
For the same dataset with the VLenUTF8 codec, the output is:
{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\":2}",
    "data/.zarray": "{\"chunks\":[2],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"id\":\"vlen-utf8\"}],\"order\":\"C\",\"shape\":[2],\"zarr_format\":2}",
    "data/0": "\u0002\u0000\u0000\u0000\u0004\u0000\u0000\u0000test\t\u0000\u0000\u0000more test",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"phony_dim_0\"]}"
  }
}
Both decode correctly with Zarr (shown here for the JSON codec output):
z = zarr.open("reference://", storage_options={"fo": "test_multi_str_json_codec.json"})
z["data"][:]
# array(['test', 'more test'], dtype=object)
Ah, I see my PR in numcodecs is still languishing: https://github.com/zarr-developers/numcodecs/pull/366. You could ping it...
The VLenUTF8 codec could be OK, but it seems to end up more verbose, especially given that JSON is supposed to be ASCII, so the binary output would need to be base64-encoded in most cases; that would be transparent to the user, though, and maybe not important for short arrays or single values.
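To put rough numbers on that trade-off, here is a sketch comparing the two chunk encodings from the outputs above (it ignores the .zarray metadata, and assumes the binary vlen-utf8 chunk would be base64-encoded to stay ASCII-safe):

import base64

# vlen-utf8 chunk for ["test", "more test"]:
# 4-byte item count + (4-byte length + payload) per item = 4 + 8 + 13 = 25 bytes
raw = b"\x02\x00\x00\x00\x04\x00\x00\x00test\x09\x00\x00\x00more test"
b64 = base64.b64encode(raw)  # the ASCII-safe form a reference file would need
json_chunk = '["test","more test","|O",[2]]'  # the json2 codec's chunk for the same data
print(len(raw), len(b64), len(json_chunk))  # 25 36 29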
Thanks! I had not found that PR; that would do it. I pinged it.
Yeah, the VLenUTF8 codec is more verbose for non-scalars. For scalars, interestingly enough, you actually save some bytes using VLenUTF8, because the JSON codec's filter description in .zarray is a bit verbose.