silx [silx.io.dictdump] h5todict does not keep original types

Hi, I currently use dicttoh5 and h5todict to easily store and restore dictionaries to/from disk. Unfortunately, I find it very annoying that h5todict always returns numpy.ndarray types, no matter what the original saved type was. My expected behavior is that h5todict should try to restore the same type of data, at least for simple strings. Here is a minimal working example:

import tempfile
import numpy as np
from silx.io.dictdump import dicttoh5, h5todict
a = "my string"
b = np.arange(10)
ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
dout = dict(a=a, b=b)
dicttoh5(dout, ftmp)
print(f"data saved to disk ({ftmp})")
print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
dback = h5todict(ftmp) 
print(f"data read back from disk")
print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")

with output

data saved to disk (/tmp/testba3k_ofy.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'numpy.ndarray'>,
b: <class 'numpy.ndarray'>

If I read the HDF5 file directly with silx viewer, the string type is read correctly.

Is there a way to improve the behavior of h5todict? or should I prefer JSON over HDF5 dumping?

Jul 29 '22 06:07 maurov

Is there a way to improve the behavior of h5todict? or should I prefer JSON over HDF5 dumping?

It seems there is a systematic call to this function in dictdump.py

def _prepare_hdf5_write_value(array_like):
    """Cast a python object into a numpy array in a HDF5 friendly format.
    :param array_like: Input dataset in a type that can be digested by
        ``numpy.array()`` (`str`, `list`, `numpy.ndarray`…)
    :return: ``numpy.ndarray`` ready to be written as an HDF5 dataset
    """
    array = numpy.asarray(array_like)
    if numpy.issubdtype(array.dtype, numpy.bytes_):
        return numpy.array(array_like, dtype=vlen_bytes)
    elif numpy.issubdtype(array.dtype, numpy.str_):
        return numpy.array(array_like, dtype=vlen_utf8)
    else:
        return array
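
To illustrate why every value ends up as an array on the write side, here is a minimal sketch using plain NumPy only (no silx or h5py involved; the variable names are illustrative, not taken from dictdump.py). It shows that numpy.asarray wraps even a plain Python string into a zero-dimensional array:

```python
import numpy as np

# A plain Python string becomes a 0-d array with a unicode dtype.
wrapped = np.asarray("my string")
print(type(wrapped), wrapped.shape, wrapped.dtype.kind)

# Indexing with an empty tuple unwraps the single element again.
unwrapped = wrapped[()]
print(type(unwrapped))
```

So the original str type is already lost at this step, before anything is written to the file.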

Jul 29 '22 08:07 vasole

@vasole the data are stored correctly into the HDF5:

import h5py
ftmp = "/tmp/testba3k_ofy.h5"
with h5py.File(ftmp) as h5:
    for k, v in h5.items():
        print(f"{k}: {type(v[()])}")

which gives:

a: <class 'bytes'>
b: <class 'numpy.ndarray'>

It is the reading back that is wrong, in my opinion.

Jul 29 '22 09:07 maurov

h5py no longer allows reading strings transparently. I guess that is why you get this behaviour.

But this could be patched in h5todict. It has a cost, but this API was not designed for efficiency anyway.

Jul 29 '22 09:07 vallsv

h5py no longer allows reading strings transparently. I guess that is why you get this behavior.

The way h5py reads strings is fine with me, and I expect h5todict to act the same. I do not see the point of converting everything to numpy.ndarray. In fact, getting a bytes object for a string is fine and very simple to decode, whereas with a numpy.ndarray it is difficult to know whether it should be converted to a string or not.
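
As a small sketch of the difference (plain Python and NumPy only, no file involved; the variable names are just for illustration): decoding bytes is a single call, while a string wrapped in an array has to be recognised by inspecting its shape and dtype first.

```python
import numpy as np

# What h5py hands back for a scalar string dataset: plain bytes.
raw = b"my string"
text = raw.decode("utf-8")  # one call restores a Python str

# The same string wrapped in a 0-d array, as an array-based reader would return it.
wrapped = np.array("my string")

# To tell it apart from numeric data, both the shape and the dtype kind
# ("U" for unicode, "S" for bytes) have to be checked.
is_scalar_text = wrapped.ndim == 0 and wrapped.dtype.kind in ("U", "S")
print(text, is_scalar_text)
```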

Could h5todict return the same types as h5py does?

Jul 29 '22 09:07 maurov

Here is how you are supposed to write and read strings with h5py. Note that I don't know why the type is |O; it suggests there is a problem somewhere.

In [30]: import h5py
    ...: import numpy
    ...: with h5py.File("foo", "w") as f:
    ...:     utf8_type = h5py.string_dtype('utf-8')
    ...:     f["aaa"] = numpy.array(u"bbb", dtype=utf8_type)
    ...:     print(f["aaa"])
    ...:     print(f["aaa"].asstr()[()])  # python string
    ...:     print(f["aaa"][()])  # raw data
    ...: 
<HDF5 dataset "aaa": shape (), type "|O">
bbb
b'bbb'

Jul 29 '22 12:07 vallsv

So you get the same API as with h5py. Note that you have to use [()], same as with h5py.

In [46]: import tempfile
    ...: import numpy as np
    ...: from silx.io.dictdump import dicttoh5, h5todict
    ...: a = "my string"
    ...: b = np.arange(10)
    ...: ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
    ...: dout = dict(a=a, b=b)
    ...: dicttoh5(dout, ftmp)
    ...: print(f"data saved to disk ({ftmp})")
    ...: print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
    ...: dback = h5todict(ftmp)
    ...: print(f"data read back from disk")
    ...: print(f"a: {dback['a'][()]} ({type(dback['a'][()])})")

data saved to disk (/tmp/testk05oc16v.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: my string (<class 'numpy.str_'>)

Jul 29 '22 12:07 vallsv

@vallsv thanks for the tip! I did not know that using [()] would return a numpy.str_ type. I am using the following workaround now:

import tempfile
import numpy as np
from silx.io.dictdump import dicttoh5, h5todict
a = "my string"
b = np.arange(10)
ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
dout = dict(a=a, b=b)
dicttoh5(dout, ftmp)
print(f"data saved to disk ({ftmp})")
print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
dback = h5todict(ftmp)
for k, v in dback.items():
    if isinstance(v[()], np.str_):  # np.str was removed in NumPy 1.24
        dback[k] = np.array_str(v)
print(f"data read back from disk")
print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")

and it works as expected:

data saved to disk (/tmp/tests0o__swl.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'str'>,
b: <class 'numpy.ndarray'>

This trick solves my problem. Please feel free to close the issue if no action is required on h5todict.

Jul 29 '22 14:07 maurov

@vasole @vallsv My final solution to correctly restore str and float when loading a dict with h5todict is:

import copy
import numpy as np
def _restore_from_array(dictin):
    """restore str/float from a nested dictionary of numpy.ndarray"""
    for k, v in dictin.items():
        if isinstance(v, dict):
            _restore_from_array(v)
        else:
            if isinstance(v[()], np.str_):  # np.str was removed in NumPy 1.24
                dictin[k] = np.array_str(v)
            if isinstance(v[()], np.floating):  # np.float was removed in NumPy 1.24
                dictin[k] = copy.deepcopy(v.item())

I think it would be very beneficial for every user to include something like this (probably coded more efficiently) directly in h5todict.
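
For what it is worth, a variant of the same idea could look like the sketch below. It is only an illustration, not silx code: the function name restore_scalars is made up, it returns a new dictionary instead of modifying the input, and it assumes that every zero-dimensional array should be unwrapped, so it also converts integers and booleans, not only str and float.

```python
import numpy as np


def restore_scalars(data):
    """Return a copy of a nested dict with 0-d arrays unwrapped to Python scalars."""
    restored = {}
    for key, value in data.items():
        if isinstance(value, dict):
            # Recurse into sub-dictionaries (HDF5 groups).
            restored[key] = restore_scalars(value)
        elif isinstance(value, np.ndarray) and value.ndim == 0:
            # item() yields the matching Python object (str, bytes, float, int...).
            item = value.item()
            restored[key] = item.decode("utf-8") if isinstance(item, bytes) else item
        else:
            # Real arrays (ndim >= 1) and anything else are kept untouched.
            restored[key] = value
    return restored


sample = {
    "a": np.array("my string"),
    "b": np.arange(10),
    "sub": {"x": np.array(1.5), "y": np.array(b"raw")},
}
result = restore_scalars(sample)
print({k: type(v) for k, v in result.items()})
```

Checking ndim instead of the element type avoids calling v[()] on large arrays and keeps the logic independent of the NumPy scalar type names.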

Jul 30 '22 05:07 maurov

h5todict has an asarray argument (True by default) and I think what you are after here is to use asarray=False: https://github.com/silx-kit/silx/blob/a2cc33e0331fe409bc35a697f17395d074fd517c/src/silx/io/dictdump.py#L569-L570

It was added in PR #2692. asarray is True by default to keep the behavior compatible with previous versions.

By using h5todict(..., asarray=False), you get:

In [35]: import tempfile 
    ...: import numpy as np 
    ...: from silx.io.dictdump import dicttoh5, h5todict 
    ...: a = "my string" 
    ...: b = np.arange(10) 
    ...: ftmp = tempfile.mktemp(prefix='test', suffix='.h5') 
    ...: dout = dict(a=a, b=b) 
    ...: dicttoh5(dout, ftmp) 
    ...: print(f"data saved to disk ({ftmp})") 
    ...: print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}") 
    ...: dback = h5todict(ftmp, asarray=False)  
    ...: print(f"data read back from disk") 
    ...: print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")
data saved to disk (/tmp/testqwetp5th.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'str'>,
b: <class 'numpy.ndarray'>

Sep 12 '22 11:09 t20100

@t20100 thank you very much for your help. Yes, using asarray=False indeed solves this issue.

Sep 12 '22 12:09 maurov
