
[silx.io.dictdump] h5todict does not keep original types

Open maurov opened this issue 2 years ago • 8 comments

Hi, I currently use dicttoh5 and h5todict to easily store and restore dictionaries to/from disk. Unfortunately, I find it very annoying that h5todict always returns numpy.ndarray types, no matter what the original saved type is. My expected behavior would be that h5todict tries to restore the same type of data, at least for simple strings. Here is a minimal working example:

import tempfile
import numpy as np
from silx.io.dictdump import dicttoh5, h5todict
a = "my string"
b = np.arange(10)
ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
dout = dict(a=a, b=b)
dicttoh5(dout, ftmp)
print(f"data saved to disk ({ftmp})")
print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
dback = h5todict(ftmp) 
print(f"data read back from disk")
print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")

with output

data saved to disk (/tmp/testba3k_ofy.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'numpy.ndarray'>,
b: <class 'numpy.ndarray'>

If I read the HDF5 file directly with silx viewer, the string type is read correctly.

Is there a way to improve the behavior of h5todict? or should I prefer JSON over HDF5 dumping?

maurov avatar Jul 29 '22 06:07 maurov

Is there a way to improve the behavior of h5todict? or should I prefer JSON over HDF5 dumping?

It seems there is a systematic call to this function in dictdump.py:

def _prepare_hdf5_write_value(array_like):
    """Cast a python object into a numpy array in a HDF5 friendly format.
    :param array_like: Input dataset in a type that can be digested by
        ``numpy.array()`` (`str`, `list`, `numpy.ndarray`…)
    :return: ``numpy.ndarray`` ready to be written as an HDF5 dataset
    """
    array = numpy.asarray(array_like)
    if numpy.issubdtype(array.dtype, numpy.bytes_):
        return numpy.array(array_like, dtype=vlen_bytes)
    elif numpy.issubdtype(array.dtype, numpy.str_):
        return numpy.array(array_like, dtype=vlen_utf8)
    else:
        return array
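
As a rough illustration of which branch of that function a given input would take, here is a NumPy-only sketch. It does not call silx or h5py; it only assumes that numpy.asarray and numpy.issubdtype behave as in the function above.

```python
import numpy as np

# numpy.asarray wraps plain Python objects in 0-dimensional arrays.
text_arr = np.asarray("my string")
bytes_arr = np.asarray(b"my string")
float_arr = np.asarray(1.5)

# The dtype decides which branch of the function would be taken.
text_is_str = np.issubdtype(text_arr.dtype, np.str_)        # unicode branch
bytes_is_bytes = np.issubdtype(bytes_arr.dtype, np.bytes_)  # bytes branch
float_is_text = (np.issubdtype(float_arr.dtype, np.str_)
                 or np.issubdtype(float_arr.dtype, np.bytes_))  # neither

print(text_arr.ndim, text_arr.dtype.kind, text_is_str)
print(bytes_arr.ndim, bytes_arr.dtype.kind, bytes_is_bytes)
print(float_arr.ndim, float_arr.dtype.kind, float_is_text)
```

In every case the result is an array, so the original Python type is no longer visible from the container alone.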

vasole avatar Jul 29 '22 08:07 vasole

@vasole the data are stored correctly into the HDF5:

import h5py
ftmp = "/tmp/testba3k_ofy.h5"
with h5py.File(ftmp) as h5:
    for k, v in h5.items():
        print(f"{k}: {type(v[()])}")

which gives:

a: <class 'bytes'>
b: <class 'numpy.ndarray'>

It is the reading back that is wrong, in my opinion.

maurov avatar Jul 29 '22 09:07 maurov

h5py no longer allows reading strings transparently. I guess that is why you have this behaviour.

But this could be patched by h5todict. It's a cost, but this API was not designed for efficiency anyway.

vallsv avatar Jul 29 '22 09:07 vallsv

h5py no longer allows reading strings transparently. I guess that is why you have this behaviour.

The way h5py reads strings is fine to me and I expect h5todict to act the same. I do not see the point of converting everything to numpy.ndarray. In fact, having a bytes type for a string is fine and very simple to decode, whereas with a numpy.ndarray it is difficult to know whether it should be converted to a string or not.

Could h5todict return the same types as h5py does?
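
To make the "very simple to decode" point concrete, here is a minimal sketch of the decoding step. The helper name to_text is hypothetical and not part of silx or h5py, and UTF-8 is assumed as the encoding.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes to str, pass anything else through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


decoded = to_text(b"my string")  # bytes in, str out
untouched = to_text(3.0)         # non-bytes values are returned as-is
print(type(decoded), type(untouched))
```

With bytes the check is unambiguous, which is the advantage over receiving an ndarray for every value.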

maurov avatar Jul 29 '22 09:07 maurov

Here is how you are supposed to write and read strings with h5py. Note that I don't know why the type is shown as |O; it suggests there is a problem somewhere.

In [30]: import h5py
    ...: import numpy
    ...: with h5py.File("foo", "w") as f:
    ...:     utf8_type = h5py.string_dtype('utf-8')
    ...:     f["aaa"] = numpy.array(u"bbb", dtype=utf8_type)
    ...:     print(f["aaa"])
    ...:     print(f["aaa"].asstr()[()])  # python string
    ...:     print(f["aaa"][()])  # raw data
    ...: 
<HDF5 dataset "aaa": shape (), type "|O">
bbb
b'bbb'

vallsv avatar Jul 29 '22 12:07 vallsv

So you get the same behavior as with the h5py API: you have to index with [()], just as with h5py.

In [46]: import tempfile
    ...: import numpy as np
    ...: from silx.io.dictdump import dicttoh5, h5todict
    ...: a = "my string"
    ...: b = np.arange(10)
    ...: ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
    ...: dout = dict(a=a, b=b)
    ...: dicttoh5(dout, ftmp)
    ...: print(f"data saved to disk ({ftmp})")
    ...: print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
    ...: dback = h5todict(ftmp)
    ...: print(f"data read back from disk")
    ...: print(f"a: {dback['a'][()]} ({type(dback['a'][()])})")

data saved to disk (/tmp/testk05oc16v.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: my string (<class 'numpy.str_'>)
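
For reference, the effect of [()] can be reproduced with NumPy alone. This is only a sketch of the indexing behavior on in-memory arrays; it assumes the values returned by h5todict are ordinary numpy.ndarray objects, as the output above suggests.

```python
import numpy as np

scalar_like = np.array("my string")  # 0-d array, like dback['a']
vector = np.arange(10)               # 1-d array, like dback['b']

text = scalar_like[()]           # 0-d array -> numpy scalar (numpy.str_)
same = vector[()]                # n-d array -> still an ndarray
as_builtin = scalar_like.item()  # 0-d array -> builtin Python str

print(type(text), type(same), type(as_builtin))
```

Since numpy.str_ subclasses the builtin str, the value can be used as a normal string, and .item() is available when an exact builtin type is wanted.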

vallsv avatar Jul 29 '22 12:07 vallsv

@vallsv thanks for the tip! I did not know that using [()] would return a numpy.str_ type. I am using the following workaround now:

import tempfile
import numpy as np
from silx.io.dictdump import dicttoh5, h5todict
a = "my string"
b = np.arange(10)
ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
dout = dict(a=a, b=b)
dicttoh5(dout, ftmp)
print(f"data saved to disk ({ftmp})")
print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
dback = h5todict(ftmp)
for k, v in dback.items():
    # np.str was only an alias of the builtin str and was removed in NumPy 1.24
    if isinstance(v[()], str):
        dback[k] = str(v[()])  # plain Python str
print(f"data read back from disk")
print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")

and it works as expected:

data saved to disk (/tmp/tests0o__swl.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'str'>,
b: <class 'numpy.ndarray'>

This trick solves my problem. Please feel free to close the issue if no action is required on h5todict.

maurov avatar Jul 29 '22 14:07 maurov

@vasole @vallsv My final solution to correctly restore str and float when loading a dict with h5todict is:

import numpy as np
def _restore_from_array(dictin):
    """restore str/float from a nested dictionary of numpy.ndarray"""
    for k, v in dictin.items():
        if isinstance(v, dict):
            _restore_from_array(v)
        else:
            # np.str and np.float were aliases of the builtins, removed in NumPy 1.24
            if isinstance(v[()], str):
                dictin[k] = str(v[()])
            elif isinstance(v[()], float):
                dictin[k] = v.item()  # plain Python float

I think it would be very beneficial for every user to include something like this (probably coded more efficiently) directly in h5todict.
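
One possible generalization, as a sketch only: treat every 0-dimensional array as a wrapped scalar and unwrap it with .item(), whatever its dtype. The helper name unwrap_scalars is hypothetical, and the input below is a hand-built stand-in for what h5todict returns rather than data read from a file.

```python
import numpy as np


def unwrap_scalars(data):
    """Hypothetical helper: copy a nested dict, unwrapping 0-d arrays."""
    result = {}
    for key, value in data.items():
        if isinstance(value, dict):
            result[key] = unwrap_scalars(value)  # recurse into groups
        elif isinstance(value, np.ndarray) and value.ndim == 0:
            result[key] = value.item()  # builtin str, float, int, ...
        else:
            result[key] = value  # real arrays are kept unchanged
    return result


# Stand-in for a dictionary loaded with the default asarray=True.
loaded = {
    "a": np.array("my string"),
    "b": np.arange(10),
    "group": {"x": np.array(1.5)},
}
restored = unwrap_scalars(loaded)
print({k: type(v) for k, v in restored.items()})
```

Returning a new dictionary instead of modifying the input keeps the loaded data intact, at the cost of one extra shallow copy per group.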

maurov avatar Jul 30 '22 05:07 maurov

h5todict has an asarray argument (True by default) and I think what you are after here is to use asarray=False: https://github.com/silx-kit/silx/blob/a2cc33e0331fe409bc35a697f17395d074fd517c/src/silx/io/dictdump.py#L569-L570

It was added in PR #2692. asarray is True by default to keep the behavior compatible with previous versions.

By using h5todict(..., asarray=False), you get:

In [35]: import tempfile 
    ...: import numpy as np 
    ...: from silx.io.dictdump import dicttoh5, h5todict 
    ...: a = "my string" 
    ...: b = np.arange(10) 
    ...: ftmp = tempfile.mktemp(prefix='test', suffix='.h5') 
    ...: dout = dict(a=a, b=b) 
    ...: dicttoh5(dout, ftmp) 
    ...: print(f"data saved to disk ({ftmp})") 
    ...: print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}") 
    ...: dback = h5todict(ftmp, asarray=False)  
    ...: print(f"data read back from disk") 
    ...: print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")

data saved to disk (/tmp/testqwetp5th.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'str'>,
b: <class 'numpy.ndarray'>

t20100 avatar Sep 12 '22 11:09 t20100

@t20100 thank you very much for your help. Yes, using asarray=False indeed solves this issue.

maurov avatar Sep 12 '22 12:09 maurov