silx
[silx.io.dictdump] h5todict does not keep original types
Hi, I currently use dicttoh5 and h5todict to easily store and restore dictionaries to/from disk. Unfortunately, I find it very annoying that h5todict always returns numpy.ndarray types, no matter what the original saved type was. My expected behavior would be that h5todict tries to restore the same type of data, at least for simple strings. Here is a minimal working example:
import tempfile
import numpy as np
from silx.io.dictdump import dicttoh5, h5todict
a = "my string"
b = np.arange(10)
ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
dout = dict(a=a, b=b)
dicttoh5(dout, ftmp)
print(f"data saved to disk ({ftmp})")
print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
dback = h5todict(ftmp)
print(f"data read back from disk")
print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")
with output:
data saved to disk (/tmp/testba3k_ofy.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'numpy.ndarray'>,
b: <class 'numpy.ndarray'>
If I read the HDF5 file directly with silx viewer, the string type is read correctly.
Is there a way to improve the behavior of h5todict? Or should I prefer JSON over HDF5 dumping?
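For comparison with the JSON alternative raised above, here is a minimal sketch of a JSON round trip using only the standard library. The dictionary contents are illustrative, and the array is written as a plain list because the json module does not serialize numpy.ndarray directly.

```python
import json

# Illustrative data: a string and a list standing in for the array.
original = {"a": "my string", "b": list(range(10))}

# Round trip through a JSON string.
restored = json.loads(json.dumps(original))

# The string comes back as a builtin str, the sequence as a builtin list.
print(type(restored["a"]), type(restored["b"]))
```

With JSON the string type survives, but an array would come back as a list and would need an explicit conversion back to numpy.ndarray.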
It seems there is a systematic call to this function in dictdump.py:
def _prepare_hdf5_write_value(array_like):
    """Cast a python object into a numpy array in a HDF5 friendly format.

    :param array_like: Input dataset in a type that can be digested by
        ``numpy.array()`` (`str`, `list`, `numpy.ndarray`…)
    :return: ``numpy.ndarray`` ready to be written as an HDF5 dataset
    """
    array = numpy.asarray(array_like)
    if numpy.issubdtype(array.dtype, numpy.bytes_):
        return numpy.array(array_like, dtype=vlen_bytes)
    elif numpy.issubdtype(array.dtype, numpy.str_):
        return numpy.array(array_like, dtype=vlen_utf8)
    else:
        return array
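To illustrate what the numpy.asarray call in the function above does to a plain string, here is a small sketch that uses only NumPy (no silx or h5py). The variable names are illustrative.

```python
import numpy as np

# A plain Python string becomes a zero-dimensional array with a unicode dtype.
wrapped = np.asarray("my string")
print(type(wrapped), wrapped.shape, wrapped.dtype.kind)

# Indexing with an empty tuple extracts the scalar, a numpy.str_,
# which is a subclass of the builtin str.
scalar = wrapped[()]
print(type(scalar), isinstance(scalar, str))

# .item() goes one step further and returns the builtin type.
native = wrapped.item()
print(type(native))
```

This is why a value that went in as a str can come back wrapped in a numpy.ndarray even though the content is unchanged.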
@vasole the data are stored correctly in the HDF5 file:
import h5py
ftmp = "/tmp/testba3k_ofy.h5"
with h5py.File(ftmp, "r") as h5:
    for k, v in h5.items():
        # v[()] reads the whole dataset into memory
        print(f"{k}: {type(v[()])}")
which gives:
a: <class 'bytes'>
b: <class 'numpy.ndarray'>
It is the reading back that is wrong, in my opinion.
h5py no longer allows reading strings transparently.
I guess that is why you see this behaviour.
But this could be patched in h5todict. It has a cost, but this API was not designed for efficiency anyway.
The way h5py reads strings is fine with me, and I would expect h5todict to act the same. I do not see the point of converting everything to numpy.ndarray. In fact, having a bytes type for a string is fine and very simple to decode, whereas with a numpy.ndarray it is difficult to know whether the value should be converted to a string or not.
Could h5todict return the same types as h5py does?
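As a sketch of why a bytes value is simple to handle, the helper below decodes bytes and leaves every other type untouched. The name as_text and the UTF-8 default are assumptions made for illustration; they are not part of h5py or silx.

```python
def as_text(value, encoding="utf-8"):
    """Return a str for bytes input and leave any other value untouched.

    Hypothetical helper; the encoding default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(as_text(b"my string"))  # bytes are decoded to str
print(as_text(3.5))           # other types pass through unchanged
```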
Here is how you are supposed to write and read strings with h5py. Note that I don't know why the type is |O; it suggests there is a problem somewhere.
In [30]: import h5py
    ...: import numpy
    ...: with h5py.File("foo", "w") as f:
    ...:     utf8_type = h5py.string_dtype('utf-8')
    ...:     f["aaa"] = numpy.array(u"bbb", dtype=utf8_type)
    ...:     print(f["aaa"])
    ...:     print(f["aaa"].asstr()[()])  # python string
    ...:     print(f["aaa"][()])  # raw data
    ...:
<HDF5 dataset "aaa": shape (), type "|O">
bbb
b'bbb'
So you get the same API as h5py. Note that you have to use [()], the same as with h5py.
In [46]: import tempfile
...: import numpy as np
...: from silx.io.dictdump import dicttoh5, h5todict
...: a = "my string"
...: b = np.arange(10)
...: ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
...: dout = dict(a=a, b=b)
...: dicttoh5(dout, ftmp)
...: print(f"data saved to disk ({ftmp})")
...: print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
...: dback = h5todict(ftmp)
...: print(f"data read back from disk")
...: print(f"a: {dback['a'][()]} ({type(dback['a'][()])})")
data saved to disk (/tmp/testk05oc16v.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: my string (<class 'numpy.str_'>)
@vallsv thanks for the tip! I did not know that using [()] would return a numpy.str_ type. I am using the following workaround now:
import tempfile
import numpy as np
from silx.io.dictdump import dicttoh5, h5todict
a = "my string"
b = np.arange(10)
ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
dout = dict(a=a, b=b)
dicttoh5(dout, ftmp)
print(f"data saved to disk ({ftmp})")
print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
dback = h5todict(ftmp)
for k, v in dback.items():
    # np.str was only an alias of the builtin str and was removed in NumPy 1.24
    if isinstance(v[()], str):
        dback[k] = np.array_str(v)
print(f"data read back from disk")
print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")
and it works as expected:
data saved to disk (/tmp/tests0o__swl.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'str'>,
b: <class 'numpy.ndarray'>
This trick solves my problem. Please feel free to close the issue if no action is required on h5todict.
@vasole @vallsv My final solution to correctly restore str and float when loading a dict with h5todict is:
import copy
import numpy as np
def _restore_from_array(dictin):
    """restore str/float from a nested dictionary of numpy.ndarray"""
    for k, v in dictin.items():
        if isinstance(v, dict):
            _restore_from_array(v)
        else:
            # np.str and np.float were aliases of the builtins (removed in NumPy 1.24)
            if isinstance(v[()], str):
                dictin[k] = np.array_str(v)
            elif isinstance(v[()], float):
                dictin[k] = copy.deepcopy(v.item())
I think it would be very beneficial for every user to include something like this (probably coded more efficiently) directly in h5todict.
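One way such a conversion could be generalized is sketched below. It assumes the input is a nested dict whose leaves are numpy.ndarray values and treats every zero-dimensional array as a scalar to unwrap. The name restore_scalars and the UTF-8 decoding of bytes are assumptions; this is not how silx implements anything.

```python
import numpy as np


def restore_scalars(data):
    """Return a copy of a nested dict with 0-d arrays replaced by Python scalars.

    Hypothetical helper: arrays with one or more dimensions are kept as-is.
    """
    restored = {}
    for key, value in data.items():
        if isinstance(value, dict):
            # Recurse into sub-dictionaries (HDF5 groups).
            restored[key] = restore_scalars(value)
        elif isinstance(value, np.ndarray) and value.ndim == 0:
            # .item() converts a 0-d array into the matching builtin type.
            item = value.item()
            if isinstance(item, bytes):
                item = item.decode("utf-8")  # assumption: strings are UTF-8
            restored[key] = item
        else:
            restored[key] = value
    return restored


sample = {
    "a": np.array("my string"),
    "b": np.arange(10),
    "nested": {"c": np.array(1.5)},
}
result = restore_scalars(sample)
print({k: type(v) for k, v in result.items()})
```

Returning a new dictionary instead of modifying the input in place avoids changing a mapping while iterating over it, at the cost of one extra shallow structure.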
h5todict has an asarray argument (True by default), and I think what you are after here is asarray=False:
https://github.com/silx-kit/silx/blob/a2cc33e0331fe409bc35a697f17395d074fd517c/src/silx/io/dictdump.py#L569-L570
It was added in PR #2692. asarray is True by default to keep the behavior compatible with previous versions.
By using h5todict(..., asarray=False), you get:
In [35]: import tempfile
...: import numpy as np
...: from silx.io.dictdump import dicttoh5, h5todict
...: a = "my string"
...: b = np.arange(10)
...: ftmp = tempfile.mktemp(prefix='test', suffix='.h5')
...: dout = dict(a=a, b=b)
...: dicttoh5(dout, ftmp)
...: print(f"data saved to disk ({ftmp})")
...: print(f"a: {type(dout['a'])},\nb: {type(dout['b'])}")
...: dback = h5todict(ftmp, asarray=False)
...: print(f"data read back from disk")
...: print(f"a: {type(dback['a'])},\nb: {type(dback['b'])}")
data saved to disk (/tmp/testqwetp5th.h5)
a: <class 'str'>,
b: <class 'numpy.ndarray'>
data read back from disk
a: <class 'str'>,
b: <class 'numpy.ndarray'>
@t20100 thank you very much for your help. Indeed, using asarray=False solves this issue.