zarr-python
can't have numpy datatypes in attributes
We are working on the zarr backend for XArray (pydata/xarray#1528). XArray likes to put all kinds of weird stuff into attributes, including numpy datatypes and even numpy arrays. This is because the netCDF data model allows attributes to have all of the same types as variables.
Instead, in zarr, the attributes have to be json-serializable. So this doesn't work:
za = zarr.create(shape=(1), store='tmp_file')
za.attrs['foo'] = np.float32(0)
It raises TypeError: Object of type 'float32' is not JSON serializable.
We will need some sort of workaround for this in order to make zarr work as a store for xarray.
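For reference, a minimal user-side workaround (illustrative only, not a zarr feature) is to cast NumPy scalars to native Python types before assigning them, e.g. with `.item()`:

import numpy as np
import zarr

za = zarr.create(shape=(1,), store='tmp_file')
# .item() converts a NumPy scalar to the equivalent native Python type,
# which the standard json module can serialize
za.attrs['foo'] = np.float32(0).item()

This puts the conversion burden on the caller, though, which is exactly what a proper fix should avoid.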
Thanks for raising, following the xarray work with interest. Are there any object types other than numpy dtype that would need special handling when going to/from JSON?
Is this still of interest, @rabernat?
I am very interested in this issue. I need to store exact binary values and datetime objects as attributes. To work around the limitations of JSON, I currently encode these attributes as strings and put the burden on the consumer of the data to correctly decode them to the actual data types. This is not ideal. Ideally, any data type that is valid for an array ought to be valid for an attribute (like it is in the netCDF model).
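For concreteness, the kind of string-based workaround I mean looks roughly like this (the `time_dtype` key is just my own convention, not anything zarr defines):

import numpy as np

# write: store the value and its dtype as two plain strings
value = np.datetime64('2017-10-08T05:24:00')
attrs = {'time': str(value), 'time_dtype': str(value.dtype)}  # e.g. 'datetime64[s]'

# read: the consumer has to know to decode them again
decoded = np.array(attrs['time'], dtype=attrs['time_dtype'])[()]
assert decoded == value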
This issue seems to be related to https://github.com/zarr-developers/zarr/issues/244 and https://github.com/zarr-developers/zarr/issues/216
One approach that might address both issues is to allow `.zarray` and `.zattrs` to use a binary serialization format (e.g. using `numcodecs.MsgPack`), the same way that arbitrary variable-length array elements can be encoded.
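As a rough illustration of why a binary format would help, the msgpack package (which, as far as I understand, backs `numcodecs.MsgPack`) can round-trip values that standard JSON cannot, such as NaN and raw bytes:

import math
import msgpack  # third-party msgpack package

packed = msgpack.packb([float('nan'), b'\x00\x01\x02'])
fill_value, raw = msgpack.unpackb(packed)
assert math.isnan(fill_value)
assert raw == b'\x00\x01\x02'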
Ideally, any data type that is valid for an array ought to be valid for an attribute (like it is in the netCDF model)
Could you please elaborate on this point a bit? What sorts of things are you imagining storing here?
I would like to store as attributes any of the data types described in the "Data Type Encoding" section of the Zarr specification.
Specifically, in my real-world usage, I have encountered inconvenience with attribute values that are
- `datetime64` and `timedelta`
- Floating-point numbers that I need to represent with exact precision (e.g. `f8` versus `f4`), which JSON doesn't distinguish
  - A special problem is `NaN`, which has an exact representation as a Zarr/NumPy floating-point value but cannot be represented by JSON
- Structs like `[('R','u1'), ('G','u1'), ('B','u1'), ('A','u1')]`
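For reference, this is how the standard library's `json` module behaves with a few of these values (a quick illustration, not zarr code):

import json
import numpy as np

for value in (
    np.datetime64('2001-01-01'),                    # datetime64
    np.float32(1.5),                                # f4 scalar
    np.zeros(1, dtype=[('R', 'u1'), ('G', 'u1'),
                       ('B', 'u1'), ('A', 'u1')]),  # struct dtype
):
    try:
        json.dumps(value)
    except TypeError as exc:
        print(exc)  # Object of type ... is not JSON serializable

print(json.dumps(float('nan')))  # prints 'NaN', which Python accepts but is not valid JSON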
I am also excited by the possibility of storing attributes that are arbitrary objects, such as JSON documents, although I haven't expressly encountered this requirement yet.
It is worth noting that, in NetCDF, attribute values are really 1-dimensional arrays:
An attribute has an associated variable (the null "global variable" for a global or group-level attribute), a name, a data type, a length, and a value. The current version treats all attributes as vectors; scalar values are treated as single-element vectors.
Sorry for the very long delay.
It is worth noting that, in NetCDF, attribute values are really 1-dimensional arrays...
This is a really great point. Though this raises the question, would the best way to represent this data be an array with attributes that are array values or would it be a group with many arrays?
I'm using xarray/zarr and find the attribute handling constraining as well. I would like to suggest:
https://json-tricks.readthedocs.io/en/latest/
It uses the same api as json and solves many of the common use cases.
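For illustration, the round trip with json-tricks looks roughly like this (assuming the `json_tricks` package is installed):

from json_tricks import dumps, loads
import numpy as np

# json-tricks wraps ndarrays in a tagged JSON object and restores them on load
text = dumps({'channels': np.array([1, 2, 3, 4], dtype='i8')})
roundtrip = loads(text)
print(roundtrip['channels'], roundtrip['channels'].dtype)  # [1 2 3 4] int64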
Thanks @jewfro-cuban, I didn't know about json-tricks, looks nice. The encoding format seems generally very sensible, although I guess we'd want to avoid supporting arbitrary class instances as a potential security issue.
Is there a way we could just depend on json-tricks, but with `__instance_type__` disabled?
Is there a way we could just depend on json-tricks, but with `__instance_type__` disabled?
We could always check if that shows up in the result and error out if so.
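Something like this crude sketch, perhaps (string matching on the serialized output is obviously rough, but it shows the idea):

from json_tricks import dumps

def safe_dumps(obj):
    # refuse anything json-tricks encoded as an arbitrary class instance
    text = dumps(obj)
    if '"__instance_type__"' in text:
        raise TypeError('arbitrary class instances are not allowed in attributes')
    return text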
I would like to store as attributes any of the data types described in the "Data Type Encoding" section of the Zarr specification.
Specifically, in my real-world usage, I have encountered inconvenience with attribute values that are
* `datetime64` and `timedelta`
I encountered the same problem, and I would like to add that for me it would be enough if I could pass a custom `JSONDecoder` to zarr. It just needs to be offered as an argument to `open_group` etc. (see https://github.com/zarr-developers/zarr-python/pull/533/files).
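To sketch what I have in mind (the `__npvalue__`/`__dtype__` tagging convention here is hypothetical, not what the PR implements):

import json
import numpy as np

def numpy_object_hook(d):
    # turn tagged dicts back into NumPy scalars on load
    if '__npvalue__' in d and '__dtype__' in d:
        return np.array(d['__npvalue__'], dtype=d['__dtype__'])[()]
    return d

decoder = json.JSONDecoder(object_hook=numpy_object_hook)
decoder.decode('{"time": {"__npvalue__": "2001-01-01", "__dtype__": "datetime64[D]"}}')
# {'time': numpy.datetime64('2001-01-01')}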
I was recently hit by this very same problem, with reference to HDF5 files, which also allow for array attributes.
For example, from `h5dump` I have
ATTRIBUTE "data channels" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
DATA {
(0): 1, 2, 3, 4
}
}
ATTRIBUTE "data units" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
DATA {
(0): "N ", "m/s^2 ", "m/s^2 ", "m/s^2 "
}
}
which are rendered by `h5py` as
>>> data.attrs['data channels']
array([1, 2, 3, 4])
>>> data.attrs['data units']
array([b'N ', b'm/s^2 ', b'm/s^2 ', b'm/s^2 '], dtype='|S8')
When converting from HDF5 to Zarr, `zarr.copy_all` fails with
TypeError: Object of type ndarray is not JSON serializable
Since I have a bunch of files to convert, I implemented a quick fix in miccoli/zarr-python@380ee7c07.
I'm not sure if this is of general interest, but if there is enough interest I can open a PR.
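Roughly, the quick fix amounts to a mapping along these lines (a sketch of the idea, not the actual commit):

import numpy as np

def json_friendly(value):
    # ndarray -> (nested) list, bytes -> str, NumPy scalar -> Python scalar
    if isinstance(value, np.ndarray):
        return [json_friendly(v) for v in value.tolist()]
    if isinstance(value, bytes):
        return value.decode('ascii', errors='replace')
    if isinstance(value, np.generic):
        return value.item()
    return value

json_friendly(np.array([b'N       ', b'm/s^2   '], dtype='|S8'))
# ['N       ', 'm/s^2   ']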
Open question:
- just hardcode the `np.ndarray -> list` mapping, or, maybe better, allow the user to override the default JSONEncoder?
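A minimal sketch of the second option, overriding the encoder instead of hardcoding the mapping (hypothetical, not an existing zarr hook):

import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        # fall back to list / scalar conversion for NumPy objects
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)

json.dumps({'data channels': np.array([1, 2, 3, 4])}, cls=NumpyJSONEncoder)
# '{"data channels": [1, 2, 3, 4]}'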
See also #933 and #533