can't have numpy datatypes in attributes

Open rabernat opened this issue 7 years ago • 11 comments

We are working on the zarr backend for XArray (pydata/xarray#1528). XArray likes to put all kinds of weird stuff into attributes, including numpy datatypes and even numpy arrays. This is because the netCDF data model allows attributes to have all of the same types as variables.

In zarr, by contrast, attributes have to be JSON-serializable. So this doesn't work:

import numpy as np
import zarr
za = zarr.create(shape=(1,), store='tmp_file')
za.attrs['foo'] = np.float32(0)  # raises TypeError

It raises TypeError: Object of type 'float32' is not JSON serializable.

We will need some sort of workaround for this in order to make zarr work as a store for xarray.
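One possible shape for such a workaround, sketched here with the standard library only: a JSON encoder that falls back to `tolist()`, which NumPy scalars and arrays both provide. The names `NumpyFriendlyEncoder` and `dumps_attrs` are hypothetical, and nothing here is part of zarr's API.

```python
import json


class NumpyFriendlyEncoder(json.JSONEncoder):
    """Hypothetical encoder that falls back to ``tolist()`` for NumPy-like values."""

    def default(self, obj):
        # NumPy scalars and arrays both expose tolist(), which returns
        # plain Python scalars or (nested) lists of them.
        tolist = getattr(obj, "tolist", None)
        if callable(tolist):
            return tolist()
        # Anything else is still rejected with the usual TypeError.
        return super().default(obj)


def dumps_attrs(attrs):
    """Serialize an attribute mapping, converting NumPy-like values on the way."""
    return json.dumps(attrs, cls=NumpyFriendlyEncoder, sort_keys=True)
```

With NumPy installed, `dumps_attrs({'foo': np.float32(0)})` should produce a plain JSON number. The value comes back as an ordinary Python float on load, so the float32 type itself is lost, and values whose `tolist()` result is not JSON-native (datetime64, for example) would still fail.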

rabernat avatar Oct 08 '17 04:10 rabernat

Thanks for raising, following the xarray work with interest. Are there any object types other than numpy dtype that would need special handling when going to/from JSON?

alimanfoo avatar Oct 08 '17 12:10 alimanfoo

Is this still of interest, @rabernat?

jakirkham avatar Jun 13 '18 05:06 jakirkham

I am very interested in this issue. I need to store exact binary values and datetime objects as attributes. To work around the limitations of JSON, I currently encode these attributes as strings and put the burden on the consumer of the data to correctly decode them to the actual data types. This is not ideal. Ideally, any data type that is valid for an array ought to be valid for an attribute (like it is in the netCDF model).
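The string-encoding workaround described above might look roughly like the following. This is a minimal sketch, not the commenter's actual code: the `datetime:` and `float:` prefixes and both function names are assumptions made for illustration.

```python
from datetime import datetime


def encode_attr(value):
    """Encode a value as a string that round-trips exactly (hypothetical scheme)."""
    if isinstance(value, datetime):
        return "datetime:" + value.isoformat()
    if isinstance(value, float):
        # float.hex() is an exact, lossless text form of a binary double.
        return "float:" + value.hex()
    raise TypeError(f"unsupported attribute type: {type(value).__name__}")


def decode_attr(text):
    """Inverse of encode_attr; every consumer has to know to call this."""
    kind, _, payload = text.partition(":")
    if kind == "datetime":
        return datetime.fromisoformat(payload)
    if kind == "float":
        return float.fromhex(payload)
    raise ValueError(f"unknown prefix: {kind!r}")
```

The drawback is the one already noted: the stored attribute is just a string, so a reader that does not know the convention sees `"float:0x1.8p+0"` rather than a number.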

This issue seems to be related to https://github.com/zarr-developers/zarr/issues/244 and https://github.com/zarr-developers/zarr/issues/216

One approach that might address both issues is to allow .zarray and .zattrs to use a binary serialization format (e.g. using numcodecs.MsgPack), the same way that arbitrary variable-length array elements can be encoded.

chairmank avatar Jun 21 '18 03:06 chairmank

Ideally, any data type that is valid for an array ought to be valid for an attribute (like it is in the netCDF model)

Could you please elaborate on this point a bit? What sorts of things are you imagining storing here?

jakirkham avatar Jun 21 '18 04:06 jakirkham

I would like to store as attributes any of the data types described in the "Data Type Encoding" section of the Zarr specification.

Specifically, in my real-world usage, I have encountered inconvenience with attribute values that are

  • datetime64 and timedelta
  • Floating-point numbers that I need to represent with exact precision (e.g. f8 versus f4), which JSON doesn't distinguish
    • A special problem is NaN, which has an exact representation as a Zarr/NumPy floating-point value but cannot be represented in standard JSON
  • Structs like [('R','u1'), ('G','u1'), ('B','u1'), ('A','u1')]
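To illustrate the precision and NaN points in the list above, here is one way a float attribute could carry its dtype and survive strict JSON. It is a sketch under assumptions: the `{"dtype", "value"}` layout and the function names are hypothetical, and the string spellings for non-finite values follow what the Zarr v2 spec does for `fill_value`, as far as I understand it.

```python
import json
import math
import struct


def encode_float(value, dtype):
    """Tag a float with a Zarr-style dtype string ('<f4' or '<f8')."""
    if dtype not in ("<f4", "<f8"):
        raise ValueError(f"unsupported dtype: {dtype}")
    if dtype == "<f4":
        # Round to the nearest float32 so the stored number is exactly
        # what a 4-byte float would hold.
        value = struct.unpack("<f", struct.pack("<f", value))[0]
    if math.isnan(value):
        payload = "NaN"
    elif math.isinf(value):
        payload = "Infinity" if value > 0 else "-Infinity"
    else:
        payload = value
    return {"dtype": dtype, "value": payload}


def decode_float(doc):
    """Return (value, dtype) from a document written by encode_float."""
    # float() parses "NaN", "Infinity" and "-Infinity" as well as numbers.
    return float(doc["value"]), doc["dtype"]
```

Because non-finite values are written as strings, the result can be serialized with `json.dumps(doc, allow_nan=False)`, which would otherwise raise `ValueError` for NaN.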

I am also excited by the possibility of storing attributes that are arbitrary objects, such as JSON documents, although I haven't expressly encountered this requirement yet.

It is worth noting that, in NetCDF, attribute values are really 1-dimensional arrays:

An attribute has an associated variable (the null "global variable" for a global or group-level attribute), a name, a data type, a length, and a value. The current version treats all attributes as vectors; scalar values are treated as single-element vectors.
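The data model in that quotation can be written down in a few lines. This is only an illustration of the "scalars are single-element vectors" rule; `VectorAttribute` and `make_attribute` are hypothetical names, not netCDF or zarr API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VectorAttribute:
    """An attribute with a name, a data type, and a 1-D tuple of values."""

    name: str
    dtype: str
    values: Tuple

    @property
    def length(self):
        return len(self.values)


def make_attribute(name, dtype, value):
    """Wrap scalars as single-element vectors; keep sequences as given."""
    if isinstance(value, (list, tuple)):
        return VectorAttribute(name, dtype, tuple(value))
    return VectorAttribute(name, dtype, (value,))
```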

chairmank avatar Jun 21 '18 05:06 chairmank

Sorry for the very long delay.

It is worth noting that, in NetCDF, attribute values are really 1-dimensional arrays...

This is a really great point. It does raise a question, though: would this data be better represented as an array whose attributes are themselves array values, or as a group containing many arrays?

jakirkham avatar Dec 04 '18 05:12 jakirkham

I'm using xarray/zarr and find the attributes usage constraining as well. I would like to suggest:

https://json-tricks.readthedocs.io/en/latest/

It uses the same API as json and solves many of the common use cases.

jewfro-cuban avatar Dec 27 '18 03:12 jewfro-cuban

Thanks @jewfro-cuban, I didn't know about json-tricks, looks nice. The encoding format seems generally very sensible, although I guess we'd want to avoid supporting arbitrary class instances as a potential security issue.

Is there a way we could just depend on json-tricks, but with __instance_type__ disabled?

alimanfoo avatar Jan 04 '19 16:01 alimanfoo

Is there a way we could just depend on json-tricks, but with __instance_type__ disabled?

We could always check if that shows up in the result and error out if so.
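A check of that kind could be done before any richer decoding runs, using only the standard `json` module. This sketch does not use json-tricks itself; it only screens for the `__instance_type__` key mentioned above, and `safe_loads` is a hypothetical name.

```python
import json

FORBIDDEN_KEY = "__instance_type__"


def reject_instances(pairs):
    """object_pairs_hook that refuses any JSON object carrying the marker key."""
    obj = dict(pairs)
    if FORBIDDEN_KEY in obj:
        raise ValueError(f"refusing to decode {FORBIDDEN_KEY!r} entries")
    return obj


def safe_loads(text):
    """Parse JSON, erroring out if the marker key appears at any nesting level."""
    return json.loads(text, object_pairs_hook=reject_instances)
```

The hook is called for every JSON object in the document, including nested ones, so the marker is caught wherever it appears.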

jakirkham avatar Nov 10 '19 00:11 jakirkham

I would like to store as attributes any of the data types described in the "Data Type Encoding" section of the Zarr specification.

Specifically, in my real-world usage, I have encountered inconvenience with attribute values that are

  • datetime64 and timedelta

I encountered the same problem. For me it would be enough to be able to pass a custom JSONDecoder to zarr; it would just need to be offered as an argument to open_group etc. (see https://github.com/zarr-developers/zarr-python/pull/533/files).
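For reference, a custom decoder of the sort being asked for might look like this. It is a sketch with the standard library only: the `__datetime__` and `__timedelta_s__` tags are made up for illustration, and whether zarr accepts such a class as an argument depends on the linked PR, which this example does not assume.

```python
import json
from datetime import datetime, timedelta


class TaggedDecoder(json.JSONDecoder):
    """Hypothetical decoder that turns tagged objects back into datetimes."""

    def __init__(self, **kwargs):
        kwargs.setdefault("object_hook", self._decode_tagged)
        super().__init__(**kwargs)

    @staticmethod
    def _decode_tagged(obj):
        # Only single-key objects with a known tag are converted.
        if set(obj) == {"__datetime__"}:
            return datetime.fromisoformat(obj["__datetime__"])
        if set(obj) == {"__timedelta_s__"}:
            return timedelta(seconds=obj["__timedelta_s__"])
        return obj
```

It plugs into the standard library as `json.loads(text, cls=TaggedDecoder)`. A matching encoder would be needed on the write side to produce the tagged objects.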

nritsche avatar Apr 12 '21 22:04 nritsche

I was recently hit by this very same problem, with reference to HDF5 files, which also allow for array attributes.

For example from h5dump I have

         ATTRIBUTE "data channels" {
            DATATYPE  H5T_STD_I64LE
            DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
            DATA {
            (0): 1, 2, 3, 4
            }
         }
         ATTRIBUTE "data units" {
            DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_NULLPAD;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
            DATA {
            (0): "N       ", "m/s^2   ", "m/s^2   ", "m/s^2   "
            }
         }

which are rendered by h5py as

>>> data.attrs['data channels']
array([1, 2, 3, 4])
>>> data.attrs['data units']
array([b'N       ', b'm/s^2   ', b'm/s^2   ', b'm/s^2   '], dtype='|S8')

When converting from HDF5 to Zarr, zarr.copy_all fails with

TypeError: Object of type ndarray is not JSON serializable

Since I have a bunch of files to convert I implemented a quick fix in miccoli/zarr-python@380ee7c07

I'm not sure if this is of general interest, but if there is enough interest I can open a PR.

Open question:

  • just hardcode the np.ndarray -> list mapping, or maybe better, allow the user to override the default JSONEncoder?
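As a point of comparison for the hardcoded option, the conversion could also be applied to the attributes before they reach the JSON encoder. The sketch below is not the code from the linked commit: `to_jsonable` is a hypothetical helper, it detects arrays by their `tolist()` method rather than importing NumPy, and it assumes byte strings contain ASCII text, as in the h5dump output above.

```python
import json


def to_jsonable(value):
    """Recursively convert NumPy-like values and bytes to JSON-native types."""
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        # ndarray -> (nested) list, NumPy scalar -> Python scalar.
        value = tolist()
    if isinstance(value, bytes):
        # Assumption: byte strings hold ASCII text.
        return value.decode("ascii")
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value
```

Padding in fixed-width strings is preserved as-is, and the original dtype (`|S8`, int64) is not recorded, so the round trip back to HDF5 would not be exact. A user-supplied JSONEncoder would leave that trade-off to the caller.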

See also #933 and #533

miccoli avatar Aug 06 '22 16:08 miccoli