
Support for inf, nan, binary data in attributes

Open jbms opened this issue 2 years ago • 12 comments

In both v2 and v3 specs, attributes are stored as JSON which means they are subject to the limitations of JSON:

  • Infinity and NaN cannot be represented, although currently zarr-python supports these via a non-standard extension to the JSON format (directly encoding them as unquoted Infinity or NaN within the JSON output). This support may be accidental rather than intentional, though.
  • Byte strings (as opposed to Unicode strings) cannot be represented.
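For illustration, Python's standard `json` module exhibits exactly this non-standard behavior by default, which is likely where the zarr-python extension comes from:

```python
import json

# By default, Python's json module emits the non-standard unquoted tokens.
print(json.dumps({"a": float("inf"), "b": float("nan")}))
# {"a": Infinity, "b": NaN}  -- not valid JSON per RFC 8259

# Strict mode refuses to serialize these values at all.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("strict encoder rejects NaN:", exc)
```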

The v3 spec suggests the possibility of using a binary encoding like CBOR for metadata, which would solve these issues. However, my understanding is that CBOR would not be the default.

jbms avatar May 04 '22 19:05 jbms

Also related ( https://github.com/zarr-developers/zarr-specs/issues/81 )

Somewhat related more general Zarr Python discussion ( https://github.com/zarr-developers/zarr-python/issues/216 )

jakirkham avatar Jun 15 '22 18:06 jakirkham

Here's the current code in Zarr Python for handling these values. One thought might be to graduate this to inclusion in the spec as-is. I don't know if there are other approaches worth considering.

For more complex objects Base64 encoding has come up as an idea.

jakirkham avatar Jun 15 '22 19:06 jakirkham

One difficulty with the current zarr-python approach is that it means the "JSON" metadata is not actually spec-compliant JSON and cannot be parsed by the JavaScript JSON.parse function or by the popular C++ nlohmann json library. However, other libraries like RapidJSON can handle the Infinity/NaN extension.

For binary data it is certainly possible to encode it as a base64 string. However, the same problem arises as with encoding Infinity/NaN as JSON strings --- you lose the type information, so even if such values are converted automatically when encoding to JSON, there is no way to decode them automatically, since they are indistinguishable from ordinary strings.
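A minimal sketch of the ambiguity: once bytes are base64-encoded into a plain JSON string, nothing distinguishes them from a string attribute that happens to contain the same characters:

```python
import base64
import json

attrs = {"payload": base64.b64encode(b"\x00\x01\x02").decode("ascii")}
doc = json.dumps(attrs)

# On the way back we only see a string; the reader cannot know whether the
# user stored the bytes b"\x00\x01\x02" or the literal string "AAEC".
restored = json.loads(doc)
print(type(restored["payload"]), restored["payload"])
```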

In general I think it would be valuable for zarr v3 attributes to be able to "round trip" any supported zarr v3 data type, since storing a data value as an attribute is a very common use case and it would be nice for every individual use case not to have to invent its own convention.

jbms avatar Jun 15 '22 19:06 jbms

Being able to parse with normal JSON libraries makes sense.

Well the type information of the fill_value should match the data_type of the array, right?

Or is the concern the fill_value wouldn't be understood to be base64 encoded? In comment ( https://github.com/zarr-developers/zarr-python/issues/216#issuecomment-350503435 ), the syntax below was proposed. Maybe there are alternative ways to capture this?

    "fill_value": {
        "base64": "..."
    }

Are there other concerns missed here?

jakirkham avatar Jun 15 '22 19:06 jakirkham

For fill_value the data type is already known, so there isn't an issue there. zarr-python for v2 already uses a different encoding for fill_value --- infinity is encoded as "Infinity" rather than Infinity.
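Because the array's data_type is stored alongside it, the quoted form is unambiguous for fill_value; a hypothetical decoder (the function name and `dtype_kind` parameter are assumptions for illustration, not zarr-python API) only needs something like:

```python
def decode_fill_value(raw, dtype_kind):
    """Decode a v2-style fill_value (hypothetical helper, not zarr-python).

    For a float dtype ("f"), the quoted strings "Infinity"/"-Infinity"/"NaN"
    can only mean the special float values, never string data, because the
    array's data_type already tells us the fill value must be a float.
    """
    if dtype_kind == "f" and raw in ("Infinity", "-Infinity", "NaN"):
        return float(raw)
    return raw

print(decode_fill_value("Infinity", "f"))   # inf
print(decode_fill_value("Infinity", "U"))   # the string 'Infinity'
```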

I think it would be nice if fill_value can be expressed in a readable way but that isn't critical.

I intended this issue to be about user-defined attributes, though. For example, a common use case might be to store the min/max range of an array, or the bucket partitions for a histogram as attributes.

jbms avatar Jun 15 '22 19:06 jbms

Ah sorry, I thought from our conversation earlier that this was fill_value-focused.

That being said, the issue is more or less the same in either case: we are storing values in JSON that may not be representable using standard JSON types.

In these cases we already know how we would encode the data when writing a binary file. That same strategy could be leveraged to binary-encode those values as if they were going into a chunk, then convert to base64 before storing them in the JSON as a string (possibly with an object indicating the encoding, like above).
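A sketch of that idea using only the standard library, with `struct` standing in for the chunk codec (the function names and the `"encoding"`/`"base64"` object keys are assumptions for illustration):

```python
import base64
import struct

def encode_attr_like_chunk(value, fmt="<d"):
    """Binary-encode a value as it would be laid out in a chunk (here a
    little-endian float64 via struct), then base64 it for JSON, tagging the
    object so a reader knows how to decode it."""
    raw = struct.pack(fmt, value)
    return {"encoding": fmt, "base64": base64.b64encode(raw).decode("ascii")}

def decode_attr_like_chunk(obj):
    raw = base64.b64decode(obj["base64"])
    (value,) = struct.unpack(obj["encoding"], raw)
    return value

encoded = encode_attr_like_chunk(float("inf"))
print(encoded)                          # tagged, spec-compliant JSON object
print(decode_attr_like_chunk(encoded))  # inf
```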

I understand the concern about human readability. With things like NaN and Infinity, we could quote them if that is sufficient. Zarr Python should be doing this currently; if it is not, that is a bug. We could also attempt to JSON-encode these values in a readable way where possible, falling back to base64 only when that is not an option.

jakirkham avatar Jun 16 '22 06:06 jakirkham

The difference between fill_value and user-defined attributes is that for fill_value, the data type is specified elsewhere in the metadata and can be used to decode whatever representation is used.

For example, we probably want to distinguish between the user storing an attribute "a" with a value of the string "Infinity" and the user storing an attribute "a" with a value of the floating-point number Infinity. If both are encoded as {"a": "Infinity"} then that is not possible.

For user-defined attributes, if we want to support arbitrary data values, then we need to decide on the data model and there needs to be a way to unambiguously encode them.

For example, we might choose a data model of JSON augmented with support for +/-Infinity, NaN, and byte strings.

If we want to stick with encoding attributes as valid JSON, we could define an escaping mechanism so that the following Python attribute dictionary:

dict(a="Infinity", b=b"some binary representation", c=float("inf"))

is encoded as the following JSON metadata representation:

{..., "attribute_escape_key": "xxx", "attributes": {"a": "Infinity", "b": {"xxx": "binary", "value": "base64 representation..."}, "c": {"xxx": "Infinity"}}}

When encoding, the implementation must choose an escape key "xxx" that does not occur as an actual key in any attribute value that is a dictionary.
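A sketch of that escaping mechanism (the function names and the fixed `"xxx"` escape key are assumptions for illustration; a real implementation would pick a collision-free key dynamically, as described above):

```python
import base64
import math

ESCAPE_KEY = "xxx"  # assumed fixed here; must not collide with real dict keys

def encode_value(v):
    """Encode one attribute value into spec-compliant JSON types."""
    if isinstance(v, float) and math.isinf(v):
        return {ESCAPE_KEY: "Infinity" if v > 0 else "-Infinity"}
    if isinstance(v, float) and math.isnan(v):
        return {ESCAPE_KEY: "NaN"}
    if isinstance(v, bytes):
        return {ESCAPE_KEY: "binary",
                "value": base64.b64encode(v).decode("ascii")}
    if isinstance(v, dict):
        return {k: encode_value(x) for k, x in v.items()}
    if isinstance(v, list):
        return [encode_value(x) for x in v]
    return v  # str, int, finite float, bool, None pass through unchanged

def decode_value(v):
    """Invert encode_value, using the escape key to restore types."""
    if isinstance(v, dict) and ESCAPE_KEY in v:
        tag = v[ESCAPE_KEY]
        if tag == "binary":
            return base64.b64decode(v["value"])
        return float(tag)  # "Infinity", "-Infinity", or "NaN"
    if isinstance(v, dict):
        return {k: decode_value(x) for k, x in v.items()}
    if isinstance(v, list):
        return [decode_value(x) for x in v]
    return v

attrs = {"a": "Infinity", "b": b"some binary representation", "c": float("inf")}
print(encode_value(attrs))
```

Note that the string `"Infinity"` passes through untouched while the float becomes a tagged object, so the two round-trip distinguishably.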

jbms avatar Jun 16 '22 17:06 jbms

I generally like the idea of encoding things as objects if necessary. If possible, I'd prefer to have one clear way of identifying them, e.g. {"@type": "...", ...}.

joshmoore avatar Jun 22 '22 12:06 joshmoore

Would be great to get the unidata perspective here. NetCDF supports typed attributes, so they have had to confront this already with nczarr. cc @WardF and @DennisHeimbigner.

rabernat avatar Jun 22 '22 13:06 rabernat

First, note that netcdf-c does not allow multiple occurrences of an attribute with the same name, even if the types are different. I am not sure if this is relevant, but NCZarr adds an extra attribute named _nczarr_attr whose value is a JSON dictionary that contains extra information about the other attributes defined by .zattrs. Specifically, it contains a key called "types" that maps each attribute name to a type such as "<i4".
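A rough illustration of that shape (an assumed example of what such a .zattrs document could look like, not copied from an actual NCZarr file):

```json
{
    "valid_min": 0,
    "valid_max": 100,
    "_nczarr_attr": {
        "types": {
            "valid_min": "<i4",
            "valid_max": "<i4"
        }
    }
}
```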

DennisHeimbigner avatar Jun 22 '22 19:06 DennisHeimbigner

Good discussion this evening around this topic during the community meeting. Present company seems to be leaning towards a JSON-object-style encoding of one form or another. See the notes for more.

joshmoore avatar Sep 07 '22 19:09 joshmoore

I'm proposing to go forward with JSON encoding as the default for v3 for now, explicitly allowing arbitrary JSON values for a given key in the user attributes, where the key itself must be a string. This is effectively the same as what the spec already defines, since JSON object literals must have strings as keys. See #173.

This would currently not allow good representations of inf/nan or byte strings in user attributes. (As mentioned above, there's special handling for fill values.) If somebody were to create a PR to add this to the current spec, I think it would be valid to include it.

Alternative encodings to JSON may be added by extensions later, especially if user attributes may be stored in a separate document (see #72).

Just as a side note: YAML supports inf/nan/binary as well as custom type definitions. It might be a useful extension, and IMO it is more human-readable than CBOR.
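For example, YAML 1.1 has native spellings for the special floats and a `!!binary` tag for byte strings (a sketch with assumed attribute names; the payload decodes to the bytes "Hello Zarr"):

```yaml
# Special floats and binary data that plain JSON cannot round-trip:
min: -.inf
max: .inf
missing: .nan
payload: !!binary |
  SGVsbG8gWmFycg==
```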

jstriebel avatar Nov 23 '22 13:11 jstriebel