zarr-specs
Non-JSON metadata and attributes
As briefly discussed in the group chat, I would like to propose a change to how metadata and attributes are accessed. The current spec requires that this data be readable and writable as JSON. This is compatible with all current storage backends of Zarr and with the filesystem and cloud storage backends of N5. It is not compatible with the current HDF5 backend of N5, where attributes and metadata are represented as HDF5 attributes. Instead of requiring JSON, I suggest that metadata and attribute access be specified similarly to the group and array access protocol of the spec, i.e. as access primitives (an API). The most basic primitives would be:
`getAttribute` - Retrieve the value associated with a given `key` and `attributeKey`.
| Parameters: `key`, `attributeKey`, [`type`]
| Output: `value`

`setAttribute` - Store a (`key`, `attributeKey`, `value`) triple.
| Parameters: `key`, `attributeKey`, `value`
| Output: none
Probably also something to list attributes, and maybe infer their types if necessary. The N5 API does it this way, and I find it very straightforward to use across JSON and non-JSON backends:
https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Reader.java#L214
https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Reader.java#L271
https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Writer.java#L43
https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/N5Writer.java#L59
and the default JSON implementation, which is bloated only in order to support version 0 with non-auto-inferred compressors:
https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/AbstractGsonReader.java
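For illustration, here is a minimal Python sketch of what such backend-agnostic attribute primitives could look like, backed by an in-memory JSON-style document per node. All names (`AttributeStore`, `get_attribute`, etc.) are hypothetical, not part of any spec or of N5's actual API:

```python
import json
from typing import Any, Optional


class AttributeStore:
    """Hypothetical backend-agnostic attribute access:
    a (key, attributeKey) -> value mapping, here backed by
    one in-memory JSON-style document per node key."""

    def __init__(self) -> None:
        self._docs: dict = {}

    def get_attribute(self, key: str, attribute_key: str) -> Optional[Any]:
        # Retrieve the value associated with (key, attributeKey).
        return self._docs.get(key, {}).get(attribute_key)

    def set_attribute(self, key: str, attribute_key: str, value: Any) -> None:
        # Store a (key, attributeKey, value) triple.
        self._docs.setdefault(key, {})[attribute_key] = value

    def list_attributes(self, key: str) -> list:
        # List attribute keys; types could be inferred from stored values.
        return sorted(self._docs.get(key, {}))


store = AttributeStore()
store.set_attribute("group/array", "units", "nm")
store.set_attribute("group/array", "resolution", [4.0, 4.0, 40.0])
print(store.get_attribute("group/array", "units"))  # nm
print(store.list_attributes("group/array"))         # ['resolution', 'units']
```

A JSON backend would persist each node's document to a file, while an HDF5 backend could map the same calls onto HDF5 attributes.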
Hi Stephan,
One quick question. Currently the metadata in the v3.0 spec is not just a flat set of name/value pairs, where values are simple types like string or number. Some parts of the metadata require nesting, meaning that the value is a JSON object or array of objects. E.g., the value of chunk_grid is an object, and the value of chunk_codecs is an array of objects. How would you accommodate this if using HDF5 attributes to store metadata?
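To make the nesting concrete, here is a small Python sketch of those two values, plus one conceivable workaround for a flat attribute store such as HDF5 attributes (serialising each nested value to a JSON string attribute). This is only an illustration, not a proposed mechanism:

```python
import json

# Nested core metadata values from the v3.0 draft: chunk_grid is an
# object, chunk_codecs an array of objects -- not flat name/value pairs.
metadata = {
    "chunk_grid": {"type": "regular", "chunk_shape": [1000, 100]},
    "chunk_codecs": [{"codec": "gzip", "level": 1}],
}

# A flat store could hold each nested value as a JSON string attribute.
flat = {k: json.dumps(v) for k, v in metadata.items()}
print(flat["chunk_grid"])

# Round-trip check: the nested structure survives.
assert json.loads(flat["chunk_grid"]) == metadata["chunk_grid"]
```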
Just to generalise a bit, I think there are two possible sets of requirements here:
(1) If/how to support storage implementations which have some "native" mechanism for storing metadata (e.g., N5's HDF5 backend).
(2) If/how to support alternative encodings of metadata (e.g., MessagePack instead of JSON).
In terms of the v3.0 core protocol, do we try to create a framework that can accommodate either of these requirements, and if so, how? This might mean just providing the right foundation to allow protocol extensions to address them, rather than fully addressing them within the core protocol.
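One way such a foundation could look for requirement (2) is a registry mapping an encoding identifier to encode/decode functions, with JSON as the canonical built-in entry and others supplied by protocol extensions. A hedged Python sketch; all names here are hypothetical:

```python
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry of metadata encodings. The core protocol would
# define only the "json" entry; a protocol extension could register
# e.g. msgpack or flatbuffers under another identifier.
Encoder = Callable[[Dict[str, Any]], bytes]
Decoder = Callable[[bytes], Dict[str, Any]]

ENCODINGS: Dict[str, Tuple[Encoder, Decoder]] = {
    "json": (
        lambda doc: json.dumps(doc).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    # "msgpack": (msgpack.packb, msgpack.unpackb),  # via an extension
}

meta = {"zarr_format": 3, "shape": [10000, 1000]}
enc, dec = ENCODINGS["json"]
assert dec(enc(meta)) == meta  # lossless round trip
```

A store would then only need to record which encoding identifier a metadata document uses, e.g. via a suffix or a store-level configuration key.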
My current thinking (to prevent storing a dataset of XML) was to convert OME-XML to the upcoming OME-JSON-LD and put that in the block of metadata. Either a hierarchical JSON tree would work, or a set of triples could represent the underlying RDF. Depending on allowed keys, it's conceivable that one could map the Subject and the Predicate into a single key but it won't be attractive:
```json
"@id" : "arc:arc0",
"@type" : [ "ome:Arc", "ome:ManufacturerSpec" ],
"identifier" : "LightSource:1",
"ome:arcType" : {
    "@id" : "arcType:Xe"
},
```
https://gitlab.com/openmicroscopy/incubator/ome-owl/blob/master/ontology/RDF/JSON-LD/2016-06/sample/instrument_data.json#L274
Hi @joshmoore, I would imagine it should be fine to include some JSON-LD within a zarr array metadata document. I have to confess I don't fully grok the JSON-LD syntax, but I'd hope something like this was OK:
```json
{
    "zarr_format": "http://purl.org/zarr/spec/protocol/core/3.0",
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "chunk_codecs": [
        {
            "codec": "http://purl.org/zarr/spec/codec/gzip",
            "level": 1
        }
    ],
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4],
        "OME": {
            // some block of OME-JSON-LD
        }
    }
}
```
Someone please correct me if this doesn't work.
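For what it's worth, this shape does parse as ordinary JSON once the placeholder comment is replaced by an actual JSON-LD object. A quick Python check, using a trimmed-down document and a made-up OME fragment for illustration:

```python
import json

# Trimmed-down version of the example above; the OME value is an
# invented stand-in for a real OME-JSON-LD block.
doc = json.loads("""
{
  "zarr_format": "http://purl.org/zarr/spec/protocol/core/3.0",
  "shape": [10000, 1000],
  "data_type": "<f8",
  "chunk_grid": {"type": "regular", "chunk_shape": [1000, 100]},
  "attributes": {
    "foo": 42,
    "OME": {"@id": "arc:arc0", "@type": ["ome:Arc"]}
  }
}
""")

# The embedded JSON-LD is reachable as a plain nested object.
print(doc["attributes"]["OME"]["@id"])  # arc:arc0
```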
Just following up on this...
> (1) If/how to support storage implementations which have some "native" mechanism for storing metadata (e.g., N5's HDF5 backend).
I'm currently thinking that it's not worth the trouble to try to accommodate the way the existing N5 HDF5 backend stores metadata. This is simply because the flat name/value pair model for metadata is very restrictive, and not rich enough to express some of the basic things we want to express in the core metadata, or which some applications might want to store in user metadata (like the OME example). So I'm not planning to make any spec changes to accommodate this. Please push back if anyone disagrees.
> (2) If/how to support alternative encodings of metadata (e.g., MessagePack instead of JSON).
This is something I can see the potential value of, at least in terms of leaving the door open for it to be explored. However, I don't want to overcomplicate the core spec, so I won't try to accommodate this currently, unless someone specifically asks for it.
FWIW the way Zarr handles this problem today is to provide a way for users to copy from Zarr to HDF5. IMHO it seems reasonable to continue with that strategy going forward.
As to using an alternative to JSON, we would be interested in this. In particular protobuf came up as an interesting option.
> As to using an alternative to JSON, we would be interested in this. In particular protobuf came up as an interesting option.
Using protobuf it should certainly be possible to express all of the core metadata. One question would be how it would handle user attributes, where you cannot predefine the schema ahead of time. But maybe that can be worked around somehow. In any case, I'd be happy to figure out how to write the spec to allow for alternative metadata encodings.
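One conceivable workaround is to keep a fixed schema for the core metadata and carry the free-form user attributes as a single serialized bytes field inside it. The Python sketch below only simulates the idea with a dataclass; no real protobuf is involved, and all names are illustrative:

```python
import json
from dataclasses import dataclass


# Stand-in for a fixed-schema message (as protobuf would generate):
# typed fields for core metadata, plus one opaque bytes field that
# carries the schemaless user attributes, pre-encoded as JSON.
@dataclass
class CoreMetadata:
    shape: list
    data_type: str
    attributes_json: bytes  # free-form user attributes


msg = CoreMetadata(
    shape=[10000, 1000],
    data_type="<f8",
    attributes_json=json.dumps({"foo": 42, "bar": "apples"}).encode(),
)
print(json.loads(msg.attributes_json)["bar"])  # apples
```

The cost is that the user attributes lose the efficiency and type safety of the outer encoding, but they remain fully general.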
Interestingly, it looks like Arrow is using flatbuffers. Flatbuffers seem easier to accommodate than protobuf because of the support for unions. I'm thinking we could keep JSON as the canonical format, but could also create a flatbuffers schema for the core metadata, if only to know it was possible, i.e., to check we hadn't come up with a metadata structure that was hard to encode in something other than JSON.
> Someone please correct me if this doesn't work.
Yes, as long as there is a place to "embed" a JSON tree, I assume I can make it work. (Note: that could also be another file if that's preferable)
Just to say I've done some work on the v3.0 core protocol spec in the development branch to provide a mechanism for alternative metadata encodings to be defined and used, more info in this comment. Note that this does not address the original request in this issue from @axtimwalde to provide a mechanism to support native storage of metadata, e.g., in an HDF5 backend. However, it would provide a mechanism to support use of encodings like flatbuffers or msgpack. Comments very welcome, just food for discussion.
See also #141 and #81.