sync spec text with json examples
Dear all,
This PR is intended as a fix for #322 and #309. Some confusion was raised in these issues as to whether different version keys would be allowed in an ome-zarrs top-level zarr.json and in a following plate/well/labels groups therein.
In the current examples on the 0.5 branch (e.g., plate example), the version key is only given at the top level and not in the group-level. This is in contradiction to the current text of the plate spec, which states:
The plate dictionary MUST contain a version key whose value MUST be a string specifying the version of the plate specification.
Same for wells and labels groups examples.
Context and consequences
For context, the version key was originally moved up into the top-level ome group under the pretext of discouraging different metadata versions inside the same file according to RFC2:
While technically possible, OME-Zarr 0.5 (with Zarr v3) and OME-Zarr 0.4 (with Zarr v2) metadata could exist side-by-side in a Zarr hierarchy, it is NOT RECOMMENDED. This may be useful for short periods of time (i.e. during migrations from 0.4 to 0.5), but should not be used longer term. Multiple metadata versions can lead to conflicts, which may be hard to resolve by implementations. If implementations encounter 0.4 and 0.5 metadata side-by-side, 0.5 SHOULD be treated preferentially.
Removing the statements about the version keys in the plate/well/labels groups essentially drops the idea differing versions inside a single ome.zarr file, which I think is a reasonable step from an implementation perspective. Also, it brings the spec back into sync with the examples, which I think is important for the integrity of this project.
I modified the statement at the top-level ome group into the following to make this clear:
The OME-Zarr Metadata is stored in the various
zarr.jsonfiles throughout the above array hierarchy. The OME-Zarr Metadata version MUST be consistent within a hierarchy. The top-levelomegroup MUST contain aversionkey whose value MUST be a string specifying the version of the metadata specification.
Let me know what you think :)
cc @joshmoore @normanrz @dstansby @ziw-liu
PS: Hope I got this PR configured right - had a bit of trouble getting the knack about contributing with submodules (#323)
@will-moore is this the correct target branch for this fix? Just wondering about the failing checks.
@jo-mueller I'm not up-to-date with the strategy for PRs, or even know if this should be opened against v0.5 (which would create a v0.5.n+1 version of the spec) or if we just go for v0.6? The discussion at https://github.com/ome/ngff/issues/276 seems to have some pointers... I'll see what I can understand from that...
For existing v0.5 datasets that have version keys inside plate and well objects, will this change make them invalid OME-Zarr files?
@ziw-liu THAT is a good question. I think it's safer to just commit this into a new 0.6-dev branch. For the 0.5 branch, it might be better to simply update the examples therein and make sure that everthing is properly synced up before 0.6 comes out.
Generally I don't think there's language in the spec that prevents extra keys in the various JSON objects -- my understanding is that the spec generally describes JSON objects in positive terms (e.g., "object "foo" MUST have key "bar"), and doesn't say anything about keys an object "foo" MUST NOT have. The main effect of this change would be on any implementations that implemented the actual spec as written -- they would need to change their behavior.
Generally I don't think there's language in the spec that prevents extra keys in the various JSON objects
I agree with this observation. However in practice various implementations do prohibit extra keys. Maybe a separate clarification is needed, e.g. ome objects MAY contain keys not described in this specification.
Yes, further clarification on the presence of extra keys would be helpful. This came up yesterday here https://github.com/zarrs/ome_zarr_metadata/pull/9.
I would recommend adding language like this at the top of the spec:
Extra keys in JSON objects
Unless otherwise stated, a JSON object defined in this document MAY contain additional keys beyond those enumerated in its definition. OME-Zarr Implementations MUST allow extra keys in such JSON objects. This means that, when reading OME-Zarr data, the presence of an extra key in such a JSON object MUST NOT be treated as an error.
Whether an implementation ignores the extra keys or propagates them is up to that particular implementation. It is recommended that implementations clearly declare how they handle extra keys, with a particular emphasis on informing users in the context of irreversible operations. For example, an implementation that resaves OME-Zarr data and ignores extra keys could implement a routine that warns users when they attempt to resave OME-Zarr data with extra keys.
@d-v-b I added the proposed statement under the document conventions section. 👍
It's worth noting that if arbitrary extra keys are allowed, then this removes any future possibility of backwards compatibility between versions of the specification in future.
To give a concrete example, say someone writes data that contains a color: "blue" field somewhere in the "ome" metadata in a hypothetical version 1.2 of OME-Zarr, and then version 1.3 of OME-Zarr introduces a specifciation for the color field that says the metdata MAY have a "color" field and it MUST be a color hex string. The previous color: "blue" metadata would now be invalid. This means OME-Zarr 1.2 data could not generally also be considered valid OME-Zarr 1.3 data.
If instead arbitrary extra keys are prohibited in the "ome" metadadata, it allows the possibility of all valid metadata written with one version of the specification to also be valid with the next version.
Given OME-Zarr has reserved the top-level "ome" key in the JSON attributes, I actually now think that extra keys should be prohibited to give the option of backwards compatibility. Instead, extra keys should live outside the "ome" field.
This would also limit proliferation of "unofficial" metadata under the "ome" field, keeping the metadata there tightly controlled to what is allowed within the specifciation.
This is a good point! In terms of what is right for the future of the spec, I don't have strong feelings about whether extra keys should be allowed or not. I can definitely see the appeal of keeping the "ome" namespace tight. Users have the rest of the attributes namespace for their own stuff. On the other hand, ome-zarr might want to define some extensibiltiy semantics, in which case maybe extra keys could be allowed, as long as they can be differentiated from "official" keys (e.g., via a "omezarr." prefix or something. but that could get tedious.
But in terms of interpreting what the spec says today, I think disallowing extra keys would be potentially breaking for anyone who saved data with an extra key because they thought it was allowed.
I agree with @dstansby here that extra keys should not be allowed for 1.x versions. I don't see much harm for 0.x versions, but the question is whether we want to promote it now, if we'll disallow it later. Perhaps, we should just carve out an exception for the "version" field within the 0.5 versions.
see also: https://github.com/ome/ngff/issues/209, maybe we could get a decision here and close that issue :)
cc @thewtex since ngff-zarr saves an extra "@type" key to 0.5 multiscales metadata
Thanks all for the insight and contributions to the discussion! From what I have gathered above, an agreeable solution would be the following:
- Define "protected" keys in the json file, such as the "ome" key.
- In these protected fields, additional metadata MAY be stored transitionally. Implementations MUST NOT throw errors upon encountering these
- Starting from version x.x, (1.x?) additional metadata MUST NOT be stored in protected fields anymore. Implementations SHOULD consider encountered foreign keys as (warnings/errors)
That should get everyone plenty of time to adapt. Maybe it will also kickstart the discussion upon plug-in specifications (tables, spatialdata, etc) under protected keys outside of the "ome" key.
Neither banning nor allowing extra keys solves backwards compatibility in the general case, they just have different trade-offs. Banning additional keys breaks backwards compatibility when a key is removed from the spec (e.g. version per @ziw-liu 's comment), but allowing additional keys breaks backwards compatibility when a new key is included which someone else had already arbitarily included in their metadata (per @dstansby 's comment).
Arguably, as the spec evolves we're more likely to add keys than remove them (i.e. we should ban extra keys). However, banning additional keys causes backwards compatibility problems right now, where the problem it solves would only occur if A) we removed a field from the spec in one version, and then added it back in a future version, or B) someone is already adding arbitrary keys to their OME metadata (i.e. using it incorrectly).
IMO, the best solution is something like
Implementations SHOULD NOT raise errors (but MAY raise warnings) when additional fields are present, to allow backwards compatibility if those fields are removed in future versions of the OME-Zarr spec.
and
Users MUST NOT add arbitrary fields inside the
omeobject or sub-objects, to allow backwards compatibility if fields of the same name are added in future versions of the OME-Zarr spec.Implementations SHOULD prevent users from adding such arbitrary fields where possible.
and then spec designers try to avoid removing and then re-adding a field of the same name.
Nobody can stop a JS or python user from doing something stupid if they really try, but this at least sets expectations and guides implementations with better type systems.
Users MUST NOT add arbitrary fields inside the ome object or sub-objects, to allow backwards compatibility if fields of the same name are added in future versions of the OME-Zarr spec.
Implementations SHOULD prevent users from adding such arbitrary fields where possible.
This requires language permitting lower-level objects to relax this rule as needed, since the spec currently recommends(!) saving an unstructured dict in the multiscales metadata:
It SHOULD contain the field "metadata", which contains a dictionary with additional information about the downscaling method.
Nobody can stop a JS or python user from doing something stupid if they really try, but this at least sets expectations and guides implementations with better type systems.
IMO JS (via typescript) and python both offer everything we need to completely model this stuff. Whether people use it is a separate question.
Good point about the downside of banning keys - I agree though that the spec is more likely to add keys than remove them. If keys are removed, MUST contain could be converted to MAY contain statements, which would allow for backwards compatibility.
However, banning additional keys causes backwards compatibility problems right now
Hasn't every version of the spec so far been backwards incompatible with the previous version(s) anyway? (with few (any?) tools available to convert metadata between incompatible versions). In this pre-1.0 stage so far preserving backwards copatibility does not seem to have been a consideration.
Having language that prohibits some a particular metadata structure, but then gives implementations a get out clause to allow data to violate that metadata structure seems like a bad idea to me, and defeats the point of having a specification in the first place.
I like the suggestion in https://github.com/ome/ngff/pull/324#issuecomment-3224501418 to ban extra metadata from 1.0 onwards, to allow for backwards compatibility in the future, and in tandem with points in https://github.com/ome/ngff/pull/324#issuecomment-3228896134 that would suggest that before 1.0 we could say "additional metadata SHOULD NOT" be stored under the "ome" key" to discourage it, and then change SHOULD NOT to MUST NOT in version 1.0.
Finally, defining or recommending what implementations do is quite a new topic, and so far has not been incorporated into the specification. It is a much more general topic than just the change proposed here, so perhaps discussion about including recommendations for implementations is done elsewhere to separate concerns (and keep it easy to discuss one thing at a time)?
Maybe it makes sense to continue this discussion in a separate issue? Not to step on anyone's toes but I think the PR got a bit sidetracked from the original purpose, which was aligning spec and examples 😅 This should get some people on their way with regard to whether a version key should be present in plates/wells.
How restrictive/permissive the spec should be with respect to extra keys in the json is a matter that requires a bit more careful discussion, I think.
In order to allow this go through, I would remove the paragraph suggested by @d-v-b from the top:
Extra keys in JSON objects
Unless otherwise stated, a JSON object defined in this document MAY contain additional keys beyond those enumerated in its definition. The presence of an additional key in such a JSON object MUST NOT be treated as an error by implementations.
Whether an implementation ignores the extra keys or propagates them is up to that particular implementation. It is recommended that implementations clearly declare how they handle extra keys, with a particular emphasis on informing users in the context of irreversible operations. For example, an implementation that resaves OME-Zarr data and ignores extra keys could implement a routine that warns users when they attempt to resave OME-Zarr data with extra keys.
How restrictive/permissive the spec should be with respect to extra keys in the json is a matter that requires a bit more careful discussion, I think.
I think there are 2 topics, both of which were covered in this thread:
-
How can your PR (which removes some keys) avoid putting existing data out of spec.
I think some text that clarifies the common interpretation (that extra keys are allowed) is necessary here, because @ziw-liu asked what would happen to data already saved with the keys that are no longer described in the spec.
-
How should future versions of the spec handle extra keys. That's out of scope for this PR and IMO this discussion should continue in https://github.com/ome/ngff/issues/209.
Thanks for linking the issue! I'll write up a summary of this thread over there.
Is there a desire/need to have backwards compatibility between versions of the spec now? As far as I remember every version increment has been backwards incompatible up to now.
Hold on a minute, this isn't another proposal to silently edit (ie., change without incrementing the version number) 0.5 specification is it? If it is, please please please can we stop doing this, and make changes to the next version instead. Silently editing the specification without incrementing the version defeats the point of having a versioned spec in the first place.
good point, I wasn't even thinking in terms of versioning... how does that work again for this spec?
Presumably (because data written after this change is not compatible with the spec before this change) this
Wouldn't say so. The change just doesn't demand the version to be written in the plate/well/etc structure, but it doesn't invalidate the keys if they are there. With this change passed, the already existing version keys will just be extra keys that shouldn't throw errors if encountered by readers.
But I'm also fine with changing the target of this PR to a new 0.6dev and just leave the inadequacies of older versions in the past. That's probably the whole point of versioning...
Wouldn't say so. The change just doesn't demand the version to be written in the plate/well/etc structure, but it doesn't invalidate the keys if they are there.
I think you're saying that this change is forwards compatible (all data written before the change is valid after the change), but I'm saying that the change is not backwards compatible (not all data written after the changes is valid with the spec before the change) - in this case, if data is written without a version key, it is not valid 0.5 data (according to the first 'release' of the 0.5 specification)
(maybe I get backwards and forwards compatible the wrong way above - it always confuses me which way round they are)
As for versioning, I think the idea now would be to create a new branch (0.6dev) or something of the sort and then collect all changes for a future 0.6 there. For release, the branch should be renamed to 0.6 the branch would be added as a submodule to the main repo. The link under latest should be changed to point to the 0.6 submodule.
Maybe @joshmoore could help us there?
I think it should go in 0.5.3, because this bug originates in 0.5.
Back from vacation - maybe we can all agree that while it should or should not go into a 0.5.3 version, it should definitely be part of an upcoming 0.6? Once there is an active development head here, I'll change the target of the PR to that so we can uncouple this from the question of whether or how older versions should be modified.