extra fields in metadata
As far as I can tell, all of the metadata objects in the spec permit additional fields. Should we impose any restrictions on this? My concern is that if we allow implementations to make unlimited additions to the ome-ngff metadata, then users will be tempted to add fields, but as those fields are not explicitly part of the spec, there will be no guarantee that those fields survive a round-trip through arbitrary ome-ngff-compatible tools.
For something like multiscale metadata, there's already an unconstrained metadata field in which users or applications can put anything they want. Given this option, it seems reasonable to then require that users refrain from adding extra fields to the rest of the multiscales dictionaries.
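To make the distinction concrete, here is a rough sketch (the multiscales entry is abbreviated and the values are invented): the `metadata` field inside a multiscales entry is already a free-form container, whereas `my-custom-key` is an extra field that the spec does not define and that a round-trip may not preserve.

```json
{
  "multiscales": [
    {
      "name": "example",
      "datasets": [],
      "metadata": {
        "downsampling-method": "mean",
        "anything-else": "goes here"
      },
      "my-custom-key": "not defined by the spec"
    }
  ]
}
```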
Summary of discussion from #324
1. Current State and Interpretation
- The specification does not explicitly forbid extra keys in JSON objects. It only defines required keys (e.g., "object 'foo' MUST have key 'bar'").
- Inconsistency in implementations: Some tools prohibit extra keys, while others allow them.
- The PR's original goal was to align the spec with existing examples, but it triggered a broader discussion about extra keys.
2. Arguments For Allowing Extra Keys
- Backward Compatibility: Allows existing data to remain valid if a key is removed in future versions.
- Flexibility: Users and implementations may need to store custom or transitional metadata.
- Practical Reality: Some tools (e.g., ngff-zarr) already add extra keys (e.g., `@type`). Banning them could break existing workflows.
3. Arguments Against Allowing Extra Keys
- Future-Proofing: Arbitrary keys could conflict with future spec updates.
  Example: A user-defined `color: "blue"` could clash with a future spec's `color` field.
- Namespace Pollution: Unofficial metadata in the `ome` namespace could complicate maintenance and validation.
- Backward Incompatibility Risk: Data with arbitrary keys may become invalid if the spec later defines those keys.
4. Proposed Transition Plan
0.x Versions (Current and Near-Term)
- Allow extra keys but discourage their use.
- Implementations:
  - MUST NOT treat extra keys as errors.
  - SHOULD warn users about extra keys, especially during irreversible operations (e.g., resaving data).
- Example language:
  "Unless otherwise stated, a JSON object defined in this document MAY contain additional keys beyond those enumerated in its definition. Implementations MUST NOT treat extra keys as errors but MAY warn users."
1.x Versions (Future)
- Prohibit extra keys in protected namespaces (e.g., `ome`) to ensure backward compatibility.
- Users:
  - MUST NOT add arbitrary fields inside the `ome` object or its sub-objects.
  - Should store custom metadata outside the `ome` field or use a reserved prefix (e.g., `omezarr.`).
- Implementations:
  - SHOULD prevent users from adding arbitrary fields where possible.
5. Implementation Guidance
- Implementations should declare how they handle extra keys (ignore, warn, or propagate).
- The spec should avoid removing and re-adding keys to minimize conflicts.
6. Consensus and Next Steps
- The discussion was considered too broad for the current PR, which focused on aligning spec text with examples.
- A separate issue (#209) was suggested to finalize the policy on extra keys.
- The proposed permissive language was removed from this PR to keep it focused.
Open Questions
- Should the spec explicitly ban extra keys in 1.0, or just strongly discourage them?
- How should the spec handle existing data with extra keys during transitions?
- Should implementation behavior (e.g., warnings, errors) be part of the spec or left to documentation?
Note: The debate reflects the balance between flexibility for users and strictness for future compatibility. The community leans toward a gradual transition—permissive in 0.x, stricter in 1.0—with clear guidance for both users and implementations.
Thank you, chatbot.
> Inconsistency in implementations: Some tools prohibit extra keys, while others allow them.
More pertinently, some tools use extra keys in metadata (https://ngff-zarr.readthedocs.io/en/latest/)
So, maybe one requirement for tools that write extra metadata could be to prefix every custom metadata field with the name/version of the tool that wrote it?
that would prevent multiple separate tools from working with the same extra metadata field.
Something like this?
```json
{
  "metadata": [
    {
      "tool": "ngff-zarr",
      "tool-version": "x.x.x",
      "...tool metadata": {
        "key": "value"
      }
    }
  ]
}
```
Edit: This could be the required structure for extra JSON keys from 1.0.0 onwards. Until then, the spec could be kept in its current state of neither explicitly allowing nor prohibiting extra keys, along with a roadmap that tells users how and when permissions will change.
There will be breaking changes at that point (but maybe these are inevitable by that time?) but it should make maintaining backwards compatibility much easier/safer afterwards.
The problem is not figuring out which tool wrote an extra field. It's more basic: how should readers handle extra fields at all (whether or not those fields disclose where they came from)?
I think there are 4 options:
- Readers raise an error when encountering unknown fields
- Readers ignore unknown fields
- Readers pass unknown fields through (they read them, but make no effort to interpret them)
- Unknown fields are allowed, but only if they comply with a specific structure, and all OME-Zarr readers MUST handle unknown fields with this structure. Maybe valid unknown fields MUST have a `"README"` key that MUST be a string, etc. (see the sketch after this comment)
For a simple reader, the first option is the easiest. For writers who feel constrained by the OME-Zarr model, 3 and 4 would give them a way to use the model with augmentations. But it places a greater burden on readers.
Given that zarr arrays and groups already have a place to put extra fields (the rest of the attributes namespace), I'm not sure how valuable options 2-4 are. But if every implementation author is happy writing extra code to handle unknown extra keys, then maybe 2 or 3 or even 4 look good.
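To make option 4 concrete, a hypothetical structured unknown field might look like the sketch below. The `acme-segmentation` key, its contents, and the `README` convention are all invented for illustration; nothing like this is currently defined in the spec, and the surrounding metadata is abbreviated.

```json
{
  "multiscales": [],
  "acme-segmentation": {
    "README": "Label colors used by a hypothetical viewer plugin; safe to ignore.",
    "label-colors": {
      "1": "#ff0000",
      "2": "#00ff00"
    }
  }
}
```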
to guide any spec changes, we should first ask people who are actively using additional fields what their use case is.
That will lead to a conversation about whether the same use case could be supported in a way that's easier for simple readers (e.g., moving those external fields to a separate namespace in "attributes").
If there's no way for these users to do without adding their own keys to metadata, then it might make sense to bake semantics for metadata extensibility into the ome-zarr spec. Otherwise, my preference would be to disallow extra keys, in the interest of keeping the spec simpler for readers.
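As a sketch of what "moving those external fields to a separate namespace in attributes" could look like (the `acme-tool` key and its contents are made up, and the `ome` content is abbreviated): the OME-Zarr metadata stays untouched under the `ome` key, and tool-specific fields live beside it in the group's attributes.

```json
{
  "attributes": {
    "ome": {
      "version": "0.5",
      "multiscales": []
    },
    "acme-tool": {
      "display-settings": {
        "channel-0-color": "#00ff00"
      }
    }
  }
}
```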
> But if every implementation author is happy writing extra code to handle unknown extra keys, then maybe 2 or 3 or even 4 look good.
The attraction of option 2 is that implementation authors don't have to write extra code to handle unknown extra keys.
If a tool is only a reader and doesn't edit or write the data elsewhere, then option 2 is the simplest and the most common behaviour among existing tools AFAIK.
I think you are always going to get some tools that choose to do that. E.g. if I have a very light-weight tool that just generates thumbnails, it's not going to try to validate every field in the OME-Zarr and raise an exception if something extra is found.
Re: "people who are actively using additional fields"...
I guess you could consider both the bioformats2raw.layout and the omero metadata to be cases where those tools wrote ~namespaced data. I'm more familiar with the latter, but I think in both cases, the motivation was to preserve metadata that was useful to other tools reading the images.
In the case of bioformats2raw.layout, the /OME/METADATA.ome.xml was originally used by the raw2ometiff tool to pass that metadata on into OME-TIFF.
For omero, this was used to preserve channel and rendering info when exporting data from OMERO.
In both cases, the convention has been documented in the "transitional" spec and other tools have evolved to read and write this metadata.
@will-moore you raise a good point with the thumbnail example -- some tools might not actually care about fully modelling the OME-Zarr hierarchy, and instead they query the metadata for a few particular keys. The only problem for these tools is failing to find the particular keys they care about, which makes such tools totally agnostic to extra keys.
On the other hand, supporting extra keys does require extra code for implementations that try to make a complete model of OME-Zarr data.
For readers, I lean to Option 2. I feel like the effort of trying to interpret or make sense of undocumented metadata shouldn't fall on the developer of readers. At the same time, I wouldn't want my reader to fail simply because it encounters something unexpected.
From an end-user perspective, I imagine I would be a bit put off if I converted my data into a format that is supposed to be "universal" and then had the next reader I tried fail because it encountered some unknown metadata.
The omero metadata is actually a good example of metadata that is relevant to some tools, but not to others. If there is ever something like plugin specifications, I could see that going there.
omero was documented in the spec, so it's an optional key, not an extra key.
also worth keeping in mind -- the conversation about readers is a little different than the conversation about writers. We might want readers to tolerate things we ask writers not to do. So it would be consistent to say that readers should allow extra keys, but writers should not create them. This keeps runtime errors low (because reading is fault-tolerant), but also encourages consistent metadata (because writing is strict).
Suggestion:
Extra keys in JSON objects
Unless otherwise stated, a JSON object defined in this document MAY contain additional keys beyond those enumerated in its definition. The presence of an additional key in such a JSON object MUST NOT be treated as an error by implementations.
Whether an implementation ignores the extra keys or propagates them is up to that particular implementation. It is recommended that implementations clearly declare how they handle extra keys, with a particular emphasis on informing users in the context of irreversible operations. For example, an implementation that resaves OME-Zarr data and ignores extra keys could implement a routine that warns users when they attempt to resave OME-Zarr data with extra keys.
In specification versions beyond 1.0.0, extra keys MUST NOT be written by implementations under the keys defined in this document. If implementations need to write additional metadata beyond what is enumerated in this document, they MUST use a different namespace. Under such extra namespaces, metadata follows these conventions:
- The group below the extra key MUST provide a key `tool`, which identifies the tool or library that generated the extra metadata.
- The group below the extra key MUST provide a key `tool-version`, which identifies the version of the tool or library that generated the extra metadata.
- The group below the extra key MUST provide a key `<tool>-metadata`, which contains a JSON object with arbitrary metadata defined by the tool or library that generated the extra metadata.

In versions below 1.0.0, extra keys SHOULD NOT be written under the keys defined in this document. Implementations MAY transitionally write extra keys and SHOULD provide documentation on their usage.
Not sure if namespace is a well-defined term - there's probably a better phrasing for this.
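For illustration, extra metadata following the suggested convention might look like the sketch below. The `ngff-zarr` namespace, the version string, and the placement next to the `ome` key are assumptions for the example, not part of any current spec, and the `ome` content is abbreviated.

```json
{
  "ome": {
    "version": "0.5",
    "multiscales": []
  },
  "ngff-zarr": {
    "tool": "ngff-zarr",
    "tool-version": "0.1.0",
    "ngff-zarr-metadata": {
      "key": "value"
    }
  }
}
```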
where would this text live? If the goal is to add it to the spec, I think each spec document should be its own authority. That means making no promises about future versions (the most I would say about future specs is that "x behavior may change in a future version"), and I would say nothing about older specs -- they can speak for themselves.
to my knowledge the semantics of the spec version scheme is completely undefined, so 0.6 is "allowed" to be completely breaking relative to 0.5 (in the same way that 0.5 was totally breaking to 0.4). This limits the statements that can be made about future versions.
True that... the question would be what would come closest to a deprecation warning. The text above could be a candidate for a 0.6 version, without the references to past and future versions?