Added json schema
This adds a pair of JSON Schema documents to the repository: one for Array metadata and one for Group metadata.
For those unfamiliar with json-schema, it's a language for validating JSON documents. You write schemas (in JSON) and tools can validate JSON objects ("instances") against that schema. For example, the following Group would be flagged as invalid, because it lacks a zarr_format field:
{
"node_type": "group",
"attributes": {
"spam": "ham",
"eggs": 42
}
}
The check-jsonschema tool can be used to validate this, but there are many alternative tools that could be used:
❯ check-jsonschema --schemafile json-schema/group.json examples/example-group/zarr.json
Schema validation errors were encountered.
examples/example-group/zarr.json::$: 'zarr_format' is a required property
Note that this only validates metadata stored within the zarr.json objects. It has no bearing on the actual data in the chunk files.
In addition to the schemas, I've included the metadata for a few examples and validated them against the JSON schemas.
This is motivated by https://github.com/zarr-developers/geozarr-spec/issues/72. geozarr can define its own json schema for the additional properties it adds.
this is awesome work tom!
This is great work.
It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.
One idea would be to just pull in all of the schemas from zarr-extensions automatically (e.g. via a program that generates the schema), and disallow in the schema unknown IDs.
We could update zarr-extensions to include separate schemas for the core ids also. That way almost everything could be pulled in just from zarr-extensions.
It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.
My natural preference is for simple / dumb solutions. In this case I'm probably fine with repeating the names since CI should immediately fail if we add some new core object but forget to update the list of fields. I think it's impossible for these to get out of sync.
We could update zarr-extensions to include separate schemas for the core ids also.
I wasn't aware of the zarr-extensions repo until after I submitted this PR. My initial preference would be to keep the JSON schema in the same repository otherwise it's (even more) likely to fall out of date as the spec evolves.
But it'd be good to figure out some way to share what's already been done there (CI / tooling maybe?) with what's proposed here. I'll take a closer look when I get a chance.
It does seem rather unfortunate to have to list all of the ids defined in the core spec redundantly in order to exclude them as valid extension names.
My natural preference is for simple / dumb solutions. In this case I'm probably fine with repeating the names since CI should immediately fail if we add some new core object but forget to update the list of fields. I think it's impossible for these to get out of sync.
If something is missing from the list of exclusions then it will just also validate as an extension, meaning the configuration doesn't get checked.
Additionally, if you make a typo in an identifier it will also just be considered an extension and validate successfully.
We could update zarr-extensions to include separate schemas for the core ids also.
I wasn't aware of the zarr-extensions repo until after I submitted this PR. My initial preference would be to keep the JSON schema in the same repository otherwise it's (even more) likely to fall out of date as the spec evolves.
Putting the separate schemas in this repo instead would also be fine, or zarr-extensions could even be merged into this repo.
But it'd be good to figure out some way to share what's already been done there (CI / tooling maybe?) with what's proposed here. I'll take a closer look when I get a chance.
That repo basically has the complement of what you have here represented as a schema.
If something is missing from the list of exclusions then it will just also validate as an extension, meaning the configuration doesn't get checked.
Mmm here's what I had in mind: With a diff like this that "forgets" to add default to the list of exclusions:
❯ git diff
diff --git a/json-schema/array.json b/json-schema/array.json
index c9d2085..f69dd1c 100644
--- a/json-schema/array.json
+++ b/json-schema/array.json
@@ -559,7 +559,6 @@
"type": "string",
"not": {
"enum": [
- "default",
"v2"
]
}
We get an error, because the example now matches both the default and the extension chunk key encodings:
❯ check-jsonschema --schemafile json-schema/array.json examples/air_temperature.zarr/air/zarr.json --verbose
Schema validation errors were encountered.
examples/air_temperature.zarr/air/zarr.json::$.chunk_key_encoding: {'name': 'default', 'configuration': {'separator': '/'}} is valid under each of {'$ref': '#/$defs/extension_chunk_key_encoding'}, {'$ref': '#/$defs/default_chunk_key_encoding'}
But if I forget to add v2 instead, then this passes. Maybe that's what you were saying? I guess if we have 100% coverage of the core spec then we'd be OK...
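To make the asymmetry concrete, here is a toy model of the behaviour, not the real validator. It assumes the chunk key encoding is checked with "exactly one branch must match" semantics, as the error message above suggests. The helper names (`matches_core`, `matches_extension`, `one_of`) are made up for illustration.

```python
def matches_core(encoding, name):
    """Core branch: matches when the encoding's name equals the core name."""
    return encoding.get("name") == name


def matches_extension(encoding, excluded):
    """Extension branch: any string name that is not in the exclusion list."""
    name = encoding.get("name")
    return isinstance(name, str) and name not in excluded


def one_of(encoding, excluded, core_names=("default", "v2")):
    """Return (is_valid, matching_branches); valid means exactly one match."""
    hits = [n for n in core_names if matches_core(encoding, n)]
    if matches_extension(encoding, excluded):
        hits.append("extension")
    return len(hits) == 1, hits


# The only example document uses the "default" encoding.
default_enc = {"name": "default", "configuration": {"separator": "/"}}

print(one_of(default_enc, excluded=["default", "v2"]))  # complete list
print(one_of(default_enc, excluded=["v2"]))             # forgot "default"
print(one_of(default_enc, excluded=["default"]))        # forgot "v2"
```

With only a `default` example, the third call still reports a single match, so the missing `v2` exclusion goes unnoticed until some example actually uses `v2`.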
What do you think about a tool that checks that we didn't forget any keys, rather than generating the json-schema files? That sounds pretty straightforward to write and run in CI.
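A rough sketch of what such a check could look like. It assumes a layout that may not match the real files: each core definition under `$defs` is named `<something>_<kind>` and pins its name with `properties.name.const`, while the `extension_<kind>` definition carries the exclusion list at `properties.name.not.enum`. The function name `missing_exclusions` is hypothetical.

```python
def missing_exclusions(schema, kind):
    """List core names of the given kind that the extension branch fails to exclude.

    Assumed layout (hypothetical): "$defs" holds "<id>_<kind>" entries whose
    name is fixed via properties.name.const, plus an "extension_<kind>" entry
    whose properties.name.not.enum lists the names to exclude.
    """
    defs = schema.get("$defs", {})
    suffix = "_" + kind
    extension = defs["extension" + suffix]
    excluded = set(extension["properties"]["name"]["not"]["enum"])

    core = set()
    for key, definition in defs.items():
        if key.endswith(suffix) and not key.startswith("extension_"):
            const = definition.get("properties", {}).get("name", {}).get("const")
            if const is not None:
                core.add(const)
    return sorted(core - excluded)


# Inline stand-in for json-schema/array.json with "v2" left out of the list.
schema = {
    "$defs": {
        "default_chunk_key_encoding": {
            "properties": {"name": {"const": "default"}}
        },
        "v2_chunk_key_encoding": {
            "properties": {"name": {"const": "v2"}}
        },
        "extension_chunk_key_encoding": {
            "properties": {
                "name": {"type": "string", "not": {"enum": ["default"]}}
            }
        },
    }
}

print(missing_exclusions(schema, "chunk_key_encoding"))
```

In CI this would load the real schema with `json.load` and fail the job whenever the returned list is non-empty.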
Additionally, if you make a typo in an identifier it will also just be considered an extension and validate successfully.
Yeah, that seems like a problem... But the core schema can't know anything about the extension schemas, I think. I'm not an expert on json-schema, but perhaps this is why STAC includes a stac_extensions array in its core spec, and STAC-specific tools know to load all of the json-schemas at those URIs and validate the document against each (xref https://github.com/zarr-developers/zarr-specs/issues/316).
Thanks, @TomAugspurger! Happy to help get the permanent, resolvable URI. (My instinct is to put all of this under a v3/ directory.)