ome-zarr-py icon indicating copy to clipboard operation
ome-zarr-py copied to clipboard

Initial validate support, with --strict option

Open will-moore opened this issue 2 years ago • 15 comments

Work in progress...

I am testing the JSON schemas from ngff repo, downloading them to a cache with https://github.com/allenai/cached_path. This PR adds a validate command that validates the JSON at each Node (currently only working for Multiscales spec).

NB: some validation occurs during the init of the Nodes e.g. axes matching xyzct etc. so if that fails then we don't even get to validate against the schemas.

We use --strict flag to use strict_ schemas to show warnings where strict rules are not met.

cc @glyg

$ ome_zarr validate https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr --strict
WARNING:ome_zarr.io:version mismatch: detected:FormatV01, requested:FormatV03
WARNING:ome_zarr.io:version mismatch: detected:FormatV03, requested:FormatV01
WARNING:ome_zarr.io:version mismatch: detected:FormatV01, requested:FormatV03
WARNING:ome_zarr.io:version mismatch: detected:FormatV03, requested:FormatV01
Validating Multiscales spec at https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr/ [zgroup]
Using Multiscales schema version 0.1
WARNING 'name' is a required property
WARNING 'type' is a required property
WARNING 'metadata' is a required property
Validating Multiscales spec at https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr/labels/masks/ [zgroup]
Using Multiscales schema version 0.1
WARNING 'name' is a required property
WARNING 'type' is a required property
WARNING 'metadata' is a required property

will-moore avatar Dec 02 '21 12:12 will-moore

Codecov Report

Merging #142 (c03b0a0) into master (d729a38) will increase coverage by 0.49%. The diff coverage is 95.94%.

@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
+ Coverage   83.90%   84.39%   +0.49%     
==========================================
  Files          12       12              
  Lines        1379     1442      +63     
==========================================
+ Hits         1157     1217      +60     
- Misses        222      225       +3     
Impacted Files Coverage Δ
ome_zarr/data.py 87.00% <ø> (ø)
ome_zarr/utils.py 90.19% <88.88%> (-0.61%) :arrow_down:
ome_zarr/reader.py 86.15% <97.77%> (+1.15%) :arrow_up:
ome_zarr/cli.py 89.77% <100.00%> (+1.16%) :arrow_up:
ome_zarr/format.py 97.59% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d729a38...c03b0a0. Read the comment docs.

codecov[bot] avatar Dec 02 '21 14:12 codecov[bot]

Hi @will-moore thanks for the ping. I'm waiting a bit to advance on the rest of the validation, mainly because I'm not sure of what will be left to validate once you're done :) But maybe I'm wrong and the ticked list here is all there is?

If so I'll try to get that started soon

glyg avatar Dec 02 '21 14:12 glyg

Trying to capture knowledge from our historical experience with XML and XSD schemas, my understanding is that the semantics of JSON schemas is very similar on this particular issue and only distinguish between required and optional attributes. There is no real recommended concept and one reason for this is that recommended is loosely defined from a validation perspective. A few examples of open questions:

  • assuming a level exists between required and optional, is this a flat level or would some of the attributes more recommended than others?
  • is the recommended state context dependent i.e. is it the responsibility of the tool/application to make decisions on recommended attributes and report accordingly?

Brainstorming a few ideas:

  • from former discussions around minimal requirements, one idea would be to introduce different level of validations e.g. level 0 would be the absolute minimal specification (similar to https://docs.openmicroscopy.org/ome-model/latest/specifications/minimum.html), level 1 would require some attributes marked as optional in the base schema, level 2 would require more attributes...
  • is there some infrastructure that would allow someone to inject additional validation rules? The JSON schema would provide the base validation framework i.e. ensuring the syntax is valid and that the required properties are specified but e.g. some profile would allow to specify additional constraints?

sbesson avatar Dec 02 '21 14:12 sbesson

@glyg I wouldn't claim that the list there is complete - It's just what I picked up from the spec, looking for MUST rules. But not every detail of the spec is spelt out (it's possible to imagine a file that passed all those rules and was still invalid for some other reason). However, it's also likely that some of those unchecked items are already covered by ome-zarr-py. E.g. some axes validation at https://github.com/ome/ome-zarr-py/blob/a3f50286bcaebfd1d6f03f8e82c6cb9ee0310393/ome_zarr/reader.py#L287 (when you try to parse any zarr image) and https://github.com/ome/ome-zarr-py/blob/a3f50286bcaebfd1d6f03f8e82c6cb9ee0310393/ome_zarr/writer.py#L17 (currently only used when writing an OME-Zarr).

@sbesson The method I've used here (adding more restrictions to the schema via merging python dicts) seems to work so far and could be extended for various levels. It's still limited by the JSON schema validation, so the logic can't get too complex, but it seems worth exploring in more depth.

will-moore avatar Dec 02 '21 14:12 will-moore

@sbesson I've updated the approach to match the allOf method of combining schemas adopted at https://github.com/ome/ngff/pull/76

Re: required, optional and recommended: The spec uses MUST (required), SHOULD (recommended) and MAY (optional) rules. I don't think this is tool/application specific.

will-moore avatar Dec 03 '21 17:12 will-moore

This will fail checks until https://github.com/ome/ngff/pull/77 is merged and released on pypi so that ngff is pip installable (now a requirement of this PR).

will-moore avatar Dec 14 '21 16:12 will-moore

Updated to use the latest naming change to ome_ngff from https://github.com/ome/ngff/pull/77

$ ome_zarr validate https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.3/idr0079A/9836998.zarr --warnings
WARNING:ome_zarr.io:version mismatch: detected:FormatV03, requested:FormatV04
WARNING:ome_zarr.io:version mismatch: detected:FormatV04, requested:FormatV03
WARNING:ome_zarr.io:version mismatch: detected:FormatV03, requested:FormatV04
WARNING:ome_zarr.io:version mismatch: detected:FormatV04, requested:FormatV03
WARNING:ome_zarr.reader:'metadata' is a required property
WARNING:ome_zarr.reader:'type' is a required property
WARNING:ome_zarr.reader:'name' is a required property
WARNING:ome_zarr.reader:'metadata' is a required property
WARNING:ome_zarr.reader:'type' is a required property
WARNING:ome_zarr.reader:'name' is a required property

will-moore avatar Mar 22 '22 16:03 will-moore

Added visit() as suggested by @constantinpape in c37e116

will-moore avatar Mar 24 '22 09:03 will-moore

@satra mentioned that it would be useful to add validation of Zarrs into https://github.com/dandi/dandi-cli (After I pointed out this :smile:)

Thoughts on getting this into a release?

joshmoore avatar Jun 23 '22 10:06 joshmoore

@sbesson It looks like https://ngff.openmicroscopy.org/0.4/schemas/plate.schema is 404, but this looks like the correct URL? Using https://raw.githubusercontent.com/ome/ngff/main/0.4/schemas/plate.schema for now...

will-moore avatar Jul 04 '22 22:07 will-moore

With that last commit, renamed --warnings to --strict and added support for validating Wells and Plates.

$ ome_zarr validate https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0056B/7361.zarr --strict
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 4.2/4.2 kB
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 568/568 bytes
WARNING:ome_zarr.reader:'name' is a required property
WARNING:ome_zarr.reader:'maximumfieldcount' is a required property

Well:

$ ome_zarr validate https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0056B/7361.zarr/A/1/ --strict
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 1.2/1.2 kB
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 309/309 bytes
WARNING:ome_zarr.reader:'version' is a required property

will-moore avatar Jul 04 '22 22:07 will-moore

Thanks: currently, Multiscales spec has

            self.version = multiscales[0].get(
                "version", "0.1"
            )  # should this be matched with Format.version?

whereas elsewhere I've been using CurrentFormat().version as the default instead of "0.1". OK to update that behaviour for Multiscales to use current version as default?

will-moore avatar Jul 05 '22 14:07 will-moore

OK to update that behaviour for Multiscales to use current version as default?

At lest from my side, yes. I assume that's the spirit of the comment in https://github.com/ome/ome-zarr-py/blob/6c24f139b7f24a8d2d9b98a53d73b9e1353f0576/ome_zarr/reader.py#L284 I do not see a reason why Multiscales should not be treated differently from other specs. If the consensus of the library is to default to the latest published specification in the absence of version metadata, it makes sense to apply it everywhere.

While on the topic of version retrieval, I also noticed another location https://github.com/ome/ome-zarr-py/blob/6c24f139b7f24a8d2d9b98a53d73b9e1353f0576/ome_zarr/format.py#L67-L82 which contract is to retrieve a version and which duplicates a lot of the logic in this PR. Is there a rationale for having this in seprate places or does that point at the need for some form of Spec.get_version() API that can be consumed both by Spec.validate() and by Format.matches?

sbesson avatar Jul 05 '22 14:07 sbesson

@sbesson I've used format.get_metadata_version() (no-longer private method) in the Spec.init(). I first need to create an instance of the Format class, since this is not a static method. It could be a static method, since we don't currently have a different implementation for any of the Format subclasses, but it's possible that we might want to look-up the version differently for subclasses in the future, so I didn't change it to a static method.

will-moore avatar Jul 05 '22 15:07 will-moore

@sbesson I added a --clear_cache option that removes the cached_path directory and forces a fresh download.

will-moore avatar Jul 05 '22 15:07 will-moore