specs
specs copied to clipboard
Data Resource JSON schema definition for `schema` seems to contradict specification
Overview
Problem description
Caution: there's a lot of different uses of "schema" coming, so bear with me...
In the JSON schema definition file for a resource, data-resource.json, the formal definition for a schema
property of a Resource is:
"schema": {
"propertyOrder": 40,
"title": "Schema",
"description": "A schema for this resource.",
"type": "object"
},
This allows only JSON-objects as schema
definitions, essentially JSON-Strings. It disallows URLs or file paths.
This seemingly contradicts the human-readable specification of a schema
property:
The value for the schema property on a resource MUST be an object representing the schema OR a string that identifies the location of the schema.
(Emphasis mine).
Proposed fix
I believe the section should read:
"oneOf": [
{
"title": "Schema path",
"description": "A fully qualified URL, or a POSIX file path..",
"type": "string"
},
{
"title": "Schema as JSON",
"description": "A schema encoded as a JSON object",
"type": "object"
}
]
I am no specialist for JSON schema, but if I change it, the Java implementation works with schema
properties that are URLs and file paths.
How to reproduce
Reproduce the buggy behavior
-
Head to https://www.jsonschemavalidator.net/
-
copy the JSON content from https://frictionlessdata.io/schemas/data-resource.json into the "Schema" field on the left
-
Copy into "Input JSON" on the right the following JSON:
{ "name": "test", "schema": "https://raw.githubusercontent.com/frictionlessdata/datapackage-java/master/src/test/resources/fixtures/schema/population_schema.json", "path": "https://raw.githubusercontent.com/frictionlessdata/datapackage-java/master/src/test/resources/fixtures/data/cities.csv" }
Result: validation will fail with the following error:
Found 1 error(s) Message: Invalid type. Expected Object but got String. Schema path: #/properties/schema/type
Reproduce the proposed fix
- Insert the proposed fix into the schema JSON
- Run validation against the same JSON snippet
Result: no validation error will be shown.
Fix to YAML source
Created a pull request, please check and if working, merge
Thanks @iSnow! @rufuspollock + @pwalsh what do you think?
@iSnow @lwinfree It is one of the confusing parts in the connection between the specs and the implementations.
The idea behind this decision is that this JSON schema has to be used after:
- descriptor retrieval
- descriptor dereferencing (so-called) - loading and parsing JSON objects
For example in Python:
# Process descriptor
descriptor = helpers.retrieve_descriptor(descriptor)
descriptor = helpers.dereference_resource_descriptor(descriptor, base_path)
# validate the descriptor against JSON schema
Because of these preparational steps, there is a guarantee that resource.schema
and resource.dialect
are objects.
It works fine for dynamic languages but probably creates problems for static languages like Java (I'm not sure). If it does I would recommend adding string
as an option to JSON schema types.
PS.
The word schema
was the most confusing word when I started working on the project. It still can be confusing for me so I created a rule for myself that I say profile
when I talk about JSON Schema and schema
when I talk about FrictionlessData Table Schema. Some of our libraries use this naming convention e.g. having profiles
folders with JSON schemas.
I think @roll has summarized this well. It is a tricky one since obviously there are times a user wants to validate a data package "as is" without dereferencing but that then makes the json schema "profile" very flexible.
The idea behind this decision is that this JSON schema has to be used after:
descriptor retrieval
descriptor dereferencing (so-called) - loading and parsing JSON objects
In that scenario, obviously we'll only have objects there. I don't quite understand the stipulation that the schema is to be applied after those steps, though. IMO, this kind of undermines the rigidity of having a JSON schema in the first place, as general-purpose schema validators like https://www.jsonschemavalidator.net/ will mark perfectly valid DataPackages as invalid.
I'd always validate first and after that, de-reference the dependencies as part of the parsing step. Is it because a validator might not load the resources, either because they are local to the file system or will not want to load any old URL, and therefore isn't able to fully validate the schema?
If that's the rationale, there should be an explanation (or maybe I just missed it). The Java code I took over does not do resource de-referencing before validation, but will first validate the DataPackge descriptor JSON and then lazy validate each resource if/when it is queried - which is how I fell over the issue mentioned above. It's of course perfectly possible in Java to first resolve the URL's/file descriptors and only then go for validation, but the behavior of the Java implementation seems more correct to me.
I don't quite understand the stipulation that the schema is to be applied after those steps
I think the point was to validate a schema also because it can turn a data package to be invalid after dereferencing. I agree this area is not handled very well in the specs and probably some kind of composite approach is better when e.g. validation happens on different levels (package/resource/schema) although the current approach seems the easiest to do the trick.
PS.
Actually, I'm not sure what downside is of having object
OR url-or-path
as a type for schema/dialect
. It seems more correct regarding generic validation tools.
Actually, I'm not sure what downside is of having object OR url-or-path as a type for schema/dialect. It seems more correct regarding generic validation tools.
@roll and @rufuspollock if I may take this up again, could we extend the specs like I proposed? I can't see any downside either and it would make my life on the Java side much easier.
As it stands, a lot of the code I inherited simply switched off validation because of this issue - I'd really like to get this to a better state of things.
As far as I can tell, it would not impact the JS or Python versions at all, what do you say?
Does the path need to be to a JSON file with schema, or schema in a yaml file is good too?
FIXED in https://github.com/frictionlessdata/specs/issues/645