specs icon indicating copy to clipboard operation
specs copied to clipboard

Data Resource JSON schema definition for `schema` seems to contradict specification

Open iSnow opened this issue 4 years ago • 7 comments

Overview

Problem description

Caution: there's a lot of different uses of "schema" coming, so bear with me...

In the JSON schema definition file for a resource, data-resource.json, the formal definition for a schema property of a Resource is:

"schema": {
  "propertyOrder": 40,
  "title": "Schema",
  "description": "A schema for this resource.",
  "type": "object"
},

This allows only JSON-objects as schema definitions, essentially JSON-Strings. It disallows URLs or file paths.

This seemingly contradicts the human-readable specification of a schema property:

The value for the schema property on a resource MUST be an object representing the schema OR a string that identifies the location of the schema.

(Emphasis mine).

Proposed fix

I believe the section should read:

"oneOf": [
    {
      "title": "Schema path",
      "description": "A fully qualified URL, or a POSIX file path..",
      "type": "string"
    },
    {
      "title": "Schema as JSON",
      "description": "A schema encoded as a JSON object",
      "type": "object"
    }
]

I am no specialist for JSON schema, but if I change it, the Java implementation works with schema properties that are URLs and file paths.

How to reproduce

Reproduce the buggy behavior

  • Head to https://www.jsonschemavalidator.net/

  • copy the JSON content from https://frictionlessdata.io/schemas/data-resource.json into the "Schema" field on the left

  • Copy into "Input JSON" on the right the following JSON:

     {
      "name": "test",
      "schema": "https://raw.githubusercontent.com/frictionlessdata/datapackage-java/master/src/test/resources/fixtures/schema/population_schema.json",
      "path": "https://raw.githubusercontent.com/frictionlessdata/datapackage-java/master/src/test/resources/fixtures/data/cities.csv"
     }
    

Result: validation will fail with the following error:

Found 1 error(s) Message: Invalid type. Expected Object but got String. Schema path: #/properties/schema/type

Reproduce the proposed fix

  • Insert the proposed fix into the schema JSON
  • Run validation against the same JSON snippet

Result: no validation error will be shown.

Fix to YAML source

Created a pull request, please check and if working, merge

iSnow avatar Oct 31 '19 15:10 iSnow

Thanks @iSnow! @rufuspollock + @pwalsh what do you think?

lwinfree avatar Oct 31 '19 17:10 lwinfree

@iSnow @lwinfree It is one of the confusing parts in the connection between the specs and the implementations.

The idea behind this decision is that this JSON schema has to be used after:

  • descriptor retrieval
  • descriptor dereferencing (so-called) - loading and parsing JSON objects

For example in Python:

# Process descriptor
descriptor = helpers.retrieve_descriptor(descriptor)
descriptor = helpers.dereference_resource_descriptor(descriptor, base_path)
# validate the descriptor against JSON schema

Because of these preparational steps, there is a guarantee that resource.schema and resource.dialect are objects.

It works fine for dynamic languages but probably creates problems for static languages like Java (I'm not sure). If it does I would recommend adding string as an option to JSON schema types.

PS. The word schema was the most confusing word when I started working on the project. It still can be confusing for me so I created a rule for myself that I say profile when I talk about JSON Schema and schema when I talk about FrictionlessData Table Schema. Some of our libraries use this naming convention e.g. having profiles folders with JSON schemas.

roll avatar Nov 01 '19 12:11 roll

I think @roll has summarized this well. It is a tricky one since obviously there are times a user wants to validate a data package "as is" without dereferencing but that then makes the json schema "profile" very flexible.

rufuspollock avatar Nov 02 '19 15:11 rufuspollock

The idea behind this decision is that this JSON schema has to be used after:

descriptor retrieval

descriptor dereferencing (so-called) - loading and parsing JSON objects

In that scenario, obviously we'll only have objects there. I don't quite understand the stipulation that the schema is to be applied after those steps, though. IMO, this kind of undermines the rigidity of having a JSON schema in the first place, as general-purpose schema validators like https://www.jsonschemavalidator.net/ will mark perfectly valid DataPackages as invalid.

I'd always validate first and after that, de-reference the dependencies as part of the parsing step. Is it because a validator might not load the resources, either because they are local to the file system or will not want to load any old URL, and therefore isn't able to fully validate the schema?

If that's the rationale, there should be an explanation (or maybe I just missed it). The Java code I took over does not do resource de-referencing before validation, but will first validate the DataPackge descriptor JSON and then lazy validate each resource if/when it is queried - which is how I fell over the issue mentioned above. It's of course perfectly possible in Java to first resolve the URL's/file descriptors and only then go for validation, but the behavior of the Java implementation seems more correct to me.

iSnow avatar Nov 02 '19 17:11 iSnow

I don't quite understand the stipulation that the schema is to be applied after those steps

I think the point was to validate a schema also because it can turn a data package to be invalid after dereferencing. I agree this area is not handled very well in the specs and probably some kind of composite approach is better when e.g. validation happens on different levels (package/resource/schema) although the current approach seems the easiest to do the trick.

PS. Actually, I'm not sure what downside is of having object OR url-or-path as a type for schema/dialect. It seems more correct regarding generic validation tools.

roll avatar Nov 04 '19 11:11 roll

Actually, I'm not sure what downside is of having object OR url-or-path as a type for schema/dialect. It seems more correct regarding generic validation tools.

@roll and @rufuspollock if I may take this up again, could we extend the specs like I proposed? I can't see any downside either and it would make my life on the Java side much easier.

As it stands, a lot of the code I inherited simply switched off validation because of this issue - I'd really like to get this to a better state of things.

As far as I can tell, it would not impact the JS or Python versions at all, what do you say?

iSnow avatar Nov 25 '19 15:11 iSnow

Does the path need to be to a JSON file with schema, or schema in a yaml file is good too?

krassowski avatar Jun 17 '21 11:06 krassowski

FIXED in https://github.com/frictionlessdata/specs/issues/645

roll avatar Jan 03 '24 11:01 roll