jsonschema icon indicating copy to clipboard operation
jsonschema copied to clipboard

Schema parsing - Permit opaque key-values in Property schema

Open robfig opened this issue 5 years ago • 10 comments

Currently, this schema is valid according to https://www.jsonschemavalidator.net/

{
    "$id": "https://www.github.com/schemas/robfig",
    "properties": {
        "name": {
            "type": "string",
            "opaque": "something"
        }
    }
}

But it can not be unmarshaled, receiving this error:

error unmarshaling properties from json: error unmarshaling opaque from json: json: cannot unmarshal string into Go value of type jsonschema._schema

That is due to this code, which assumes any extra key must be a Definition:

	for prop, rawmsg := range valprops {
		var val Validator
		if mk, ok := DefaultValidators[prop]; ok {
			val = mk()
		} else {
			switch prop {
			// skip any already-parsed props
			case "$schema", "$id", "title", "description", "default", "examples", "readOnly", "writeOnly", "$comment", "$ref", "definitions", "format":
				continue
			default:
				// assume non-specified props are "extra definitions"
				if sch.extraDefinitions == nil {
					sch.extraDefinitions = Definitions{}
				}
				s := new(Schema)
				if err := json.Unmarshal(rawmsg, s); err != nil {
					return fmt.Errorf("error unmarshaling %s from json: %s", prop, err.Error())
				}
				sch.extraDefinitions[prop] = s
				continue
			}
		}
		if err := json.Unmarshal(rawmsg, val); err != nil {
			return fmt.Errorf("error unmarshaling %s from json: %s", prop, err.Error())
		}
		sch.Validators[prop] = val
	}

The relevant portion of the spec says that definitions must be collected under a "definitions" key https://json-schema.org/latest/json-schema-validation.html#rfc.section.9

So, it seems like the the code should be updated to pull out just the "definitions" key, and then apply the "parse the values of each key under that as a Schema".

If you agree, I will add unit tests for this case and make a pull request.

robfig avatar Apr 04 '19 22:04 robfig

@robfig That's not quite what that part of the spec means. The definitions key is the standard place to put definitions, but you are not forced to do so.

The problem here is that the spec is clear that implementations SHOULD ignore keywords they do not support.

I had thought that there was a MUST NOT error, but I guess not. I might add that in the next draft. I think it is SHOULD be ignored rather than MUST because it would be perfectly valid for an implementation to treat such keywords as extra data like title or description. Possibly unexpected or unwanted, but it wouldn't break things.

handrews avatar Apr 04 '19 23:04 handrews

So you're saying that additional definitions are not required to be collected under a "definitions" keyword, but the example from the link:

{
    "type": "array",
    "items": { "$ref": "#/definitions/positiveInteger" },
    "definitions": {
        "positiveInteger": {
            "type": "integer",
            "exclusiveMinimum": 0
        }
    }
}

could equivalently be

{
    "type": "array",
    "items": { "$ref": "#/definitions/positiveInteger" },
    "positiveInteger": {
        "type": "integer",
        "exclusiveMinimum": 0
    }
}

?

I'm having trouble reconciling that meaning with the linked section on json-schema.org, but it seems like you know what you're talking about.

In that case, I suppose the right way to allow opaque key/values at this level in the document would be to continue to attempt to parse the values as a definition, but ignore any errors that result instead of treating them as fatal.

How does that sound?

robfig avatar Apr 05 '19 14:04 robfig

it seems like you know what you're talking about.

Well, I'm the "handrews" in draft-handrews-json-schema-01 😝

The example you gave saying that those things would be equivalent is not correct, though.

$ref takes a URI, and URIs like #/definitions/positiveInteger are JSON Pointers. So that form of $ref must match the JSON object structure. $ref also MUST point to a JSON Schema (and not, say, a string or number or array, or an object with some other semantics).

I'm not sure why you are focused on definitions here, to be honest. Adding a custom extension keyword such as opaque really has nothing to do with where definitions live or how you reference them.

handrews avatar Apr 05 '19 18:04 handrews

Thanks for filing this @robfig. Totally agreed the error you've encountered is a bug that needs fixing.

I think @handrews has articulated the we should try to tackle here:

The problem here is that the spec is clear that implementations SHOULD ignore keywords they do not support.

I think you're totally right to point to the poor assumption this lib currently makes in assuming everything is a definition. We should be doing what the spec says and ignore unsupported keywords.

However, one behavior I'd like to keep is not just ignoring unrecognized keywords, but also giving them back if a user were to ever json.Marshal a jsonschema.RootSchema. Put another way, the de-serialization process shouldn't be semantically lossy. We would obviously clobber whitesapce from input data, but values encoded should also be decoded.

Thanks so much for your PR, the unit test you've provided is great, but I agree the solution needs to be a little more robust. I think we should refactor the lib to carry deserialize unknown values to opaque values instead of "extra definitions", and re-add any opaque values on json.Marshal. Sound good?

b5 avatar Apr 05 '19 19:04 b5

The mismatch in the second example between $ref and the location of the definition was unintentional. I meant it to be this:

{
    "type": "array",
    "items": { "$ref": "#/positiveInteger" },
    "positiveInteger": {
        "type": "integer",
        "exclusiveMinimum": 0
    }
}

I'm focused on definitions here because that's what this implementation is treating my opaque keyword as, instead of ignoring it as I would prefer. In your last comment, you said that it was allowed to do so.. that might be true, but it's also allowed to not fail to parse a schema with an opaque keyword, and that's arguably the more helpful behavior.

robfig avatar Apr 05 '19 19:04 robfig

@robfig I'd be happy to work with you on the change, I can write up a rough outline of a PR we'd accept if you're ok with implementing the change

b5 avatar Apr 05 '19 19:04 b5

hahah looks like we came to the same conclusion @robfig 😄

b5 avatar Apr 05 '19 19:04 b5

Hey all, i was provided with a json schema from another customer which is relatively big and has severe typos in several locations, e.g."$comments". Those are running into the here mentioned definitions:

error unmarshaling properties from json: error unmarshaling $comments from json: json: cannot unmarshal string into Go value of type jsonschema._schema

The json schema should be fixed but to get something fast of the ground it would be handy to have some kind of relaxed syntax check option.

Thanks for your work and the jsonschema lib in any case!

eqinox76 avatar Nov 19 '19 07:11 eqinox76

I'm also blocked on this issue for a project. Can I help move things forward somehow?

However, one behavior I'd like to keep is not just ignoring unrecognized keywords, but also giving them back if a user were to ever json.Marshal a jsonschema.RootSchema. Put another way, the de-serialization process shouldn't be semantically lossy.

Is it reasonable for a user to expect that json.Marshaling a parsed jsonschema.RootSchema would be lossless? Is there a well-defined use case that would require this behavior? The standard library, for example, offers no such guarantees, nor provides easy ways for users to implement it themselves.

peterbourgon avatar Dec 16 '19 16:12 peterbourgon

@b5 for some reason I didn't notice this part before:

However, one behavior I'd like to keep is not just ignoring unrecognized keywords, but also giving them back if a user were to ever json.Marshal a jsonschema.RootSchema. Put another way, the de-serialization process shouldn't be semantically lossy. We would obviously clobber whitesapce from input data, but values encoded should also be decoded.

Now that draft 2019-09 (formerly known as draft-08 but we switched to dates for meta-schema versioning) is out, you should take a look at the recommended output formats, specifically for collecting annotations (the output format is also for error reporting, and I think most of the examples focus on that- we should add more annotation examples).

We considered saying that unknown keywords SHOULD be collected as annotations, but annotation-collecting is an optional feature (we didn't want to suddenly make all existing validators non-conforming by requiring a whole new concept to be implemented). But that's basically what it sounds like you want to do, so I'd recommend taking a look at how 2019-09 specifies that should work.

handrews avatar Dec 16 '19 17:12 handrews